(?{ code }) block works fine in child rule but not in parent

am 10.09.2007 20:02:30 von Clint Olsen

I'm trying to write a lexical analyzer in Perl, and I was trying to take
advantage of the above construct to simplify and be able to reuse rules as
much as possible. My script looks like:

------

#!/usr/bin/perl

use strict;
use warnings;

my $foo;

my $escaped_identifier = qr/\\(\S+)(?=\s)/o;
my $simple_identifier = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/o;
my $identifier = qr/ ($simple_identifier
| $escaped_identifier)
(?{ $foo = $^N })
/xo;

$_ = '\\blah!$@%$!^ ';

if (s/$identifier//o) {
print "Match is $foo\n";
}

$_ = "identifier";

if (s/$identifier//o) {
print "Match is $foo\n";
}

------

However, in this config, I get the runtime error:

Eval-group not allowed at runtime, use re 'eval' in regex m/ ((?-xism:([a-zA-Z_][a-zA-Z0-9_\$]*))
| (?-xism:\\(\S+)(?=\s)))
(?{ $foo =.../ at /tmp/re line 10.

But when I had it written like:

------

#!/usr/bin/perl

use strict;
use warnings;

my $foo;

my $escaped_identifier = qr/\\(\S+)(?=\s)(?{ $foo = $^N })/o;
my $simple_identifier = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)(?{ $foo = $^N })/o;
my $identifier = qr/ $simple_identifier
| $escaped_identifier
/xo;

$_ = '\\blah!$@%$!^ ';

if (s/$identifier//o) {
print "Match is $foo\n";
}

$_ = "identifier";

if (s/$identifier//o) {
print "Match is $foo\n";
}

------

I get:

Match is blah!$@%$!^
Match is identifier

As with anything with Perl, the answer is buried in the millions of lines
of source code they call a language. Any cluepons would be greatly
appreciated.

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 20:21:12 von xhoster

Clint Olsen wrote:
>
> However, in this config, I get the runtime error:
>
> Eval-group not allowed at runtime, use re 'eval' in regex m/
> ((?-xism:([a-zA-Z_][a-zA-Z0-9_\$]*))
> | (?-xism:\\(\S+)(?=\s)))
> (?{ $foo =.../ at /tmp/re line 10.

Did you try adding a "use re 'eval';" like the error message said?

....

> As with anything with Perl, the answer is buried in the millions of lines
> of source code they call a language.

Apparently the problem is that you think the answer is to start out by
reading the source code rather than the documentation.

The behavior you describe is documented in both perldoc perlre and perldoc
re.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 20:35:02 von Clint Olsen

On 2007-09-10, xhoster@gmail.com wrote:
> Apparently the problem is that you think the answer is to start out by
> reading the source code rather than the documentation.
>
> The behavior you describe is documented in both perldoc perlre and
> perldoc re.

Yes, I read that section, but I'm not relying on any runtime interpolation
to get my work done (or did I misread something?). I want to avoid using
switches that are 'perilous' when it isn't required. There isn't a way to
convince the interpreter that I'm not relying on this behavior?

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 21:50:14 von Ben Morrow

Quoth Clint Olsen :
> On 2007-09-10, xhoster@gmail.com wrote:
> > Apparently the problem is that you think the answer is to start out by
> > reading the source code rather than the documentation.
> >
> > The behavior you describe is documented in both perldoc perlre and
> > perldoc re.
>
> Yes, I read that section, but I'm not relying on any runtime interpolation
> to get my work done (or did I misread something?).

But you are. From your original post:

| my $escaped_identifier = qr/\\(\S+)(?=\s)/o;
| my $simple_identifier = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/o;
| my $identifier = qr/ ($simple_identifier
| | $escaped_identifier)
| (?{ $foo = $^N })
| /xo;

This last regex contains both interpolation and code escapes which do
not come from a qr//. This is what is forbidden unless you have re
'eval'. The documentation does say this, but it is not entirely clear.
(Patches welcome :) ).

The solution (if you want to avoid re 'eval', which is a good idea) is
to precompile the code assertion into a qr// as well:

my $escape_id = qr/\\(\S+)(?=\s)/;
my $simple_id = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/;
my $capture = qr/(?{ $foo = $^N })/;
my $id = qr/ ($simple_id | $escape_id) $capture /x;

Note that all of your '/o's are redundant, as you are using qr//.
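
A minimal, runnable sketch of the approach above, reusing the two test strings from the original post. One deliberate change, which is an assumption about what the original poster wants rather than part of the suggestion itself: the outer group is written as non-capturing, (?: ... ), so that $^N refers to the inner capture group. With a capturing outer group, $^N should instead hold the whole alternative, including the leading backslash of an escaped identifier.

```perl
#!/usr/bin/perl

use strict;
use warnings;

my $foo;

my $escape_id = qr/\\(\S+)(?=\s)/;
my $simple_id = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/;

# The code block is compiled once, here, inside its own qr//.
my $capture   = qr/(?{ $foo = $^N })/;

# Only qr// objects are interpolated; the outer group is non-capturing,
# so $^N is the most recently closed inner group.
my $id        = qr/ (?: $simple_id | $escape_id ) $capture /x;

my @seen;
for my $input ('\\blah!$@%$!^ ', 'identifier') {
    my $text = $input;
    undef $foo;
    push @seen, $foo if $text =~ s/$id//;
}

print "Match is $_\n" for @seen;
```

No "use re 'eval'" appears anywhere in the script, because every code block reaches the final pattern through a qr//.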

| As with anything with Perl, the answer is buried in the millions of
| lines of source code they call a language.

FWIW, insulting Perl in a Perl newsgroup is not likely to be a way to
get useful advice...

Ben

Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 21:50:33 von Abigail

_
Clint Olsen (clint.olsen@gmail.com) wrote on VCXXIII September MCMXCIII
in :
&& On 2007-09-10, xhoster@gmail.com wrote:
&& > Apparently the problem is that you think the answer is to start out by
&& > reading the source code rather than the documentation.
&& >
&& > The behavior you describe is documented in both perldoc perlre and
&& > perldoc re.
&&
&& Yes, I read that section, but I'm not relying on any runtime interpolation
&& to get my work done (or did I misread something?). I want to avoid using
&& switches that are 'perilous' when it isn't required. There isn't a way to
&& convince the interpreter that I'm not relying on this behavior?

Yes, it's called "use re 'eval'".
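
For readers worried about the "perilous" label: the pragma is lexically scoped, so it can be confined to the one block that needs it. The sketch below is illustrative only; the variable names ($pattern_text, $captured) are made up, and the pattern string stands in for text that was assembled at run time.

```perl
use strict;
use warnings;

our $captured;

# A plain string (not a qr//) that happens to contain a code block.
my $pattern_text = '(\w+)(?{ $main::captured = $^N })';

# Outside the pragma's scope, compiling the interpolated string is refused.
my $refused = !eval { 'hello world' =~ /$pattern_text/; 1 };
my $error   = $refused ? $@ : '';

{
    # The pragma applies only until the end of this block.
    use re 'eval';
    'hello world' =~ /$pattern_text/;
}

print "refused without pragma: ", ($refused ? "yes" : "no"), "\n";
print "captured with pragma: $captured\n";
```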

Abigail
--
perl5.004 -wMMath::BigInt -e'$^V=Math::BigInt->new(qq]$^F$^W783$[$%9889$^F47]
..qq]$|88768$^W596577669$%$^W5$^F3364$[$^W$^F$|838747$[88897 39$%$|$^F673$%$^W]
..qq]98$^F76777$=56]);$^U=substr($]=>$|=>5)*(q.25..($^W=@^V) )=>do{print+chr$^V
%$^U;$^V/=$^U}while$^V!=$^W'

Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 22:30:51 von Clint Olsen

On 2007-09-10, Ben Morrow wrote:
> This last regex contains both interpolation and code escapes which do not
> come from a qr//. This is what is forbidden unless you have re 'eval'.
> The documentation does say this, but it is not entirely clear. (Patches
> welcome :) ).
>
> The solution (if you want to avoid re 'eval', which is a good idea) is
> to precompile the code assertion into a qr// as well:
>
> my $escape_id = qr/\\(\S+)(?=\s)/;
> my $simple_id = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/;
> my $capture = qr/(?{ $foo = $^N })/;
> my $id = qr/ ($simple_id | $escape_id) $capture /x;
>
> Note that all of your '/o's are redundant, as you are using qr//.

Yeah, this is an old habit. Thanks for the reminder. What I'd like to
do is to be able to use combinations of $^N and $^R to be able to roll up
lexical subexpressions in a nice way so that I don't have to rewrite stuff
and also be able to have an entire RE that describes every lexical
possibility ala what lex/flex would do if you were constructing a lexical
analyzer. So, that would also mean using the minimum required capture
buffers to extract the tokens and any state necessary.

> FWIW, insulting Perl in a Perl newsgroup is not likely to be a way to get
> useful advice...

Point taken. However, I often subscribe to the notion that I both love
and hate Perl simultaneously. It's really awesome when it works well, and
a sonofabitch to debug otherwise :)

Thanks to all of you for your help and suggestions.

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 22:45:14 von xhoster

Ben Morrow wrote:
> Quoth Clint Olsen :
> > On 2007-09-10, xhoster@gmail.com wrote:
> > > Apparently the problem is that you think the answer is to start out
> > > by reading the source code rather than the documentation.
> > >
> > > The behavior you describe is documented in both perldoc perlre and
> > > perldoc re.
> >
> > Yes, I read that section, but I'm not relying on any runtime
> > interpolation to get my work done (or did I misread something?).
>
> But you are. From your original post:
>
> | my $escaped_identifier = qr/\\(\S+)(?=\s)/o;
> | my $simple_identifier = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/o;
> | my $identifier = qr/ ($simple_identifier
> | | $escaped_identifier)
> | (?{ $foo = $^N })
> | /xo;
>
> This last regex contains both interpolation and code escapes which do
> not come from a qr//. This is what is forbidden unless you have re
> 'eval'.

Does this requirement make sense? Why does it matter if some part
of the regex which *isn't* the code part comes from an interpolation?
Was this just easier to implement than whatever makes more sense
would be?

> The documentation does say this, but it is not entirely clear.

I would argue that it is entirely anti-clear.

For the purpose of this pragma, interpolation of precompiled
regular expressions (i.e., the result of "qr//") is not
considered variable interpolation.

For reasons of security, this construct is forbidden if the
regular expression involves run-time interpolation of
variables, unless the perilous "use re 'eval'" pragma has been
used (see re), or the variables contain results of "qr//"
operator (see "qr/STRING/imosx" in perlop).

But all of the interpolated variables in his example do contain the results
of qr//.

It seems clear but the apparently clear meaning is not correct.

> (Patches welcome :) ).

I'm no longer confident that I know what it does, so I don't know
what it should say.

For reasons of security, this construct is forbidden if the
regular expression involves run-time interpolation of
variables, unless the perilous "use re 'eval'" pragma has been
used (see re), even if those variables are results of "qr//".
However, variables containing qr// compiled forms of this
construct can themselves be interpolated into other
regular expressions which involve other interpolations.

Xho


Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 22:53:28 von xhoster

Clint Olsen wrote:
> On 2007-09-10, xhoster@gmail.com wrote:
> > Apparently the problem is that you think the answer is to start out by
> > reading the source code rather than the documentation.
> >
> > The behavior you describe is documented in both perldoc perlre and
> > perldoc re.
>
> Yes, I read that section, but I'm not relying on any runtime
> interpolation to get my work done (or did I misread something?).

Ah, now I see. Now I prefer your reading of the docs to my reading of the
docs (or I would if it weren't for the sad fact that the worse reading
seems to be the correct one).

> I want
> to avoid using switches that are 'perilous' when it isn't required.

Maybe I'm missing something here, but I would argue that if you are running
in a hostile environment, it isn't enough to refuse to use re 'eval', you
should also run under taint. And once you use taint appropriately, I don't
see why use re 'eval' would be perilous.

Xho


Re: (?{ code }) block works fine in child rule but not in parent

am 10.09.2007 23:52:33 von Clint Olsen

On 2007-09-10, xhoster@gmail.com wrote:
> Ah, now I see. Now I prefer your reading of the docs to my reading of
> the docs (or I would if it weren't for the sad fact that the worse
> reading seems to be the correct one).
>
> Maybe I'm missing something here, but I would argue that if you are
> running in a hostile environment, it isn't enough to refuse to use re
> 'eval', you should also run under taint. And once you use taint
> appropriately, I don't see why use re 'eval' would be perilous.

Well, what I'm writing is just a parser (lexer for this part). It's not
technically a 'hostile' environment in the sense of worrying about someone
trying to execute rogue code, but I just hesitate when I see warnings like
this.

I don't think taint applies in my application. At least intuitively I
don't believe it does. The example uses $^X which isn't immediately
obvious why $x is tainted. I definitely want to extract substrings from my
substitutions since this is the lexical analysis phase of the parsing. I'm
only using s/// because it performs much much better than m/// with the \G
assertion.

Thanks,

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 01:57:44 von Ben Morrow

Quoth xhoster@gmail.com:
> Ben Morrow wrote:
> >
> > This last regex contains both interpolation and code escapes which do
> > not come from a qr//. This is what is forbidden unless you have re
> > 'eval'.
>
> Does this requirement make sense? Why does it matter if some part
> of the regex which *isn't* the code part comes from an interpolation?

It doesn't... :)

> Was this just easier to implement than whatever makes more sense
> would be?

AFAICS (and the guts of the regex engine are *very* hard to follow) it
is a consequence of Perl's regexen doing two-fold interpolation. First
the qr is stringified and interpolated, and then the result is compiled.
Perl has no way of knowing which bits came from where. (Any ideas anyone
may have had about qrs being more efficient when interpolated into other
regexen later are, unfortunately, false.[1])

However, when a qr is interpolated, it makes a record of how many eval
groups it contained; then when the regex engine compiles an eval group,
it checks to see whether it has met more eval groups so far than have
been interpolated from qrs; if so, it throws the 'Eval-group not
allowed' error. This is, of course, horribly crude, but it's hard to see
what else could be done without completely re-working the way the regex
engine operates.

[1] Although this may change in 5.10. I understand a lot of work has
gone into the regex engine; in part making qrs re-use their compiled
form more often. I'm afraid I don't know the details...

> > The documentation does say this, but it is not entirely clear.
>
> I would argue that it is entirely anti-clear.

Heh. Yes, I agree. That was an understatement... :)

> > (Patches welcome :) ).
>
> I'm no longer confident that I know what it does, so I don't know
> what it should say.
>
> For reasons of security, this construct is forbidden if the
> regular expression involves run-time interpolation of
> variables, unless the perilous "use re 'eval'" pragma has been
> used (see re), even if those variables are results of "qr//".
> However, variables containing qr// compiled forms of this
> construct can themselves be interpolated into other
> regular expressions which involve other interpolations.

For reasons of security, this construct is forbidden if the regular
expression contains variable interpolations, unless it results from
the interpolation of a C<qr//>, or C<use re 'eval'> is in effect.

Ben

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 03:05:49 von Ilya Zakharevich

[A complimentary Cc of this posting was sent to

], who wrote in article <20070910164516.841$ux@newsreader.com>:
> > This last regex contains both interpolation and code escapes which do
> > not come from a qr//. This is what is forbidden unless you have re
> > 'eval'.
>
> Does this requirement make sense? Why does it matter if some part
> of the regex which *isn't* the code part comes from an interpolation?
> Was this just easier to implement than whatever makes more sense
> would be?

Right. Basically, it was "either I make this feature secure quick, or
it won't make it into v5.6". The "proper" solution would mean a MAJOR
rehaul of how REx engine interacts with the lexer.

Hope this helps,
Ilya

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 07:37:24 von Martijn Lievaart

On Mon, 10 Sep 2007 15:30:51 -0500, Clint Olsen wrote:

> Yeah, this is an old habit. Thanks for the reminder. What I'd like
> to do is to be able to use combinations of $^N and $^R to be able to
> roll up lexical subexpressions in a nice way so that I don't have to
> rewrite stuff and also be able to have an entire RE that describes every
> lexical possibility ala what lex/flex would do if you were constructing
> a lexical analyzer. So, that would also mean using the minimum required
> capture buffers to extract the tokens and any state necessary.

Did you look at Parser::Decent? Seems like the perfect tool for this job.

M4

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 08:03:22 von Clint Olsen

On 2007-09-11, Martijn Lievaart wrote:
> Did you look at Parser::Decent? Seems like the perfect tool for this job.

I've had really great luck with Parse::Yapp for that. Are you speaking of
Damian Conway's Parse::RecDescent? The problem with recursive descent
parsing is that it's not as expressive as an LALR parser. So, what's left
is writing the lexer part.

Thanks,

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 08:25:24 von Brian Helterlilne

Martijn Lievaart wrote:
> Did you look at Parser::Decent? Seems like the perfect tool for this job.

Parse::RecDescent

--
brian

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 10:59:23 von Abigail

_
xhoster@gmail.com (xhoster@gmail.com) wrote on VCXXIII September MCMXCIII
in :
{} Clint Olsen wrote:
{} > On 2007-09-10, xhoster@gmail.com wrote:
{} > > Apparently the problem is that you think the answer is to start out by
{} > > reading the source code rather than the documentation.
{} > >
{} > > The behavior you describe is documented in both perldoc perlre and
{} > > perldoc re.
{} >
{} > Yes, I read that section, but I'm not relying on any runtime
{} > interpolation to get my work done (or did I misread something?).
{}
{} Ah, now I see. Now I prefer your reading of the docs to my reading of the
{} docs (or I would if it weren't for the sad fact that the worse reading
{} seems to be the correct one).
{}
{} > I want
{} > to avoid using switches that are 'perilous' when it isn't required.
{}
{} Maybe I'm missing something here, but I would argue that if you are running
{} in a hostile environment, it isn't enough to refuse to use re 'eval', you
{} should also run under taint. And once you use taint appropriately, I don't
{} see why use re 'eval' would be perilous.

Well, for starters, /$tainted_variable/ doesn't trigger an exception under -T.

The problem is code that was written before (?{ .. }) existed. Before
(?{ .. }) or (??{ .. }) existed, /$got_this_from_the_environment/ couldn't
do anything more harmful than eat up resources.

So, pre 5.6 code that used /$got_this_from_the_environment/ may not
even use -T, and still be secure. Those programs would suddenly become
a security hole if the perl on the system was upgraded. Also, making
/$got_this_from_the_environment/ throw an exception under -T would have
caused programs that ran under -T already throw a fit if the perl was
updated. Even if there was no potential security issue.

Hence "use re 'eval'" was born.

The regexp engine of 5.10 will have many new features; some of the current
usages of (?{ .. }) can be replaced by the new features. I do not know what
the intended usage of (?{ .. }) in the example given by the OP is, but it
might very well be that the same could have been done with named capture
buffers and %+ or %- in 5.10.
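
A minimal sketch of the named-capture alternative, assuming Perl 5.10 or later. The group name "id" is an arbitrary choice for illustration. Both alternatives reuse the same name, and $+{id} yields whichever of the two actually took part in the match, so no code block is needed just to find out what was captured.

```perl
use strict;
use warnings;
use 5.010;

# Both rules capture into a group named "id" (the name is illustrative).
my $escaped_identifier = qr/\\(?<id>\S+)(?=\s)/;
my $simple_identifier  = qr/(?<id>[a-zA-Z_][a-zA-Z0-9_\$]*)/;
my $identifier         = qr/$simple_identifier|$escaped_identifier/;

my @found;
for my $input ('\\blah!$@%$!^ ', 'identifier') {
    my $text = $input;
    # %+ holds the named captures of the last successful match.
    push @found, $+{id} if $text =~ s/$identifier//;
}

print "Match is $_\n" for @found;
```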

Abigail
--
A perl rose: perl -e '@}>-`-,-`-%-'

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 15:14:49 von Ben Morrow

Quoth abigail@abigail.be:
> _
> xhoster@gmail.com (xhoster@gmail.com) wrote on VCXXIII September MCMXCIII
> in :
> {} Clint Olsen wrote:
> {} > On 2007-09-10, xhoster@gmail.com wrote:
> {} > > Apparently the problem is that you think the answer is to start out by
> {} > > reading the source code rather than the documentation.
> {} > >
> {} > > The behavior you describe is documented in both perldoc perlre and
> {} > > perldoc re.
> {} >
> {} > Yes, I read that section, but I'm not relying on any runtime
> {} > interpolation to get my work done (or did I misread something?).
> {}
> {} Ah, now I see. Now I prefer your reading of the docs to my reading of the
> {} docs (or I would if it weren't for the sad fact that the worse reading
> {} seems to be the correct one).
> {}
> {} > I want
> {} > to avoid using switches that are 'perilous' when it isn't required.
> {}
> {} Maybe I'm missing something here, but I would argue that if you are running
> {} in a hostile environment, it isn't enough to refuse to use re 'eval', you
> {} should also run under taint. And once you use taint appropriately, I don't
> {} see why use re 'eval' would be perilous.
>
> Well, for starters, /$tainted_variable/ doesn't trigger an exception under -T.

It does if the result contains any code assertions:

% perl -T -Mre=eval -e'/$^X (?{1;})/'
Eval-group in insecure regular expression in regex m/perl (?{1;})/
at -e line 1.

so, *if* you are tainting, re 'eval' should be safe.

> The regexp engine of 5.10 will have many new features; some of the current
> usages of (?{ .. }) can be replaced by the new features. I do not know what
> the intended usage of (?{ .. }) in the example given by the OP is, but it
> might very well be that the same could have been done with named capture
> buffers and %+ or %- in 5.10.

I think this is very likely.

Ben

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 18:59:24 von xhoster

Ben Morrow wrote:
> Quoth xhoster@gmail.com:
> > Ben Morrow wrote:
> > >
> > > This last regex contains both interpolation and code escapes which do
> > > not come from a qr//. This is what is forbidden unless you have re
> > > 'eval'.
> >
> > Does this requirement make sense? Why does it matter if some part
> > of the regex which *isn't* the code part comes from an interpolation?
>
> It doesn't... :)
>
> > Was this just easier to implement than whatever makes more sense
> > would be?
>
> AFAICS (and the guts of the regex engine are *very* hard to follow) it
> is a consequence of Perl's regexen doing two-fold interpolation. First
> the qr is stringified and interpolated, and then the result is compiled.
> Perl has no way of knowing which bits came from where. (Any ideas anyone
> may have had about qrs being more efficient when interpolated into other
> regexen later are, unfortunately, false.[1])

Was there ever a time when qr// was the only way to have different parts
of the pattern have different "global" options, or was qr// introduced
after (?-xism:...) was introduced?

>
> However, when a qr is interpolated, it makes a record of how many eval
> groups it contained; then when the regex engine compiles an eval group,
> it checks to see whether it has met more eval groups so far than have
> been interpolated from qrs; if so, it throws the 'Eval-group not
> allowed' error.

If the construct has no interpolations at all, yet does have eval
groups, it will encounter more eval groups than have been interpolated
from qr//. Is it the IN_PERL_RUNTIME that handles that part? And
which of these variables' behaviors is changed by use re 'eval'?

    else {                      /* First pass */
        if (PL_reginterp_cnt < ++RExC_seen_evals
            && IN_PERL_RUNTIME)
            /* No compiled RE interpolated, has runtime
               components ===> unsafe.  */
            FAIL("Eval-group not allowed at runtime, use re 'eval'");
        if (PL_tainting && PL_tainted)
            FAIL("Eval-group in insecure regular expression");
    }

> This is, of course, horribly crude, but it's hard to see
> what else could be done without completely re-working the way the regex
> engine operates.

I think you could just add another flag:
if (PL_reginterp_cnt < ++RExC_seen_evals
&& IN_PERL_RUNTIME && AT_LEAST_ONE_NON_QR_INTERPOLATION)

Of course, assuring the flag is set appropriately would be the hard part,
surely beyond my competence. And if 5.10 is going to be quite different,
then I guess there is no point.

> >
> > I'm no longer confident that I know what it does, so I don't know
> > what it should say.
> >
> > For reasons of security, this construct is forbidden if the
> > regular expression involves run-time interpolation of
> > variables, unless the perilous "use re 'eval'" pragma has been
> > used (see re), even if those variables are results of "qr//".
> > However, variables containing qr// compiled forms of
> > this construct can themselves be interpolated into
> > other regular expressions which involve other
> > interpolations.
>
> For reasons of security, this construct is forbidden if the regular
> expression contains variable interpolations, unless it results from
> the interpolation of a C<qr//>, or C<use re 'eval'> is in effect.

That is still confusing to me. I keep reading the antecedent of
the pronoun in "unless it results from" as being the interpolation, not the
code construct. I (now) know that this is the incorrect way to read it,
yet still that is how I read it when I approach it from the viewpoint of
someone who doesn't already know the answer.

Xho


Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 19:01:59 von Clint Olsen

On 2007-09-11, Abigail wrote:
> The regexp engine of 5.10 will have many new features; some of the
> current usages of (?{ .. }) can be replaced by the new features. I do not
> know what the intended usage of (?{ .. }) in the example given by the OP
> is, but it might very well be that the same could have been done with
> named capture buffers and %+ or %- in 5.10.

The intent of using the (?{ code }) blocks is so that I can construct one
massive RE for the entire lexical specification of the language ala what
lex/flex would do when you write a spec file. So, at the end of a match
(and we /should/ match under normal circumstances), I have the token type
and any special processing all taken care of by the time I fall into the
code block.
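
A small sketch of that idea, under stated assumptions: the token names (IDENT, NUMBER), the $number rule, and the tokenize helper are invented for illustration and are not part of the real lexer. Each rule carries its own code block inside its qr//, so the combined pattern needs no "use re 'eval'", and by the time a match succeeds the token type and text are already set.

```perl
use strict;
use warnings;

my ($type, $text);

# Each rule records its own token type and text when it matches.
my $identifier = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)(?{ ($type, $text) = ('IDENT',  $^N) })/;
my $number     = qr/([0-9]+)(?{ ($type, $text) = ('NUMBER', $^N) })/;

# One combined pattern: optional leading whitespace, then any rule.
my $token      = qr/\s*(?:$identifier|$number)/;

# Hypothetical driver: strip tokens off the front of the input one by one.
sub tokenize {
    my ($input) = @_;
    my @tokens;
    while ($input =~ s/^$token//) {
        push @tokens, [$type, $text];
    }
    return @tokens;
}

printf "%-6s %s\n", @$_ for tokenize('foo 42 bar_1');
```

The loop stops at the first position where no rule matches; a real lexer would report that as an error rather than silently returning.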

Thanks,

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 19:47:01 von Martijn Lievaart

On Mon, 10 Sep 2007 23:25:24 -0700, Brian Helterlilne wrote:

> Martijn Lievaart wrote:
>> Did you look at Parser::Decent? Seems like the perfect tool for this
>> job.
>
> Parse::RecDescent

Sorry, thx for the correction.

M4

Re: Do my variables scoped to a subroutine get reconstructed every time?

am 11.09.2007 20:14:08 von Clint Olsen

In the spirit of this post, I was curious as to whether the my variables
that are the result of a qr/// are reconstructed every time a subroutine is
executed. The reason is that some of my code blocks in the qr/// I want to
affect lexically scoped variables in the sub.

use strict;
use warnings;

my $simple_identifier = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/;
my $escaped_identifier = qr/\\(\S+)(?=\s)(?{ $x += 2 })/;
my $identifier = qr/$simple_identifier|$escaped_identifier/;

#
# scan
#
# Given a string, apply the pattern and return a token ID and lvalue data.
#
sub scan {
my ($data,$y,$x) = @_;

if ($$data =~ s/
$identifier
//x) {
$x += length $^N;
}
}

print "Hello\n";

This code won't compile because $escaped_identifier references $x. I have
these outside the scan scope for fear that calling scan millions of times
could cause performance issues. If it turns out that the interpreter can
determine these are 'constant' expressions and calculates them only once,
then great. I'll put it back inside the subroutine.
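
One way to make the snippet compile while keeping the patterns outside the subroutine is to declare the shared variable before the qr// that refers to it. This is only a sketch under that assumption: it shares a single file-scoped $x instead of a per-call lexical, uses length in place of len, and does not settle whether a qr// placed inside the sub would be rebuilt on every call.

```perl
use strict;
use warnings;

# Declared before the patterns so the code block can see it.
my $x = 0;

my $simple_identifier  = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/;
my $escaped_identifier = qr/\\(\S+)(?=\s)(?{ $x += 2 })/;
my $identifier         = qr/$simple_identifier|$escaped_identifier/;

#
# scan
#
# Given a reference to a string, strip the first identifier and return it.
#
sub scan {
    my ($data) = @_;

    if ($$data =~ s/$identifier//) {
        my $token = $^N;        # text of the most recently closed group
        $x += length $token;
        return $token;
    }
    return;
}

my $input = '\\foo bar';
my $first = scan(\$input);
print "first token: $first, x is now $x\n";
```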

Thanks,

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 11.09.2007 22:11:57 von Ilya Zakharevich

[A complimentary Cc of this posting was sent to

], who wrote in article <20070911125925.933$Iq@newsreader.com>:
> Was there ever a time when qr// was the only way to have different parts
> of the pattern have different "global" options, or was qr// introduced
> after (?-xism:...) was introduced?

If you want to know the history: the original form of qr// was called

study m//;

;-) (I was too lazy to add a new syntax to the lexer). In
this form, it existed for quite some time (several months?); almost
immediately, the understanding emerged that the effects (but not
speed) were similar to "a new quoting operator". (I do not know how
soon the interpolation of the result into strings appeared; the main
initial purpose was a speedup due to compile-once use-often.)

After it was redone as qr// (not by me), it may have existed (for
several days?) without support of (?-xism:). Again, the understanding
appeared that its use without (?-xism:) is too confusing; so this
feature was implemented.

Hope this helps,
Ilya

Re: (?{ code }) block works fine in child rule but not in parent

am 12.09.2007 03:31:22 von Ben Morrow

Quoth xhoster@gmail.com:
>
> Was there ever a time when qr// was the only way to have different parts
> of the pattern have different "global" options, or was qr// introduced
> after (?-xism:...) was introduced?

No, that doesn't work.... a qr// is stringified before it's
interpolated. It *cannot* preserve its options unless it has the
(?xism-:) syntax to preserve them into; since Ilya says that syntax was
introduced for the sake of qr//s, they must have originally interpolated
with different semantics from when they were compiled.

> > However, when a qr is interpolated, it makes a record of how many eval
> > groups it contained; then when the regex engine compiles an eval group,
> > it checks to see whether it has met more eval groups so far than have
> > been interpolated from qrs; if so, it throws the 'Eval-group not
> > allowed' error.
>
> If the construct has no interpolations at all, yet does have eval
> groups, it will encounter more eval groups than have been interpolated
> from qr//. Is it the IN_PERL_RUNTIME that handles that part?

Err... I think so. IN_PERL_RUNTIME is defined as (PL_curcop !=
&PL_compiling); that is, 'we are not currently in the process of
compiling something'. Since regexen without interpolations are compiled
during compile time (as an optimization, originally), they won't be
caught. Again, really rather crude... :)

> And which of these variables' behaviors is changed by use re 'eval'?

Compiling a regex expression calls Perl_pmruntime to compile the pattern
match. pmruntime sets OPf_SPECIAL on the pp_regcomp op if re 'eval' was
in effect (if HINT_RE_EVAL was in the hint bits). Then pp_regcomp sets
PL_reginterp_cnt to I32_MAX if OPf_SPECIAL was set. (This is all in
5.8.8, of course. I imagine the details, although not the outcome, will
be different in both 5.6 and 5.10.)
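
The counting rule described here can be written down as a toy model. This is plain Perl imitating the quoted C check, not the interpreter's actual code; the function name eval_group_error and its parameter names are invented for illustration.

```perl
use strict;
use warnings;
use constant I32_MAX => 2**31 - 1;

# Toy model of the check: returns the error text, or undef if allowed.
#   evals_from_qr    - eval groups that arrived via interpolated qr// objects
#   evals_in_pattern - eval groups in the final, stringified pattern
#   in_perl_runtime  - true if the pattern is compiled at run time
#   re_eval          - true if "use re 'eval'" is in effect
sub eval_group_error {
    my (%p) = @_;

    # With re 'eval' the allowance is effectively unlimited.
    my $reginterp_cnt = $p{re_eval} ? I32_MAX : $p{evals_from_qr};

    my $seen_evals = 0;
    for (1 .. $p{evals_in_pattern}) {
        return "Eval-group not allowed at runtime, use re 'eval'"
            if $reginterp_cnt < ++$seen_evals && $p{in_perl_runtime};
    }
    return;
}

my %cases = (
    'code block written in the parent pattern' =>
        { evals_from_qr => 0, evals_in_pattern => 1, in_perl_runtime => 1 },
    'code block precompiled into a qr//' =>
        { evals_from_qr => 1, evals_in_pattern => 1, in_perl_runtime => 1 },
    'literal pattern, no interpolation' =>
        { evals_from_qr => 0, evals_in_pattern => 1, in_perl_runtime => 0 },
    "parent pattern under use re 'eval'" =>
        { evals_from_qr => 0, evals_in_pattern => 1, in_perl_runtime => 1, re_eval => 1 },
);

for my $name (sort keys %cases) {
    my $error = eval_group_error(%{ $cases{$name} });
    printf "%-42s %s\n", $name, defined $error ? 'refused' : 'allowed';
}
```

The model also shows why the check is crude: it compares counts only, so it cannot tell which eval group came from which source.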

> I think you could just add another flag:
> if (PL_reginterp_cnt < ++RExC_seen_evals
> && IN_PERL_RUNTIME && AT_LEAST_ONE_NON_QR_INTERPOLATION)
>
> Of course, assuring the flag is set appropriately would be the hard part,
> surely beyond my competence.

The place to do it would be in pp_regcomp, in the /* multiple args:
concatenate them */ section. This is what does the actual interpolation
(it's effectively a join).

> And if 5.10 is going to be quite different, then I guess there is no
> point.

Heh... on checking, I find that I was wrong about the interpolation
behaviour changing. It was discussed on p5p, but (for the moment at
least) the interpolation of patterns like /foo\1/ seems to be too hard
to deal with...

> Ben Morrow wrote:
> >
> > For reasons of security, this construct is forbidden if the regular
> > expression contains variable interpolations, unless it results from
> > the interpolation of a C<qr//>, or C<use re 'eval'> is in effect.
>
> That is still confusing to me. I keep reading the antecedent of
> the pronoun in "unless it results from" as being the interpolation, not the
> code construct. I (now) know that this is the incorrect way to read it,
> yet still that is how I read it when I approach it from the viewpoint of
> someone who doesn't already know the answer.

Yes, I see what you mean. So, stop trying to fit so much in one sentence
:)

For reasons of security, this construct is forbidden if the regular
expression contains variable interpolations. To remove this
limitation, compile the code assertion into a C<qr//> first, or
C<use re 'eval'>.

Ben

Re: Do my variables scoped to a subroutine get reconstructed every time?

am 12.09.2007 03:33:44 von Ben Morrow

Quoth Clint Olsen:
> In the spirit of this post, I was curious as to whether the my variables
> that are the result of a qr/// are reconstructed every time a subroutine is
> executed.

You can determine exactly what the regex engine is doing (in more detail
than you'd like :) ) by running your script with -Mre=debug .
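
The same debugging output can also be switched on from inside the script with use re 'debug', which makes it easier to limit the noise to the one subroutine in question. A minimal sketch, assuming a hypothetical next_token() subroutine that is not from the original script:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical subroutine, for illustration only.
sub next_token {
    my ($text) = @_;

    # Regex debugging output goes to STDERR.
    use re 'debug';

    my $ident = qr/([a-zA-Z_][a-zA-Z0-9_\$]*)/;
    return $text =~ $ident ? $1 : undef;
}

# Call it several times and compare how often the debug output reports
# the pattern being compiled with how often it reports it being matched.
print next_token($_), "\n" for 'foo = 1', 'bar = 2', 'baz = 3';
```

The exact wording and volume of the output differ between perl versions, so treat it as a diagnostic aid and not as something to parse.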

Ben

Re: Do my variables scoped to a subroutine get reconstructed every time?

am 12.09.2007 05:00:11 von Clint Olsen

On 2007-09-12, Ben Morrow wrote:
> You can determine exactly what the regex engine is doing (in more detail
> than you'd like :) ) by running your script with -Mre=debug .

I will try that. With a trivial example, I did see a ton of output. I
can't say I understand much of it, but it's most certainly there.

Thanks,

-Clint

Re: (?{ code }) block works fine in child rule but not in parent

am 13.09.2007 07:02:05 von Ilya Zakharevich

[A complimentary Cc of this posting was sent to Ben Morrow], who wrote:
> > Was there ever a time when qr// was the only way to have different parts
> > of the pattern have different "global" options, or was qr// introduced
> > after (?-xism:...) was introduced?

> No, that doesn't work....

[Not parsable.]

> a qr// is stringified before it's interpolated.

It may not be interpolated at all, but used "as is". Then
the way it is stringified does not matter.

> It *cannot* preserve its options unless it has the
> (?xism-:) syntax to preserve them into; since Ilya says that syntax was
> introduced for the sake of qr//s, they must have originally interpolated
> with different semantics from when they were compiled.

Correct. I decided that it would take longer to write decent
documentation for this difference in behaviour, than to design and
implement "a sane behaviour". ;-)
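
The effect of that decision is easy to observe from Perl. In this small sketch (variable names invented for the example), a qr// compiled with /i keeps its case-insensitivity when interpolated, while the rest of the enclosing pattern stays case-sensitive:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $ci = qr/foo/i;

# The stringified form carries the flags inside the group; the exact
# spelling depends on the perl version.
print "stringified: $ci\n";

# Inside the interpolated piece, /i still applies ...
my $inner_insensitive = ("xFOObar" =~ /x${ci}bar/) ? 1 : 0;

# ... but the surrounding literal text is matched case-sensitively.
my $outer_sensitive = ("xFOOBAR" =~ /x${ci}bar/) ? 0 : 1;

print "inner part ignores case: $inner_insensitive\n";
print "outer part respects case: $outer_sensitive\n";
```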

Yours,
Ilya

Re: (?{ code }) block works fine in child rule but not in parent

am 13.09.2007 07:44:49 von Bart Lateur

Martijn Lievaart wrote:

>Did you look at Parse::RecDescent? Seems like the perfect tool for this job.

Except it's dead slow.

--
Bart.

Re: (?{ code }) block works fine in child rule but not in parent

am 13.09.2007 20:09:18 von Clint Olsen

On 2007-09-13, Bart Lateur wrote:
> Except it's dead slow.

It's probably dead slow because it lacks the ability to specify a separate
lexer, and unless you try to emulate the flex or re2c model of
scanning/buffering, it's going to suck. Lexing, not parsing, is the most
expensive part of the whole process, because the lexing phase must examine
every character of the input and classify it.

I've written one really big language translator in Perl, and I was able to
get very acceptable performance using Parse::Yapp and a hand-written lexer
using buffered read() calls (faster than $/ = undef). The hand-written
lexer part is the sore spot of the whole story since you cannot coordinate
pattern matching with buffer refills along the way. If you could somehow
force a buffer refill automatically when an EOB condition was detected, you
could very easily write high performance scanners in Perl.
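
To make the refill problem concrete, here is a rough sketch of the kind of loop involved. It is not the lexer described above: make_lexer() and its token rules are invented for this example. The awkward part is visible in the middle: a token that ends exactly at the end of the buffer may be incomplete, so the match has to be thrown away, the buffer topped up, and the match retried.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that yields one token per call,
# reading from $fh in chunks of $chunk bytes.
sub make_lexer {
    my ($fh, $chunk) = @_;
    $chunk ||= 4096;
    my $buf = '';
    my $eof = 0;

    # Drop the consumed prefix and append another chunk.
    my $refill = sub {
        return 0 if $eof;
        my $consumed = pos($buf) || 0;
        substr($buf, 0, $consumed) = '';
        my $n = read($fh, $buf, $chunk, length $buf);
        die "read failed: $!" unless defined $n;
        $eof = 1 if $n == 0;
        pos($buf) = 0;
        return $n;
    };

    return sub {
        while (1) {
            my $start = pos($buf) || 0;
            $buf =~ /\G\s+/gc;    # skip whitespace
            if (   $buf =~ /\G([a-zA-Z_][a-zA-Z0-9_\$]*)/gc
                || $buf =~ /\G(\S)/gc)
            {
                # A token touching the end of the buffer may continue in
                # the next chunk: rewind, refill, and match again.
                if (pos($buf) == length($buf) && !$eof) {
                    pos($buf) = $start;
                    $refill->();
                    next;
                }
                return $1;
            }
            # Nothing but whitespace left in the buffer.
            return unless $refill->();
        }
    };
}

# Usage: a tiny chunk size forces refills in the middle of tokens.
my $input = "foo  bar_baz + qux\n";
open my $fh, '<', \$input or die "cannot open in-memory file: $!";
my $next = make_lexer($fh, 4);
my @tokens;
while (defined(my $tok = $next->())) {
    push @tokens, $tok;
}
print join(' | ', @tokens), "\n";
```

Every rule that can match up to the end of the buffer needs this rewind-and-retry treatment, which is the coordination that cannot be pushed into the regex engine itself.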

I think better tools on this front would help Perl immensely, because Perl
in itself has far superior RE facilities to lex/flex/re2c, and the builtin
data structures and string manipulation facilities make it an excellent
proving ground for language recognizers and translators.

-Clint