Can someone "splain why this regex won"t work both ways?

Can someone "splain why this regex won"t work both ways?

am 14.04.2008 19:08:51 von spydox

I'm trying to find a repeated number in a string, like 122345 finds
22.

This works:

/(\d)\1/

This doesn't:

/\1(\d)/

I guess LLR parsing is to blame, but shouldn't the second example
first try to FIND a $1 then check to see if there is a \1, and repeat
that process moving L to R?

I though Perl sort of went to and fro trying to do matching. To me,
there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
preceeding it.

I was a little surprized this didn't work although I can sort of see
why in a way too. In some ways it seems to me that regexes should be
*disconnected* from parsing - just answer the question does this
match?

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 20:31:47 von Ben Morrow

Quoth spydox@gmail.com:
>
> I'm trying to find a repeated number in a string, like 122345 finds
> 22.
>
> This works:
>
> /(\d)\1/
>
> This doesn't:
>
> /\1(\d)/
>
> I guess LLR parsing is to blame, but shouldn't the second example
> first try to FIND a $1 then check to see if there is a \1, and repeat
> that process moving L to R?
>
> I though Perl sort of went to and fro trying to do matching. To me,
> there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
> preceeding it.

There are two separate operations here which you are confusing. First
perl parses the regex itself, and compiles it into an internal form.
Then it matches that regex against the string you provide. The second
will backtrack, under some circumstances; the first won't.

Ben

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 20:37:12 von 1usa

spydox@gmail.com wrote in
news:093bf887-729d-4400-8750-
6c91b21b478e@w4g2000prd.googlegroups.com
:

> I'm trying to find a repeated number in a string, like 122345
> finds 22.
>
> This works:
>
> /(\d)\1/
>
> This doesn't:
>
> /\1(\d)/
>
> I guess LLR parsing is to blame,

....

> I was a little surprized this didn't work although I can sort of
> see why in a way too. In some ways it seems to me that regexes
> should be *disconnected* from parsing - just answer the question
> does this match?

I don't look at this as a parsing issue. Rather, it is a "the
universe must make sense" kind of issue: The first match does not
exist before the first match. That makes sense to me. It may not
make sense to you.

Sinan
--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 20:51:21 von spydox

On Apr 14, 2:31 pm, Ben Morrow wrote:
> Quoth spy...@gmail.com:
>
>
>
>
>
> > I'm trying to find a repeated number in a string, like 122345 finds
> > 22.
>
> > This works:
>
> > /(\d)\1/
>
> > This doesn't:
>
> > /\1(\d)/
>
> > I guess LLR parsing is to blame, but shouldn't the second example
> > first try to FIND a $1 then check to see if there is a \1, and repeat
> > that process moving L to R?
>
> > I though Perl sort of went to and fro trying to do matching. To me,
> > there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
> > preceeding it.
>
> There are two separate operations here which you are confusing. First
> perl parses the regex itself, and compiles it into an internal form.
> Then it matches that regex against the string you provide. The second
> will backtrack, under some circumstances; the first won't.
>
> Ben

Understood, and I appreciate the insight. It makes sense.
Yet, when all else apparently *fails*, in my experience, and I've
heard MJD and others say this, Perl will "do its best" to match. To
me, unless it *also* tried backtracking, it gave up too soon..

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 20:57:46 von spydox

..
..
..
>
> > I guess LLR parsing is to blame,
>
..
..
>
> I don't look at this as a parsing issue. Rather, it is a "the
> universe must make sense" kind of issue: The first match does not
> exist before the first match. That makes sense to me. It may not
> make sense to you.
>

To me, like conventional pattern-recognition, of say two tanks next to
each other, the system should accept it whether the match is described
either way:

find a tank with another identical tank to it's left

*or*

find a tank with another identical tank to it's right


The system should have no *context-sensitivity* where only one of the
two matches. Sure, internally an algorithm may be scanning L to R or R
to L or whatever, but the user should not even be concerned with that,
at least in this case. I still think it gave up too soon- it should
have tried R to L (backtracking) when L to R failed.

Just IMHO, thank-you for your thoughts. This area seems just a bit
gray to me I'd be very interested in Damain or Mark's thoughts.

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 21:06:28 von Ben Morrow

Quoth spydox@gmail.com:
> On Apr 14, 2:31 pm, Ben Morrow wrote:
> > Quoth spy...@gmail.com:
> > >
> > > I'm trying to find a repeated number in a string, like 122345 finds
> > > 22.
> >
> > > This works:
> >
> > > /(\d)\1/
> >
> > > This doesn't:
> >
> > > /\1(\d)/
> >
> > > I guess LLR parsing is to blame, but shouldn't the second example
> > > first try to FIND a $1 then check to see if there is a \1, and repeat
> > > that process moving L to R?
> >
> > > I though Perl sort of went to and fro trying to do matching. To me,
> > > there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1 = 2
> > > preceeding it.
> >
> > There are two separate operations here which you are confusing. First
> > perl parses the regex itself, and compiles it into an internal form.
> > Then it matches that regex against the string you provide. The second
> > will backtrack, under some circumstances; the first won't.
>
> Understood, and I appreciate the insight. It makes sense.
> Yet, when all else apparently *fails*, in my experience, and I've
> heard MJD and others say this, Perl will "do its best" to match. To
> me, unless it *also* tried backtracking, it gave up too soon..

No, you're still not understanding. Perl will only backtrack *while
trying to match*. Compiling the regex comes long before that.

Ben

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 21:11:45 von Willem

spydox@gmail.com wrote:
) Understood, and I appreciate the insight. It makes sense.
) Yet, when all else apparently *fails*, in my experience, and I've
) heard MJD and others say this, Perl will "do its best" to match. To
) me, unless it *also* tried backtracking, it gave up too soon..

That's not what backtracking means.


SaSW, Willem
--
Disclaimer: I am in no way responsible for any of the statements
made in the above text. For all I know I might be
drugged or something..
No I'm not paranoid. You all think I'm paranoid, don't you !
#EOT

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 21:51:13 von INVALID_SEE_SIG

In the previous article, wrote:
> > > I guess LLR parsing is to blame,
> >
> .
> .
> >
> > I don't look at this as a parsing issue. Rather, it is a "the
> > universe must make sense" kind of issue: The first match does not
> > exist before the first match. That makes sense to me. It may not
> > make sense to you.
> >
>
> To me, like conventional pattern-recognition, of say two tanks next to
> each other, the system should accept it whether the match is described
> either way:
>
> find a tank with another identical tank to it's left
>
> *or*
>
> find a tank with another identical tank to it's right

A better phrasing:

find a tank, then find another one to its right

*or*

find another one to its left, then find a tank

One of these phrasings makes sense; the other does not. Or, rather,
the other doesn't and one of the phrasings makes sense.

If you want a more formal justification, here's what the Camel Book
says about these. Note the two instances of the word "later":

1.7.4. Backreferences

[...] A pair of parentheses around a part of a regular
expression causes whatever was matched by that part to be
remembered for later use. It doesn't change what the part
matches, so /\d+/ and /(\d+)/ will still match as many digits
as possible, but in the latter case they will be remembered in
a special variable to be backreferenced later.
--
_+_ From the catapult of |If anyone disagrees with any statement I make, I
_|70|___:)=}- J.D. Baldwin |am quite prepared not only to retract it, but also
\ / baldwin@panix.com|to deny under oath that I ever made it. -T. Lehrer
***~~~~----------------------------------------------------- ------------------

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 23:43:38 von 1usa

spydox@gmail.com wrote in
news:e6278092-e663-4ea6-8f07-40d65faeb551
@f63g2000hsf.googlegroups.co
m:

[ please do not snip attributions ]

>> > I guess LLR parsing is to blame,
>>
>> I don't look at this as a parsing issue. Rather, it is a "the
>> universe must make sense" kind of issue: The first match does not
>> exist before the first match. That makes sense to me. It may not
>> make sense to you.
>>
>
> To me, like conventional pattern-recognition, of say two tanks
> next to each other, the system should accept it whether the match
> is described either way:
>
> find a tank with another identical tank to it's left
>
> *or*
>
> find a tank with another identical tank to it's right
>
>
> The system should have no *context-sensitivity* where only one of
> the two matches. Sure, internally an algorithm may be scanning L
> to R or R to L or whatever, but the user should not even be
> concerned with that, at least in this case. I still think it gave
> up too soon- it should have tried R to L (backtracking) when L to
> R failed.

What you seem to want is a "match two identical characters"
operator. For this particular case, you can achieve that by using:

=for example

my @strings = qw( 1222345 1233345 );

s/00|11|22|33|44|55|66|77|88|99// for @strings;

print "$_\n" for @strings;

=cut

When you use a character class, every element of that class is
considered equivalent to every other one. So, for example, when you
write

/\d{2}/

that does find two characters that are in the same equivalence
class.

The tank analogy works perftectly here because there are no two
identical tanks in the world. Instead, there are equivalence classes
of tanks. Tanks that are the same model, tanks in the same unit etc.

If what you want is to say,

find a tank, then find another tank that is the same
model as the one you just found

well, that is equivalent to /(\d)\1/

J. D. Baldwin gives perfect examples of why /\1(\d)/ does not make
sense: Finding another tank in the same equivalence class as the one
you first found comes after first finding a tank.

> Just IMHO, thank-you for your thoughts. This area seems just a bit
> gray to me I'd be very interested in Damain or Mark's thoughts.

s/Damain/Damian/

My feeble mind looks at the following:

#!/usr/bin/perl

use strict;
use warnings;

use 5.010;

for ( my @a = qw( 1222345 1233345 ) ) {
s/(?\d)\K\k// and print "$_\n";
}

for ( my @a = qw( 1222345 1233345 ) ) {
s/(?\d)\K\k+// and print "$_\n";
}

for ( my @a = qw( 1222345 1233345 ) ) {
s/(?\d)\k// and print "$_\n";
}

__END__

thinks that the third one is the most natural (that is, find a tank,
then find another tank in the same equivalence class) to the other
ones.

Sinan

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)

comp.lang.perl.misc guidelines on the WWW:
http://www.rehabitation.com/clpmisc/

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 23:52:28 von xhoster

Ben Morrow wrote:
> Quoth spydox@gmail.com:
> > On Apr 14, 2:31 pm, Ben Morrow wrote:
> > > Quoth spy...@gmail.com:
> > > >
> > > > I'm trying to find a repeated number in a string, like 122345 finds
> > > > 22.
> > >
> > > > This works:
> > >
> > > > /(\d)\1/
> > >
> > > > This doesn't:
> > >
> > > > /\1(\d)/
> > >
> > > > I guess LLR parsing is to blame, but shouldn't the second example
> > > > first try to FIND a $1 then check to see if there is a \1, and
> > > > repeat that process moving L to R?
> > >
> > > > I though Perl sort of went to and fro trying to do matching. To me,
> > > > there IS a /\1(\d)/ in the string since $1 is 2, and there is a \1
> > > > = 2 preceeding it.
> > >
> > > There are two separate operations here which you are confusing. First
> > > perl parses the regex itself, and compiles it into an internal form.
> > > Then it matches that regex against the string you provide. The second
> > > will backtrack, under some circumstances; the first won't.
> >
> > Understood, and I appreciate the insight. It makes sense.
> > Yet, when all else apparently *fails*, in my experience, and I've
> > heard MJD and others say this, Perl will "do its best" to match. To
> > me, unless it *also* tried backtracking, it gave up too soon..
>
> No, you're still not understanding. Perl will only backtrack *while
> trying to match*. Compiling the regex comes long before that.

I think that that is what he is talking about, the when trying to match
part. His use of "parsing" in the original question was ill-advised, but I
think you latched onto the the bad phrasing and rather than the intended
question, and now won't let him correct his poor phrasing. First perl
parses and compiles the regular expression, then it uses that compiled
expression to match (or loosely speaking "parse") the target string.

Perl parses and compiles /\1(.)/ without error or warning (which surprised
me). But then what does it do with it?

Conceptually, it could temporarily treat the \1 as ".*", and then when/if
the capture is matched go back and verify that it is the same as the thing
previously matched by the tentative .* cum \1. I don't know that I would
call this backtracking (as the OP seems to be doing), but I can't think of
anything obviously better to call it. Or it could reorder things give
an identical compiled regex as /(.)\1/. I don't know if these two things
would give the same answer as each other in all cases (if so, the latter
would surely be faster).

I think that that is what the OP thought it should do. Obviously, Perl
doesn't do either of those thing. I can't figure out what it does do. I
thought it would treat \1 preceding any capture as the empty string, but
apparently it doesn't do that, either. It seems to act as something
unmatchable.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 23:53:32 von hjp-usenet2

On 2008-04-14 18:57, spydox@gmail.com wrote:
>> I don't look at this as a parsing issue. Rather, it is a "the
>> universe must make sense" kind of issue: The first match does not
>> exist before the first match. That makes sense to me. It may not
>> make sense to you.
>>
>
> To me, like conventional pattern-recognition, of say two tanks next to
> each other, the system should accept it whether the match is described
> either way:
>
> find a tank with another identical tank to it's left
>
> *or*
>
> find a tank with another identical tank to it's right
>
>
> The system should have no *context-sensitivity* where only one of the
> two matches. Sure, internally an algorithm may be scanning L to R or R
> to L or whatever, but the user should not even be concerned with that,
> at least in this case. I still think it gave up too soon- it should
> have tried R to L (backtracking) when L to R failed.

Backtracking doesn't mean scanning right to left. Backtracking means to
go back to the last point where you had a choice and try the other
alternative(s).

So, for example if you have a pattern /foo(bar|baz)/, after matching
"foo", you have a choice between trying to match "bar" or "baz". The
regex engine will try to match "bar" first, and if that fails, it will
backtrack to the point before it tried that and then try to match "baz"
instead.

But in a pattern like /\1(a)/ there is no choice: It needs to start by
matching the string in the first capture group, but that hasn't been
defined yet, so it must fail. (Well, it could try all possible strings,
but that would be extremely inefficient).

hp

Re: Can someone "splain why this regex won"t work both ways?

am 14.04.2008 23:58:04 von xhoster

news@baldwin.users.panix.com wrote:
> In the previous article, wrote:
> >
> > find a tank with another identical tank to it's left
> >
> > *or*
> >
> > find a tank with another identical tank to it's right
>
> A better phrasing:
>
> find a tank, then find another one to its right
>
> *or*
>
> find another one to its left, then find a tank
>
> One of these phrasings makes sense; the other does not. Or, rather,
> the other doesn't and one of the phrasings makes sense.
>
> If you want a more formal justification, here's what the Camel Book
> says about these. Note the two instances of the word "later":

I think that that is his point, an objection to the notion that
left and right typographically equate to "earlier" and "later"
chronologically.

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Re: Can someone "splain why this regex won"t work both ways?

am 15.04.2008 00:03:16 von Ilya Zakharevich

[A complimentary Cc of this posting was sent to

], who wrote in article <093bf887-729d-4400-8750-6c91b21b478e@w4g2000prd.googlegroups.com>:
>
> I'm trying to find a repeated number in a string, like 122345 finds
> 22.
>
> This works:
>
> /(\d)\1/
>
> This doesn't:
>
> /\1(\d)/

This depends on what you mean by "works". It works in the sense that
it does not match (as it should not). I do not find it documented in
perlre, but \3 will fail to match if group 3 did not match "yet".

Hope this helps,
Ilya

P.S. perl -Mre=debugcolor -wle "q(aa) =~ /\1(a)/"

Re: Can someone "splain why this regex won"t work both ways?

am 15.04.2008 01:28:48 von Ben Morrow

Quoth xhoster@gmail.com:
> Ben Morrow wrote:
> >
> > No, you're still not understanding. Perl will only backtrack *while
> > trying to match*. Compiling the regex comes long before that.
>
> I think that that is what he is talking about, the when trying to match
> part. His use of "parsing" in the original question was ill-advised, but I
> think you latched onto the the bad phrasing and rather than the intended
> question, and now won't let him correct his poor phrasing. First perl
> parses and compiles the regular expression, then it uses that compiled
> expression to match (or loosely speaking "parse") the target string.

You're right, I was misunderstanding the OP's misunderstanding. :)

> Perl parses and compiles /\1(.)/ without error or warning (which surprised
> me). But then what does it do with it?

I was assuming (without having tried it) that the regex was failing at
the compile stage. It seems I was wrong... :(

> Conceptually, it could temporarily treat the \1 as ".*", and then when/if
> the capture is matched go back and verify that it is the same as the thing
> previously matched by the tentative .* cum \1.

Something like this can be done with

/(.*)(a)(??{ $1 eq $2 ? "(?:)" : "(?!)" })/

using a code assertion to insert either a 'succeed' or a 'fail and
backtrack' item into the regex at runtime. Not that I'd recommend this,
of course... :)

> I don't know that I would
> call this backtracking (as the OP seems to be doing), but I can't think of
> anything obviously better to call it. Or it could reorder things give
> an identical compiled regex as /(.)\1/. I don't know if these two things
> would give the same answer as each other in all cases (if so, the latter
> would surely be faster).
>
> I think that that is what the OP thought it should do. Obviously, Perl
> doesn't do either of those thing. I can't figure out what it does do. I
> thought it would treat \1 preceding any capture as the empty string, but
> apparently it doesn't do that, either. It seems to act as something
> unmatchable.

All the $N start as undef, which is unmatchable as you found (-Mre=debug
is useful for finding out what is going on). Whenever perl backtracks to
retry part of a match, it clears any $N set by the part of the match it
is backtracking over, so /\1(.)/ couldn't match even if it did
backtrack, as $1 would be undef again by the time it got to retry the
\1. (Perl doesn't in fact backtrack, as it knows nothing has changed so
this would be an infinite loop.)

It is, however, possible to get \1 to match when it appears earlier in
the expression than the first brackets (which is why it's not a syntax
error); you just have to make sure it gets set first. For instance,

"abac" =~ /^(?:\1c|(a)b)+$/

matches. The first time through the +, $1 is undef so the \1c part fails;
but the (a)b part succeeds so $1 gets set. Then it goes round the + loop
again, and this time $1 is 'a' so the first branch can succeed.

Ben

Re: Can someone "splain why this regex won"t work both ways?

am 17.04.2008 00:54:39 von Charles DeRykus

On Apr 14, 4:28 pm, Ben Morrow wrote:
> ...
>
> It is, however, possible to get \1 to match when it appears earlier in
> the expression than the first brackets (which is why it's not a syntax
> error); you just have to make sure it gets set first. For instance,
>
> "abac" =~ /^(?:\1c|(a)b)+$/
>
> matches. The first time through the +, $1 is undef so the \1c part fails;
> but the (a)b part succeeds so $1 gets set. Then it goes round the + loop
> again, and this time $1 is 'a' so the first branch can succeed.
>

Slightly different example sans alternation could work too if $N is
optional:

"abab" =~ /(?:(\2)?(b))+/

I'd have the same enthusiasm for a root canal though...

--
Charles DeRykus

Re: Can someone "splain why this regex won"t work both ways?

am 17.04.2008 05:06:23 von Ilya Zakharevich

[A complimentary Cc of this posting was sent to
comp.llang.perl.moderated
], who wrote in article :
> Slightly different example sans alternation could work too if $N is
> optional:
>
> "abab" =~ /(?:(\2)?(b))+/
>
> I'd have the same enthusiasm for a root canal though...

Extra parens are considered harmfull:

perl -wle "q(abab) =~ /(?: \1? (b) )+/x or die"

(not much beauty gained or lost, of course...).

Yours,
Ilya