FAQ 6.19 What good is "\G" in a regular expression?

Posted 18.10.2007 15:03:03 by PerlFAQ Server

This is an excerpt from the latest version of perlfaq6.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

--------------------------------------------------------------------

6.19: What good is "\G" in a regular expression?


You use the "\G" anchor to start the next match on the same string where
the last match left off. The regular expression engine cannot skip over
any characters to find the next match with this anchor, so "\G" is
similar to the beginning of string anchor, "^". The "\G" anchor is
typically used with the "g" flag. It uses the value of "pos()" as the
position to start the next match. As the match operator makes successive
matches, it updates "pos()" with the position of the next character past
the last match (or the first character of the next match, depending on
how you like to look at it). Each string has its own "pos()" value.
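
As a minimal sketch of how "pos()" moves, the following records the
position after each successful match. The variable names and the sample
string are made up for illustration and are not part of the FAQ.

```perl
use strict;
use warnings;

my $string = "aXbXc";
my @positions;

# In scalar context, each successful m//g match sets pos($string)
# to the offset just past the end of that match.
while ( $string =~ m/X/g ) {
    push @positions, pos($string);
}

print "pos after each match: @positions\n";    # prints "2 4"

# Each string has its own pos(); a string that has never been
# matched with /g has no position yet.
my $other = "aXbXc";
print "pos of other string: ", ( defined pos($other) ? pos($other) : "undef" ), "\n";
```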

Suppose you want to match all consecutive pairs of digits in a string
like "1122a44" and stop matching when you encounter non-digits. You want
to match 11 and 22 but the letter shows up between 22 and 44 and you
want to stop at "a". Simply matching pairs of digits skips over the "a"
and still matches 44.

$_ = "1122a44";
my @pairs = m/(\d\d)/g; # qw( 11 22 44 )

If you use the "\G" anchor, you force the match after 22 to start with
the "a". The regular expression cannot match there since it does not
find a digit, so the next match fails and the match operator returns the
pairs it already found.

$_ = "1122a44";
my @pairs = m/\G(\d\d)/g; # qw( 11 22 )

You can also use the "\G" anchor in scalar context. You still need the
"g" flag.

$_ = "1122a44";
while( m/\G(\d\d)/g )
{
    print "Found $1\n";
}

After the match fails at the letter "a", perl resets "pos()" and the
next match on the same string starts at the beginning.

$_ = "1122a44";
while( m/\G(\d\d)/g )
{
    print "Found $1\n";
}

print "Found $1 after while" if m/(\d\d)/g; # finds "11"

You can disable "pos()" resets on fail with the "c" flag, documented in
perlop and perlreref. Subsequent matches start where the last successful
match ended (the value of "pos()") even if a match on the same string
has failed in the meantime. In this case, the match after the "while()"
loop starts at the "a" (where the last match stopped), and since it does
not use any anchor it can skip over the "a" to find 44.

$_ = "1122a44";
while( m/\G(\d\d)/gc )
{
    print "Found $1\n";
}

print "Found $1 after while" if m/(\d\d)/g; # finds "44"
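
To see the difference directly, you can inspect "pos()" after the
failed match. This is only a sketch: the variable names are
illustrative, and the explicit assignment to "pos()" is there just to
restart the second run from the beginning of the string.

```perl
use strict;
use warnings;

my $text = "1122a44";

# Without /c: the failed match at "a" resets pos() to undef.
1 while $text =~ m/\G\d\d/g;
my $pos_without_c = pos($text);

# With /c: the failed match leaves pos() where the last
# successful match ended.
pos($text) = 0;    # start over explicitly
1 while $text =~ m/\G\d\d/gc;
my $pos_with_c = pos($text);

printf "without c: %s\n", defined $pos_without_c ? $pos_without_c : "undef";
printf "with c:    %s\n", $pos_with_c;
```

The first line of output reports "undef" and the second reports 4, the
offset of the "a".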

Typically you use the "\G" anchor with the "c" flag when you want to try
a different match if one fails, such as in a tokenizer. Jeffrey Friedl
offers this example, which works in Perl 5.004 or later.

while (<>) {
    chomp;
    PARSER: {
        m/ \G( \d+\b )/gcx   && do { print "number: $1\n"; redo; };
        m/ \G( \w+ )/gcx     && do { print "word: $1\n"; redo; };
        m/ \G( \s+ )/gcx     && do { print "space: $1\n"; redo; };
        m/ \G( [^\w\d]+ )/gcx && do { print "other: $1\n"; redo; };
    }
}

For each line, the "PARSER" loop first tries to match a series of digits
followed by a word boundary. This match has to start at the place the
last match left off (or the beginning of the string on the first match).
Since "m/ \G( \d+\b )/gcx" uses the "c" flag, if the string does not
match that regular expression, perl does not reset pos() and the next
match starts at the same position to try a different pattern.
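
The same idea can be sketched as a subroutine that collects tokens
instead of printing them. This is a variation for illustration, not
Friedl's code: the name "tokenize", the token layout, and the sample
input are all assumptions.

```perl
use strict;
use warnings;

# tokenize() is a hypothetical helper, not part of the FAQ's example.
# It returns a list of [ type, text ] pairs for one line of input.
sub tokenize {
    my ($line) = @_;
    my @tokens;

    pos($line) = 0;    # make the starting position explicit
    while ( pos($line) < length $line ) {
        # Each pattern is anchored with \G, so it must match exactly at
        # pos($line). The c flag keeps pos() in place when a pattern
        # fails, so the next alternative is tried at the same spot.
        if    ( $line =~ m/ \G( \d+\b )/gcx )    { push @tokens, [ number => $1 ] }
        elsif ( $line =~ m/ \G( \w+ )/gcx )      { push @tokens, [ word   => $1 ] }
        elsif ( $line =~ m/ \G( \s+ )/gcx )      { push @tokens, [ space  => $1 ] }
        elsif ( $line =~ m/ \G( [^\w\d]+ )/gcx ) { push @tokens, [ other  => $1 ] }
        else  { last }    # safety net so the loop cannot spin forever
    }
    return @tokens;
}

my @tokens = tokenize("abc 42, x9");
print join( " ", map { "$_->[0]($_->[1])" } @tokens ), "\n";
```

Note that the "other" pattern also accepts whitespace, so in this
sample the comma and the space after it come back as a single token.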



--------------------------------------------------------------------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.