FAQ 6.11 Can I use Perl regular expressions to match balanced text?

FAQ 6.11 Can I use Perl regular expressions to match balanced text?

am 31.12.2007 03:03:02 von PerlFAQ Server

This is an excerpt from the latest version perlfaq6.pod, which
comes with the standard Perl distribution. These postings aim to
reduce the number of repeated questions as well as allow the community
to review and update the answers. The latest version of the complete
perlfaq is at http://faq.perl.org .

------------------------------------------------------------ --------

6.11: Can I use Perl regular expressions to match balanced text?



Historically, Perl regular expressions were not capable of matching
balanced text. As of more recent versions of perl including 5.6.1
experimental features have been added that make it possible to do this.
Look at the documentation for the (??{ }) construct in recent perlre
manual pages to see an example of matching balanced parentheses. Be sure
to take special notice of the warnings present in the manual before
making use of this feature.

CPAN contains many modules that can be useful for matching text
depending on the context. Damian Conway provides some useful patterns in
Regexp::Common. The module Text::Balanced provides a general solution to
this problem.

One of the common applications of balanced text matching is working with
XML and HTML. There are many modules available that support these needs.
Two examples are HTML::Parser and XML::Parser. There are many others.

An elaborate subroutine (for 7-bit ASCII only) to pull out balanced and
possibly nested single chars, like "`" and "'", "{" and "}", or "(" and
")" can be found in
http://www.cpan.org/authors/id/TOMC/scripts/pull_quotes.gz .

The C::Scan module from CPAN also contains such subs for internal use,
but they are undocumented.



------------------------------------------------------------ --------

The perlfaq-workers, a group of volunteers, maintain the perlfaq. They
are not necessarily experts in every domain where Perl might show up,
so please include as much information as possible and relevant in any
corrections. The perlfaq-workers also don't have access to every
operating system or platform, so please include relevant details for
corrections to examples that do not work on particular platforms.
Working code is greatly appreciated.

If you'd like to help maintain the perlfaq, see the details in
perlfaq.pod.

Re: FAQ 6.11 Can I use Perl regular expressions to match balanced text?

am 31.12.2007 15:49:31 von espie

In article ,
PerlFAQ Server wrote:
>6.11: Can I use Perl regular expressions to match balanced text?
>
>
>
> Historically, Perl regular expressions were not capable of matching
> balanced text. As of more recent versions of perl including 5.6.1
> experimental features have been added that make it possible to do this.
> Look at the documentation for the (??{ }) construct in recent perlre
> manual pages to see an example of matching balanced parentheses. Be sure
> to take special notice of the warnings present in the manual before
> making use of this feature.

I've been using the following trick for many practical applications.


It helps to know why regexps cannot match balanced texts. Specifically,
normal regexps match rational languages, and rational languages cannot
count to an arbitrary number.

Balanced texts are algebraic languages, and they can be generated by
a grammar. The simplest example, with just the parentheses, would be:
D -> (D)D |

This is just one possible grammar, this one is non-ambiguous, which means
less combinatorial explosion later on.

However, one can use this grammar to build a good approximation of it as
a regular expression.

Specifically, set D0 = 0
then build Dn = (D_{n-1})D_{n-1} | empty

In perl terms:

$e = "";
for (1 .. n) {
$e = "(?:\($e\)$e)|";
}

then Dn is the rational language of all balanced expressions that do not
nest more than n open parentheses.

This can be used to build a regexp that matches all balanced expressions up
to an set level of nesting parentheses automatically. In many cases,
you don't have to go higher than n=15 or 20...

Re: FAQ 6.11 Can I use Perl regular expressions to match balanced text?

am 01.01.2008 18:37:55 von brian d foy

In article , Marc Espie
wrote:

> In article ,
> PerlFAQ Server wrote:
> >6.11: Can I use Perl regular expressions to match balanced text?
> >
> >
> >
> > Historically, Perl regular expressions were not capable of matching
> > balanced text. As of more recent versions of perl including 5.6.1
> > experimental features have been added that make it possible to do this.
> > Look at the documentation for the (??{ }) construct in recent perlre
> > manual pages to see an example of matching balanced parentheses. Be sure
> > to take special notice of the warnings present in the manual before
> > making use of this feature.
>
> I've been using the following trick for many practical applications.
>
>
> It helps to know why regexps cannot match balanced texts.

That just changed with Perl 5.10, I think, with the recursive patterns,
so I'll need to update this answer. :)