How do I use a variable-width positive look behind assertion?

on 06.11.2007 23:52:14 by jl_post

Hi,

I have some code that splits text right after a line containing
only a single dot. To do this, I used a positive look-behind
assertion, like in this sample script:

#!/usr/bin/perl

use strict;
use warnings;

my $text = <<"END_OF_TEXT";
Line 1
.
Line 2
.
Line 3
END_OF_TEXT

# Use a positive look-behind assertion
# to split $text right after a dot on
# a line by itself:
my @elements = split m/(?<=\n\.\n)/, $text;

use Data::Dumper;
print Dumper @elements;

__END__

Running this program gives the output:

$VAR1 = 'Line 1
.
';
$VAR2 = 'Line 2
.
';
$VAR3 = 'Line 3
';

Basically, $text was split into three elements, with each element
(except for the last) ending with a dot (on a line by itself).

This positive look-behind assertion works great if the dot is truly
on a line by itself. But if there is leading and/or trailing
whitespace around the dot, the regular expression
m/(?<=\n\.\n)/ won't split after that line.

What I'm looking for is to do something like this:

#!/usr/bin/perl

use strict;
use warnings;

my $text = <<"END_OF_TEXT";
Line 1
  .
Line 2
  .
Line 3
END_OF_TEXT

# Use a positive look-behind assertion
# to split $text right after a dot on
# a line by itself:
my @elements = split m/(?<=\n\s*\.\s*\n)/, $text;

use Data::Dumper;
print Dumper @elements;

__END__

(Note that the "." lines now have leading whitespace and that
split()'s regular expression has now changed to
m/(?<=\n\s*\.\s*\n)/ .)

The regular expression in split() now checks for leading and
trailing whitespace, but when I try to run the script I get the
following error:

Variable length lookbehind not implemented in regex; marked by <--
HERE in m/(?<=\n\s*\.\s*\n) <-- HERE / at splitOnDot.pl line 17.

This error isn't surprising, as "perldoc perlre" says that a
positive look-behind assertion "works only for fixed-width
look-behind."
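
A minimal sketch of that limitation, for anyone who wants to check it on their own Perl. The patterns are held in strings (an illustrative choice, as are the hash names) so that a compile failure can be trapped with eval at run time. The exact wording of the error may differ between Perl versions, so the sketch only records whether each pattern compiles.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Patterns are kept in single-quoted strings so that a compile
# failure happens at run time, where eval can trap it.
my %patterns = (
    fixed    => '(?<=\n\.\n)',          # always 3 characters wide
    variable => '(?<=\n\s*\.\s*\n)',    # unbounded width because of \s*
);

my %compiled;
for my $name (sort keys %patterns) {
    my $re = eval { qr/$patterns{$name}/ };
    $compiled{$name} = defined $re ? 1 : 0;
    print "$name-width look-behind: ",
        ($compiled{$name} ? "compiles" : "rejected"), "\n";
}
```

The fixed-width pattern should compile, while the one containing \s* is expected to be rejected because its width has no upper bound.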

Therein lies my problem. I want to split right after a line that
contains exactly one dot and an arbitrary amount of whitespace, but I
don't think I can do it with split() using a simple regular
expression. Or can I?

I have developed a work-around, and that's to replace the split()
line with the following lines:

my @elements = split m/(\n\s*\.\s*\n)/, $text;
{
    my @newElements;
    push @newElements, shift(@elements) . shift(@elements)
        while @elements > 1;
    push @newElements, @elements;

    @elements = @newElements;
}

Because of the parentheses in split()'s regular expression (notice it
no longer uses a look-behind assertion), the line containing a dot
(and whitespace) on a line by itself is now split out into a separate
element. The block of code that follows it is just "reassembling" the
elements so that the "dot-line" is now attached to the previous
element.
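
To make the intermediate list visible, here is a small self-contained sketch of the same idea. The sample text and the variable names (@pieces, $piece_count) are assumptions for illustration only; splice is used in place of the two shift() calls, but the pairing logic is the same.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text (an assumption): dot lines with leading/trailing whitespace.
my $text = "Line 1\n  .\nLine 2\n .  \nLine 3\n";

# With capturing parentheses, split() returns the separators as well,
# so the list alternates: text, separator, text, separator, ..., text.
my @pieces      = split m/(\n\s*\.\s*\n)/, $text;
my $piece_count = @pieces;

# Glue each separator back onto the block that precedes it.
my @elements;
while (@pieces > 1) {
    my ($block, $separator) = splice @pieces, 0, 2;
    push @elements, $block . $separator;
}
push @elements, @pieces;    # whatever follows the last separator

print "$piece_count pieces reassembled into ",
    scalar(@elements), " elements\n";
```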

This method does what I want, but it feels like a hack. Is there a
better way to do what I want?

Thanks in advance for any advice.

-- Jean-Luc

Re: How do I use a variable-width positive look behind assertion?

on 07.11.2007 00:38:49 by Jim Gibson

In article <1194389534.282536.278580@o80g2000hse.googlegroups.com>,
<"jl_post@hotmail.com"> wrote:

> Hi,
>
> I have some code that splits text right after a line containing
> only a single dot. To do this, I used a positive look-behind
> assertion, like in this sample script:
>
>
> #!/usr/bin/perl
>
> use strict;
> use warnings;
>
> my $text = <<"END_OF_TEXT";
> Line 1
> .
> Line 2
> .
> Line 3
> END_OF_TEXT
>
> # Use a positive look-behind assertion
> # to split $text right after a dot on
> # a line by itself:
> my @elements = split m/(?<=\n\.\n)/, $text;
>
> use Data::Dumper;
> print Dumper @elements;
>
> __END__
>
>
> Running this program gives the output:
>
> $VAR1 = 'Line 1
> .
> ';
> $VAR2 = 'Line 2
> .
> ';
> $VAR3 = 'Line 3
> ';
>
> Basically, $text was split into three elements, with each element
> (except for the last) ending with a dot (on a line by itself).
>
> This positive look-behind assertion works great if the dot is truly
> on a line by itself. But if there is leading and/or trailing
> whitespace around the dot, the regular expression
> m/(?<=\n\.\n)/ won't split after that line.

> my @elements = split m/(\n\s*\.\s*\n)/, $text;
> {
> my @newElements;
> push @newElements, shift(@elements) . shift(@elements)
> while @elements > 1;
> push @newElements, @elements;
>
> @elements = @newElements;
> }
>
> Because of the parentheses in split()'s regular expression (notice it
> no longer uses a look-behind assertion), the line containing a dot
> (and whitespace) on a line by itself is now split out into a separate
> element. The block of code that follows it is just "reassembling" the
> elements so that the "dot-line" is now attached to the previous
> element.
>
> This method does what I want, but it feels like a hack. Is there a
> better way to do what I want?

Capture the (lines + separator line) substrings, plus an alternation to
get the substring after the last separator:

#!/usr/local/bin/perl
use strict;
use warnings;
use Data::Dumper;

my $text = <<"END_OF_TEXT";
Line 1
.
Line 2
.
Line 3
END_OF_TEXT

my @elements = ($text =~
    m{ \G ( (?:.*? \n \s* \. \s* \n) | (?:.* \z) ) }gxs);

print Dumper @elements;

__OUTPUT__

$VAR1 = 'Line 1
.
';
$VAR2 = 'Line 2
.
';
$VAR3 = 'Line 3
';
$VAR4 = '';
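
The empty $VAR4 appears because the second alternative, (?:.* \z), can also match the empty string once the end of the text has been reached. As a sketch (the sample text and the @raw name are assumptions for illustration), the empty match can be filtered out with grep:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text (an assumption): dot lines with leading whitespace.
my $text = "Line 1\n .\nLine 2\n .\nLine 3\n";

# Same pattern as above; the final match at end-of-string is empty.
my @raw = ($text =~
    m{ \G ( (?:.*? \n \s* \. \s* \n) | (?:.* \z) ) }gxs);

# Keep only the non-empty captures.
my @elements = grep { length } @raw;

print scalar(@raw), " raw matches, ",
    scalar(@elements), " non-empty elements\n";
```

Using length rather than plain truthiness keeps a block that happens to be the string "0".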

--
Jim Gibson


Re: How do I use a variable-width positive look behind assertion?

on 07.11.2007 01:25:04 by rvtol+news

jl_post@hotmail.com wrote:

You take far too many words and lines to explain your simple problem.
Further, the subject doesn't mention the problem at hand, but your
problem with some assumed way to solve it.

> I want to split right after a line that
> contains exactly one dot and an arbitrary amount of whitespace, but I
> don't think I can do it with split() using a simple regular
> expression. Or can I?

If you can afford to lose the separating lines:

split /^[[:blank:]]*\.[[:blank:]]*\n/m, $text;

If you want to keep the separating lines too:

split /(^ [[:blank:]]* \. [[:blank:]]* \n )/mx, $text;
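
A quick sketch of what the two variants return, using made-up sample text (the text and the names @without and @with are assumptions for illustration). Note that [[:blank:]] matches only spaces and tabs, so, unlike \s, it cannot run across a newline into the next line.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Sample text (an assumption): one dot line with leading whitespace,
# one with trailing whitespace.
my $text = "Line 1\n .\nLine 2\n.  \nLine 3\n";

# Variant 1: the separating lines are discarded.
my @without = split /^[[:blank:]]*\.[[:blank:]]*\n/m, $text;

# Variant 2: the capture group makes split() return the separators too.
my @with = split /(^ [[:blank:]]* \. [[:blank:]]* \n )/mx, $text;

print scalar(@without), " fields without separators\n";
print scalar(@with),    " fields with separators\n";
```

With the second variant the separators sit at the odd indexes of the list, so they still have to be joined to the preceding block if the blocks are to stay whole.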

But why use split() at all? Alternative that keeps the blocks together:

#!/usr/bin/perl
use strict;
use warnings;

my $text = join "", <DATA>;

my @elements =
    $text =~
        m{ .*?
           (?:
               ^
               [[:blank:]]*
               [.]
               [[:blank:]]*
               \n
           |
               \z
           )
         }msxg;

print "[$_]\n" for grep $_, @elements;

__DATA__
Line 1
.
Line 2
.
.
Line 3

(Things just get a bit more difficult when there is no newline at the
end of the file, and when there is an embedded empty block.)

--
Affijn, Ruud

"Gewoon is een tijger."