Is there a maximum size to Perl strings?

on 05.01.2005 16:29:00 by Bryan Feeney

I've noticed this problem in a message board script of mine lately. The
site has a file in the Java Properties format listing various ways of
blocking users - e.g. usernames, email addresses, or message content.
Each is a list of regexps.

Here's a sample from the file

# Content is a Perl RE, however the . operator doesn't work, nor
# can you specify ranges using the {n,m} notation. Unless otherwise
# specified (using ^ and $) the given text is assumed to be a fragment,
# not the entirety.
content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk,
(?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
MONEY, \
THAI\s*MASSAGE, (?i)www.cgispy.com, \
(?i)<textarea, (?i)textarea>, \
(?i)Paypal, (?i)\$1.00\s+Bill, \
DOLLARS, \
GUARANTE+D?

As you can see, it's a series of comma-delimited regexps, which has been
broken onto multiple lines using the '\' character.

It worked fine until it grew in size (that's about a third of what I
have for content). It appears that the bottom of the section isn't being
read.

What the code does for reading in sections is easy enough:

my %blocks;
while (not eof (BLKS))
{   my ($key, $value) = split (/\s*[=:]\s*/, <BLKS>);
    next if ($key =~ /^#/);
    $key =~ s/^\s+//gi;
    while ($value =~ /\\\s*$/g and not eof (BLKS))
    {   $value =~ s/\\\s*$//g;

        my $valueContinued = <BLKS>;
        $value .= $valueContinued;
    }
    $value =~ s/\s*$//sgi;
    $value =~ s/^\s*//sgi;
    $value =~ s/\r?\n//gi;

    $blocks{lc ($key)} = $value;
}

# The content block accepts minor REGEXP patterns. However dots are out!
$blocks{content} =~ s/\./\\\./gi;

@blockedContent = split (/\s*,\s*/, $blocks{content});

When a message is posted, the code then loops through every regexp in
the @blockedContent array, applying it in turn, and dying if any of them
match. Does anyone know why the whole of the block isn't being read in?
Is it a problem with String size, or the size of an array?
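
For reference, the matching step described above could look roughly like the sketch below. The helper name first_blocked_match and the sample patterns are made up for illustration; the real script may well do this differently.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the first blocked pattern that matches
# the message, or undef when none of them does.
sub first_blocked_match {
    my ( $message, @patterns ) = @_;
    for my $pattern (@patterns) {
        # Compile each pattern; a leading (?i) still applies after qr//.
        my $re = eval { qr/$pattern/ };
        next unless defined $re;    # skip patterns that fail to compile
        return $pattern if $message =~ $re;
    }
    return;
}

# Sample patterns (assumed), with dots already escaped as in the script.
my @blockedContent = ( '(?i)www\.ebay\.', 'THAI\s*MASSAGE', 'GUARANTE+D?' );

my $hit = first_blocked_match( 'Visit WWW.EBAY.com today', @blockedContent );
print defined $hit ? "blocked by: $hit\n" : "allowed\n";
```

Returning the pattern that matched, rather than dying straight away, makes it easier to log which rule rejected a post.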

Thanks
--
Bryan

Re: Is there a maximum size to Perl strings?

on 07.01.2005 13:17:32 by someone

Bryan Feeney wrote:
> I've noticed this problem in a message board script of mine lately. The
> site has a file in the Java Properties format listing various ways of
> blocking users - e.g. usernames, email addresses, or message content.
> Each is a list of regexps.
>
> Here's a sample from the file
>
> # Content is a Perl RE, however the . operator doesn't work, nor
> # can you specify ranges using the {n,m} notation. Unless otherwise
> # specified (using ^ and $) the given text is assumed to be a fragment,
> # not the entirety.
> content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk,
> (?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
> MONEY, \
> THAI\s*MASSAGE, (?i)www.cgispy.com, \
> (?i)<textarea, (?i)textarea>, \
> (?i)Paypal, (?i)\$1.00\s+Bill, \
> DOLLARS, \
> GUARANTE+D?
>
> As you can see, it's a series of comma-delimited regexps, which has been
> broken onto multiple lines using the '\' character.
>
> It worked fine until it grew in size (that's about a third of what I
> have for content). It appears that the bottom of the section isn't being
> read.

You don't have a '\' character at the end of the first line so that is
probably why it is not reading the whole section.
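
To see the effect in isolation, here is a small sketch that mimics the reading loop on two tiny inputs, one with the '\' missing from the first line and one with it present. The helper read_blocks and the sample data are invented for the demonstration, and an in-memory filehandle stands in for BLKS.

```perl
use strict;
use warnings;

# Hypothetical helper mirroring the loop in the question: reads "key: value"
# entries, joining lines while the value ends in a backslash.
sub read_blocks {
    my ($text) = @_;
    open my $fh, '<', \$text or die "cannot open in-memory file: $!";
    my %blocks;
    while ( not eof $fh ) {
        my ( $key, $value ) = split /\s*[=:]\s*/, scalar <$fh>, 2;
        $value = '' unless defined $value;
        while ( $value =~ /\\\s*$/ and not eof $fh ) {
            $value =~ s/\\\s*$//;
            $value .= <$fh>;
        }
        s/^\s+|\s+$//g for $key, $value;    # trim both ends
        $blocks{ lc $key } = $value;
    }
    return \%blocks;
}

# Assumed sample data: the only difference is the '\' ending the first line.
my $broken = "content: aaa, bbb,\n ccc, ddd, \\\n eee\n";
my $fixed  = "content: aaa, bbb, \\\n ccc, ddd, \\\n eee\n";

for my $text ( $broken, $fixed ) {
    my @items = split /\s*,\s*/, read_blocks($text)->{content};
    print scalar(@items), " patterns: @items\n";
}
```

With the backslash missing, the content entry stops after the first line (two patterns here) and the remaining lines are stored under bogus keys; with it present, all five patterns end up in the list.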

> What the code does for reading in sections is easy enough:
>
> my %blocks;
> while (not eof (BLKS))
> { my ($key, $value) = split (/\s*[=:]\s*/, <BLKS>);
> next if ($key =~ /^#/);
> $key =~ s/^\s+//gi;
                  ^^
You are using the /g option which means that you want to match the pattern
everywhere in the string but you have anchored the pattern to the beginning of
the string so the /g option is superfluous. You are using the /i option which
means to ignore the case on alphabetic characters in the pattern but there are
no alphabetic characters in the pattern so the /i option is also superfluous.

> while ($value =~ /\\\s*$/g and not eof (BLKS))
                           ^
The /g option is superfluous.

> { $value =~ s/\\\s*$//g;
                        ^
The /g option is superfluous.

> my $valueContinued = <BLKS>;
> $value .= $valueContinued;
> }
> $value =~ s/\s*$//sgi;
                    ^^^
You are using the /s option which means to include the newline character when
using . to match any character but you are not using the . meta-character in
the pattern so the /s option is superfluous as are the /g and /i options.

> $value =~ s/^\s*//sgi;
                    ^^^
The /s, /g and /i options are superfluous.

> $value =~ s/\r?\n//gi;
                      ^
The /i option is superfluous.

> $blocks{lc ($key)} = $value;
> }
>
> # The content block accepts minor REGEXP patterns. However dots are out!
> $blocks{content} =~ s/\./\\\./gi;
                                 ^
The /i option is superfluous.
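
A quick way to check the points about the options is a throwaway script like the one below. The sample strings are made up; it only shows that the extra options leave the result unchanged, and that /s makes a difference only when the pattern contains a '.'.

```perl
use strict;
use warnings;

my $padded = "  two words  \n";

# With the pattern anchored at the end, /g has nothing further to change,
# and with no letters or '.' in the pattern, /i and /s change nothing either.
( my $with_flags = $padded ) =~ s/\s*$//sgi;
( my $plain      = $padded ) =~ s/\s*$//;
print "same result\n" if $with_flags eq $plain;

# /s matters only when the pattern contains '.', which then also matches "\n".
my $text = "a\nb";
print "without /s: ", ( $text =~ /a.b/  ? "match" : "no match" ), "\n";
print "with /s:    ", ( $text =~ /a.b/s ? "match" : "no match" ), "\n";
```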

> @blockedContent = split (/\s*,\s*/, $blocks{content});

I would write it like this:

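my ( $section, %blocks );
while ( <DATA> ) {
    next if /^#/ or not /\S/;

    # Trim leading and trailing whitespace, including the newline.
    s/\A\s+//;
    s/\s+\z//;

    # A trailing backslash means the entry continues on the next line.
    if ( s/\\$// ) {
        $section .= $_;
        next;
    }

    # Last line of the entry: append it before splitting key from value.
    $section .= $_;
    my ( $key, $value ) = split /\s*[=:]\s*/, $section, 2 or next;
    $section = '';
    $value =~ s'\.'\.'g;
    $blocks{ lc $key } = [ split /\s*,\s*/, $value ];
}

my @blockedContent = @{ $blocks{ content } };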

__DATA__

# Content is a Perl RE, however the . operator doesn't work, nor
# can you specify ranges using the {n,m} notation. Unless otherwise
# specified (using ^ and $) the given text is assumed to be a fragment,
# not the entirety.

content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk, \
(?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
MONEY, \
THAI\s*MASSAGE, (?i)www.cgispy.com, \
(?i)<textarea, (?i)textarea>, \
(?i)Paypal, (?i)\$1.00\s+Bill, \
DOLLARS, \
GUARANTE+D?

__END__

John
--
use Perl;
program
fulfillment

Re: Is there a maximum size to Perl strings?

on 14.01.2005 02:29:27 by Kevin Carlson

My understanding is that strings in Perl can essentially be as large
as all available RAM...
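
A rough way to convince yourself is the sketch below, which builds a string and an array far larger than any block list is likely to be. The sizes are arbitrary and chosen only for illustration.

```perl
use strict;
use warnings;

# Build a 10 MB string and a 100,000-element array; sizes are arbitrary.
my $big   = 'x' x ( 10 * 1024 * 1024 );
my @items = split /,/, join ',', 1 .. 100_000;

printf "string length: %d\n", length $big;
printf "array size:    %d\n", scalar @items;
```

Neither the string nor the array gets truncated, so a block list of a few hundred patterns is nowhere near any limit.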

On Wed, 05 Jan 2005 15:29:00 +0000, Bryan Feeney wrote:

>I've noticed this problem in a message board script of mine lately. The
>site has a file in the Java Properties format listing various ways of
>blocking users - e.g. usernames, email addresses, or message content.
>Each is a list of regexps.
>
>Here's a sample from the file
>
># Content is a Perl RE, however the . operator doesn't work, nor
># can you specify ranges using the {n,m} notation. Unless otherwise
># specified (using ^ and $) the given text is assumed to be a fragment,
># not the entirety.
>content: (?i)cgi.ebay., (?i)www.ebay., (?i)www.dontpayretail.co.uk,
> (?i)www.\S*sex\S*.co, (?i)www.goyemen.com, \
> MONEY, \
> THAI\s*MASSAGE, (?i)www.cgispy.com, \
> (?i)<textarea, (?i)textarea>, \
> (?i)Paypal, (?i)\$1.00\s+Bill, \
> DOLLARS, \
> GUARANTE+D?
>
>As you can see, it's a series of comma-delimited regexps, which has been
>broken onto multiple lines using the '\' character.
>
>It worked fine until it grew in size (that's about a third of what I
>have for content). It appears that the bottom of the section isn't being
>read.
>
>What the code does for reading in sections is easy enough:
>
>my %blocks;
>while (not eof (BLKS))
>{ my ($key, $value) = split (/\s*[=:]\s*/, <BLKS>);
> next if ($key =~ /^#/);
> $key =~ s/^\s+//gi;
> while ($value =~ /\\\s*$/g and not eof (BLKS))
> { $value =~ s/\\\s*$//g;
>
> my $valueContinued = <BLKS>;
> $value .= $valueContinued;
> }
> $value =~ s/\s*$//sgi;
> $value =~ s/^\s*//sgi;
> $value =~ s/\r?\n//gi;
>
> $blocks{lc ($key)} = $value;
>}
>
># The content block accepts minor REGEXP patterns. However dots are out!
>$blocks{content} =~ s/\./\\\./gi;
>
>@blockedContent = split (/\s*,\s*/, $blocks{content});
>
>
>When a message is posted, the code then loops through every regexp in
>the @blockedContent array, applying it in turn, and dying if any of them
>match. Does anyone know why the whole of the block isn't being read in?
>Is it a problem with String size, or the size of an array?
>
>Thanks