possible ActivePerl substitution optimisation issue
am 12.06.2007 20:23:29 von vsimionescuHello
Sorry if I'm not at the right place with what I'm about to say, I've
searched for a better one and couldn't find it. If so, please tell me
where I should post this. (I also put it at ActiveState of course but
I thought on GGroups there also might be some interest.)
I think I might have found a problem with the way ActiveState Perl
substitutions are optimized (I use v5.8.8).
I have a big string (around 5M) and I perform a simple substitution of
a fixed string like "abcd" with another fixed string. The substituted
string appears about 20 thousand times in the big string.
Important: the substitution string is bigger than the substituted one.
If not, everything works well i.e. very fast. But if it's bigger, as I
said, suddenly the substitution becomes much slower, maybe 50 times or
so.
Thing is, I was able to write a function myself, *in Perl*, that does
the same substitution about 20 times faster. I'm reasonably sure there
are no errors since I've compared the output of both substitutions and
they are identical.
I've been doing some testing and it appears the problem occurs only
under low memory conditions. I have 256M RAM but with virtual memory I
was using around 380 when I tested. If I do the test with plenty of
RAM both substitutions take about the same time to complete.
Now what I believe may be the problem (I might be wrong): looks to me
that Perl does the job the straightforward simplistic way, that is it
just builds the result string as it goes, re-allocating and copying it
all if it exceeds its memory. This way if the replacement string is
smaller than the original one everything is fine, since it initially
gets a memory chunk of about the same size as the source string (maybe
a little bigger) but it doesn't ever fill it since the result is
smaller. When the result is bigger, it will happen many times that it
exceeds its memory so it must be re-allocated and re-copied. Which
when the file is very large and things happen to a great extent in
virtual memory looks like it can slow down the program significantly.
What I've done basically with my substitution is that I've initially
split the source string into 2 lists, one with the occurrences of the
initial string (it must not always be fixed, I use a function that's a
little bit more general) and the other one with the in-between sub-
strings. Then I create a result list that has the substitution string
and the in-between sub-strings, which I then join into the result
string. As I said it is about 20 times faster. Of course, if something
similar was done at a lower level in the Perl substitution itself it
would be probably many more times faster.
Since I know of no better way I'll just insert my Perl module text
below. If I call it with the /vsim switch it performs my own
substitution, otherwise the standard Perl one. My comments are in
romanian, sorry. If anybody is interested and has a problem with this
I'll translate them to english, but I believe everything should be
pretty clear like it is.
(The local path must be also provided as an argument.)
Regards
V. Simionescu
(my $localpath = undef, my $vsim = undef);
for (my $k = 0; $k <= $#ARGV; $k++)
{
#print "arg. $k: $ARGV[$k]\n";
if ($ARGV[$k] =~ m"^ localpath=(.*) "xs)
{
$localpath = $1;
$localpath =~ s"(\\ | \s)+ $""xgsi; # sterg evt. \ finale
$ARGV[$k] = undef;
}
if ($ARGV[$k] =~ m"^ /vsim $"xs)
{
$vsim = 1;
$ARGV[$k] = undef;
}
}
use warnings;
use strict;
my $fileName = $localpath. "\\testSubstPerl.xml";
#print "file name: $fileName\n";
my $string = ReadFile ($fileName) or die "ReadFile failed for
$fileName";
if ($vsim)
{
Replace (\$string, '
Output ($localpath. "\\testSubstPerl_result_vsim.xml", $string);
}
else
{
$string =~ s"
Output ($localpath. "\\testSubstPerl_result.xml", $string);
}
# ------------------------------------------------------------ --
sub ReadFile
{
my $fileName = $_[0];
my $enc = $_[1];
open my $inFile, $fileName or return;
if (defined $enc) {
binmode $inFile, $enc; }
$/ = undef;
my $rez = <$inFile>;
close $inFile;
return $rez;
}
sub Output
{
my $fileName = $_[0];
my $refContent = \$_[1];
my $enc = $_[2];
if ($fileName !~ m"^>"xs) {
$fileName = ">$fileName"; }
open my $outFile, $fileName or return;
if (defined $enc) {
binmode $outFile, $enc; }
print $outFile $$refContent;
close $outFile;
return 1;
}
# ------------------------------------------------------------ --
sub SplitByPattern
{
my $stringRef = \$_[0];
my $pattern = $_[1];
my $refPat = $_[2];
my $refRest = $_[3];
my @undefined;
@$refPat = @undefined;
@$refRest = @undefined;
my $posAnt = 0;
while ($$stringRef =~ m"($pattern)"xgs)
{
$$refPat[@$refPat] = $1;
$$refRest[@$refRest] = substr ($$stringRef, $posAnt,
pos ($$stringRef) - length ($1)
- $posAnt);
$posAnt = pos ($$stringRef);
}
$$refRest[@$refRest] = substr ($$stringRef, $posAnt, length ($
$stringRef) - $posAnt);
# ptr. a reduce memoria folosita, si ptr. a preveni ca
programatorul sa
# prelucreze in paralel si string-ul splituit (intre
SplitByPattern si
# PutBackTogether), ceea ce e clar ca nu e bine sa faca
$$stringRef = "<
}
sub PutBackTogether
{
my $stringRef = $_[0];
my $refPat = $_[1];
my $refRest = $_[2];
die unless int (@$refRest) == int (@$refPat) + 1;
my @result;
for (my $i = 0; $i < int (@$refPat); $i++)
{
$result[2 * $i + 1] = $$refPat[$i];
}
for (my $i = 0; $i < int (@$refRest); $i++)
{
$result[2 * $i] = $$refRest[$i];
}
$$stringRef = join ("", @result);
}
# ------------------------------------------------------------ --
# intoarce nr. de inlocuiri
# apel: $nrMatches = Replace (\$string, $searchString,
$replaceString);
#
sub Replace
{
my $argIndex = 0;
my $stringRef = $_[$argIndex++];
my $s = $_[$argIndex++];
my $r = $_[$argIndex++];
die unless defined $string;
die unless defined $s;
die unless defined $r;
(my @patterns, my @rest);
SplitByPattern ($$stringRef, $s, \@patterns, \@rest);
my $nrMatches = int (@patterns);
foreach my $p (@patterns)
{
$p = $r;
}
PutBackTogether ($stringRef, \@patterns, \@rest);
return $nrMatches;
}
# ------------------------------------------------------------ --