mod_perl regex conundrum
am 15.07.2004 15:24:37 von Simon Miner------_=_NextPart_001_01C46A6E.FE9D9FCA
Content-Type: text/plain
Hello,
My company has just upgraded our Apache/mod_perl and supporting software
versions, and we are now seeing a strange phenomenon. A piece of code which
has always worked speedily up to this point, now occasionally, but
predictably, takes 10 times longer to execute. The code fragment in
question is this.
-----
my $quoted_link = qr{
((href|action)\s*=\s*) # matching a link in an href or action attribute
($1)
(["']) # starting link delimiter (' or ") ($3)
(
(https?://\w+?\.chr(istian)?book\.com)? # optional domain name (old
or new) ($4)
/ # / separating domain from path
(?![^\3\#\?>]+?\.(exe|sit|pdf|ra?m|mp3|wax|css|js)) # skip non-HTML
file types
[^\3\#\?>]+? # anything that can be in a path (no
anchors, query strings, etc.
) # everything before the ticket
(
[\?\#] # start of a query string or anchor
reference
[^\3]+? # query string or anchor reference
)? # everything after the ticket ($8)
\3 # ending link delimiter
}six;
....
$$text_ref =~ s{$quoted_link}{$1$3$4/$ticket$8$3}gx;
---
The purpose of this code is to tag each URL on a web page with a session ID.
The $text_ref variable is a scalar reference to the page content, and the
$ticket variable contains the ID.
In an effort to debug why the slow-down was occurring, I replaced the
substitution in the snippet above with the following.
---
use Time::HiRes qw(time);
my $start = time;
my $elapsed;
my $count = 0;
print "
Text length = " . length( $$text_ref ) . "
";while ( $$text_ref =~ /$quoted_link/g ) {
$count += 1;
$elapsed = time - $start;
print "
link_filter2: $count, $& $elapsed
";}
---
This code prints out the cumulative time the regex is taking to match as it
encounters links on a web page. The problem continued to show itself with
this change, but I still don't know how to pinpoint why it's happening.
Here are some other noteworthy items pertaining to this problem.
* This problem only occurs under Apache/mod_perl. I saved the HTML of a
normal page and a slow web page to files, and copied the code above into a
small command line script. When I executed the script against the files,
the slow page took about the right
amount of time to process relative to the normal page (as opposed to 10
times as long). (BTW: the slow page is actually smaller than the normal
page, so it should be processed more quickly, which it did in the command
line script.)
* Our infrastructure upgrade made involved the following software updates.
-- Before upgrade: Apache 1.3.29, mod_perl 1.27, Perl 5.6.1, Solaris 5.8
-- After upgrade: Apache 1.3.31, mod_perl 1.29, Perl 5.8.3, Solaris 5.9
So, I am wondering...
* Can anyone suggest reasons why this code might be executing so slowly?
* Can anyone suggest potential improvements to the regex so it will execute
faster?
* Does anyone know of changes between the software versions mentioned above
that could lead to this behavior?
Thanks in advance for your help.
- Simon
-----------------------------------------------
Simon Miner
Applications Engineer
Christianbook.com
E:
T: (978) 573-2233
F: (978) 573-8233
-----------------------------------------------
------_=_NextPart_001_01C46A6E.FE9D9FCA
Content-Type: text/html
Content-Transfer-Encoding: quoted-printable
charset=3Dus-ascii">
style=3D'font-size:10.0pt;
font-family:Arial'>Hello,
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>My company has just upgraded our Apache/mod_perl and
supporting software versions, and we are now seeing a strange =
phenomenon.
A piece of code which has always worked speedily up to this point, now
occasionally, but predictably, takes 10 times longer to execute. =
The code
fragment in question is this.
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>-----
style=3D'font-size:10.0pt;
font-family:Courier'>my $quoted_link =3D qr{
style=3D'font-size:10.0pt;
font-family:Courier'> ((href|action)\s*=3D\s*) =
# matching
a link in an href or action attribute ($1)
style=3D'font-size:10.0pt;
font-family:Courier'>
(["']) &=
nbsp;
# starting link delimiter (' or ") ($3)
style=3D'font-size:10.0pt;
font-family:Courier'> (
style=3D'font-size:10.0pt;
font-family:Courier'> =
(https?://\w+?\.chr(istian)?book\.com)?
# optional domain name (old or new) ($4)
style=3D'font-size:10.0pt;
font-family:Courier'>
/  =
;  =
;
# / separating domain from path
style=3D'font-size:10.0pt;
font-family:Courier'> =
(?![^\3\#\?>]+?\.(exe|sit|pdf|ra?m|mp3|wax|css|js))
# skip non-HTML file types
style=3D'font-size:10.0pt;
font-family:Courier'>
[^\3\#\?>]+? &nb=
sp;
# anything that can be in a path (no anchors, query strings, =
etc.
style=3D'font-size:10.0pt;
font-family:Courier'>
)  =
;
# everything before the ticket
style=3D'font-size:10.0pt;
font-family:Courier'> (
style=3D'font-size:10.0pt;
font-family:Courier'>
[\?\#] =
=
# start of a query string or anchor reference
style=3D'font-size:10.0pt;
font-family:Courier'>
[^\3]+?  =
;
# query string or anchor reference
style=3D'font-size:10.0pt;
font-family:Courier'>
)? &nbs=
p;
# everything after the ticket ($8)
style=3D'font-size:10.0pt;
font-family:Courier'>
\3
=
#
ending link delimiter
style=3D'font-size:10.0pt;
font-family:Courier'>}six;
style=3D'font-size:10.0pt;
font-family:Courier'>...
style=3D'font-size:10.0pt;
font-family:Courier'>$$text_ref =3D~ =
s{$quoted_link}{$1$3$4/$ticket$8$3}gx;
style=3D'font-size:10.0pt;
font-family:Arial'>---
style=3D'font-size:10.0pt;
font-family:Arial'>The purpose of this code is to tag each URL on a web =
page
with a session ID. The $text_ref variable is a scalar reference =
to the
page content, and the $ticket variable contains the =
ID.
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>In an effort to debug why the slow-down was =
occurring, I
replaced the substitution in the snippet above with the =
following.
style=3D'font-size:10.0pt;
font-family:Arial'>---
style=3D'font-size:10.0pt;
font-family:Courier'>use Time::HiRes qw(time);
style=3D'font-size:10.0pt;
font-family:Courier'>my $start =3D time;
style=3D'font-size:10.0pt;
font-family:Courier'>my $elapsed;
style=3D'font-size:10.0pt;
font-family:Courier'>my $count =3D 0;
style=3D'font-size:10.0pt;
font-family:Courier'>print "<p>Text length =3D " . =
length(
$$text_ref ) . "</p>";
style=3D'font-size:10.0pt;
font-family:Courier'>while ( $$text_ref =3D~ /$quoted_link/g ) =
{
style=3D'font-size:10.0pt;
font-family:Courier'> $count +=3D 1;
style=3D'font-size:10.0pt;
font-family:Courier'> $elapsed =3D time - =
$start;
style=3D'font-size:10.0pt;
font-family:Courier'> print "<p>link_filter2: $count, =
$&
$elapsed</p>";
style=3D'font-size:10.0pt;
font-family:Courier'>}
style=3D'font-size:10.0pt;
font-family:Arial'>---
style=3D'font-size:10.0pt;
font-family:Arial'>This code prints out the cumulative time the regex =
is taking
to match as it encounters links on a web page. The problem =
continued to
show itself with this change, but I still don't know how to pinpoint =
why
it's happening.
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>Here are some other noteworthy items pertaining to =
this
problem.
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>* This problem only occurs under =
Apache/mod_perl. I
saved the HTML of a normal page and a slow web page to files, and =
copied the
code above into a small command line script. When I executed the =
script
against the files, the slow page took about the right =
style=3D'font-size:10.0pt;
font-family:Arial'>amount of time to process relative to the normal =
page (as
opposed to 10 times as long). (BTW: the slow page is actually =
smaller
than the normal page, so it should be processed more quickly, which it =
did in
the command line script.)
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>* Our infrastructure upgrade made involved the =
following
software updates.
style=3D'font-size:10.0pt;
font-family:Arial'> -- Before upgrade: Apache 1.3.29, =
mod_perl
1.27, Perl 5.6.1, Solaris 5.8
style=3D'font-size:10.0pt;
font-family:Arial'> -- After upgrade: Apache 1.3.31, =
mod_perl 1.29,
Perl 5.8.3, Solaris 5.9
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>So, I am wondering...
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>* Can anyone suggest reasons why this code might be
executing so slowly?
style=3D'font-size:10.0pt;
font-family:Arial'>* Can anyone suggest potential improvements to the =
regex so
it will execute faster?
style=3D'font-size:10.0pt;
font-family:Arial'>* Does anyone know of changes between the software =
versions
mentioned above that could lead to this behavior?
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>Thanks in advance for your help.
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:10.0pt;
font-family:Arial'>- Simon
style=3D'font-size:10.0pt;
font-family:Arial'>
style=3D'font-size:7.5pt;
font-family:Verdana'>--------------------------------------- --------
an>
style=3D'font-size:7.5pt;
font-family:Verdana'>Simon Miner
style=3D'font-size:7.5pt;
font-family:Verdana'>Applications Engineer
style=3D'font-size:7.5pt;
font-family:Verdana'>Christianbook.com
style=3D'font-size:
7.5pt;font-family:Verdana'>E
face=3DVerdana>: =
href=3D"mailto:sminer@christianbook.com">
face=3DVerdana>
style=3D'font-size:7.5pt;font-family:Verdana'>sminer@christi anbook.com=
span>
style=3D'font-size:
7.5pt;font-family:Verdana'>T
face=3DVerdana>: =
(978) 573-2233
style=3D'font-size:
7.5pt;font-family:Verdana'>F
face=3DVerdana>: =
(978) 573-8233
style=3D'font-size:7.5pt;
font-family:Verdana'>--------------------------------------- --------
an>
style=3D'font-size:
12.0pt'>
------_=_NextPart_001_01C46A6E.FE9D9FCA--