[PATCH] Caching/reusing WWW::RobotRules(::InCore)

on 12.10.2004 08:54:11 by ville.skytta

The current behaviour of LWP::RobotUA when it is passed an existing
WWW::RobotRules::InCore object is counterintuitive to me.

I am of this opinion because of the documentation of $rules in
LWP::RobotUA->new() and WWW::RobotRules->agent(), as well as the
implementation in WWW::RobotRules::AnyDBM_File.

Currently, W::R::InCore always empties the cache when agent() is called,
regardless of whether the agent name has changed. W::R::AnyDBM_File does not
seem to have this problem.

I suggest applying the attached patch to fix this.
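
To make the difference concrete, here is a minimal, self-contained sketch of
the two behaviours. The ToyRules package and its remember()/known() methods
and keep_cache option are hypothetical stand-ins invented for illustration;
only the body of agent() mirrors the logic discussed here, and the real
WWW::RobotRules::InCore class holds more state than this.

```perl
use strict;
use warnings;

# Hypothetical stand-in for WWW::RobotRules::InCore; not the real class.
package ToyRules;

sub new {
    my ($class, %opt) = @_;
    # keep_cache selects the proposed behaviour; the default mimics the current one.
    return bless { loc => {}, keep_cache => $opt{keep_cache} }, $class;
}

# Pretend that robots.txt for $netloc has been fetched and parsed.
sub remember { my ($self, $netloc) = @_; $self->{loc}{$netloc} = 1; return }
sub known    { my ($self, $netloc) = @_; return exists $self->{loc}{$netloc} ? 1 : 0 }

sub agent {
    my ($self, $name) = @_;
    my $old = $self->{ua};
    if ($name) {
        $name = $1 if $name =~ m/(\S+)/;   # first word only
        $name =~ s!/.*!!;                  # drop the version
        my $same = $old && $old eq $name;
        unless ($self->{keep_cache} && $same) {
            delete $self->{loc};           # cached rules are discarded
            $self->{ua} = $name;
        }
    }
    return $old;
}

package main;

# Set the agent, cache one host, set the same agent again, then check the cache.
sub cache_survives {
    my (%opt) = @_;
    my $rules = ToyRules->new(%opt);
    $rules->agent('FooBot/1.2');
    $rules->remember('www.example.com:80');
    $rules->agent('FooBot/1.2');           # same short name again
    return $rules->known('www.example.com:80');
}

print "current:  ", cache_survives(), "\n";                 # prints "current:  0"
print "proposed: ", cache_survives(keep_cache => 1), "\n";  # prints "proposed: 1"
```

With the default (current) logic the cached entry is gone after the second
agent() call; with the proposed check it is kept because the short name did
not change.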

Additionally, I see that InCore and AnyDBM_File use different algorithms for
getting the "short" agent name from the full one, with the AnyDBM_File
one looking "older". Perhaps add a new method/function for this (e.g.
short_agent()) in WWW::RobotRules that could be used in both InCore and
AnyDBM_File?
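
A rough sketch of what such a shared helper might look like follows. The
function is hypothetical (only the name short_agent() comes from the
suggestion above), and it simply reuses the two InCore substitutions visible
in the patch context; which of the two algorithms should actually be shared
is left open.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of WWW::RobotRules.
# Reduces a full agent string to its short name.
sub short_agent {
    my ($name) = @_;
    return unless defined $name;
    $name = $1 if $name =~ m/(\S+)/;   # keep the first whitespace-delimited word
    $name =~ s!/.*!!;                  # strip "/version" and anything after it
    return $name;
}

print short_agent('FooBot'), "\n";                                       # FooBot
print short_agent('FooBot/1.2'), "\n";                                   # FooBot
print short_agent('FooBot/1.2 [http://foobot.int; foo@bot.int]'), "\n";  # FooBot
```

Both backends could then call the same routine before comparing or storing
the agent name.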

While on the robots subject, applying something like the "warning could
be more helpful" change from
http://www.xray.mpe.mpg.de/mailing-lists/libwww-perl/2004-08/msg00024.html would be most welcome.

Attachment: robotrules-agent.patch

Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.30
diff -a -u -r1.30 RobotRules.pm
--- lib/WWW/RobotRules.pm 9 Apr 2004 15:09:14 -0000 1.30
+++ lib/WWW/RobotRules.pm 12 Oct 2004 06:39:34 -0000
@@ -185,10 +185,12 @@
         #       "FooBot/1.2"                                  => "FooBot"
         #       "FooBot/1.2 [http://foobot.int; foo@bot.int]" => "FooBot"
 
-        delete $self->{'loc'};   # all old info is now stale
         $name = $1 if $name =~ m/(\S+)/; # get first word
         $name =~ s!/.*!!;  # get rid of version
-        $self->{'ua'}=$name;
+        unless ($old && $old eq $name) {
+            delete $self->{'loc'};   # all old info is now stale
+            $self->{'ua'} = $name;
+        }
     }
     $old;
 }


Re: [PATCH] Caching/reusing WWW::RobotRules(::InCore)

on 12.11.2004 17:15:15 by gisle

Ville Skyttä writes:

> The current behaviour of LWP::RobotUA when it is passed an existing
> WWW::RobotRules::InCore object is counterintuitive to me.
>
> I am of this opinion because of the documentation of $rules in
> LWP::RobotUA->new() and WWW::RobotRules->agent(), as well as the
> implementation in WWW::RobotRules::AnyDBM_File.
>
> Currently, W::R::InCore always empties the cache when agent() is called,
> regardless of whether the agent name has changed. W::R::AnyDBM_File does not
> seem to have this problem.
>
> I suggest applying the attached patch to fix this.

Applied. Will be in 5.801.

Regards,
Gisle


> Index: lib/WWW/RobotRules.pm
> ===================================================================
> RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
> retrieving revision 1.30
> diff -a -u -r1.30 RobotRules.pm
> --- lib/WWW/RobotRules.pm 9 Apr 2004 15:09:14 -0000 1.30
> +++ lib/WWW/RobotRules.pm 12 Oct 2004 06:39:34 -0000
> @@ -185,10 +185,12 @@
>          #       "FooBot/1.2"                                  => "FooBot"
>          #       "FooBot/1.2 [http://foobot.int; foo@bot.int]" => "FooBot"
>
> -        delete $self->{'loc'};   # all old info is now stale
>          $name = $1 if $name =~ m/(\S+)/; # get first word
>          $name =~ s!/.*!!;  # get rid of version
> -        $self->{'ua'}=$name;
> +        unless ($old && $old eq $name) {
> +            delete $self->{'loc'};   # all old info is now stale
> +            $self->{'ua'} = $name;
> +        }
>      }
>      $old;
>  }