Patch for WWW::RobotsRules.pm
on 16.09.2004 19:25:05 by moseley
I've got a spider that uses LWP::RobotUA (WWW::RobotRules) and a few
users of the spider have complained that the warning messages were
not obvious enough. I guess I can agree because when they are
spidering multiple hosts the message doesn't tell them which robots.txt
had a problem.
So maybe something like:
--- RobotRules.pm.old 2004-04-09 08:37:08.000000000 -0700
+++ RobotRules.pm 2004-09-16 09:46:03.000000000 -0700
@@ -70,7 +70,7 @@
     }
     elsif (/^\s*Disallow\s*:\s*(.*)/i) {
         unless (defined $ua) {
-            warn "RobotRules: Disallow without preceding User-agent\n";
+            warn "RobotRules: [$robot_txt_uri] Disallow without preceding User-agent\n";
             $is_anon = 1; # assume that User-agent: * was intended
         }
         my $disallow = $1;
@@ -97,7 +97,7 @@
         }
     }
     else {
-        warn "RobotRules: Unexpected line: $_\n";
+        warn "RobotRules: [$robot_txt_uri] Unexpected line: $_\n";
     }
 }
--
Bill Moseley
moseley@hank.org
Re: Patch for WWW::RobotsRules.pm
on 12.11.2004 17:21:30 by gisle
Bill Moseley writes:
> I've got a spider that uses LWP::RobotUA (WWW::RobotRules) and a few
> users of the spider have complained that the warning messages were
> not obvious enough. I guess I can agree because when they are
> spidering multiple hosts the message doesn't tell them which robots.txt
> had a problem.
The patch I've now applied is this one:
Index: lib/WWW/RobotRules.pm
===================================================================
RCS file: /cvsroot/libwww-perl/lwp5/lib/WWW/RobotRules.pm,v
retrieving revision 1.31
retrieving revision 1.32
diff -u -p -u -r1.31 -r1.32
--- lib/WWW/RobotRules.pm 12 Nov 2004 16:05:09 -0000 1.31
+++ lib/WWW/RobotRules.pm 12 Nov 2004 16:14:25 -0000 1.32
@@ -1,8 +1,8 @@
package WWW::RobotRules;
-# $Id: RobotRules.pm,v 1.31 2004/11/12 16:05:09 gisle Exp $
+# $Id: RobotRules.pm,v 1.32 2004/11/12 16:14:25 gisle Exp $
-$VERSION = sprintf("%d.%02d", q$Revision: 1.31 $ =~ /(\d+)\.(\d+)/);
+$VERSION = sprintf("%d.%02d", q$Revision: 1.32 $ =~ /(\d+)\.(\d+)/);
sub Version { $VERSION; }
use strict;
@@ -70,7 +70,7 @@ sub parse {
     }
     elsif (/^\s*Disallow\s*:\s*(.*)/i) {
         unless (defined $ua) {
-            warn "RobotRules: Disallow without preceding User-agent\n";
+            warn "RobotRules <$robot_txt_uri>: Disallow without preceding User-agent\n" if $^W;
             $is_anon = 1; # assume that User-agent: * was intended
         }
         my $disallow = $1;
@@ -97,7 +97,7 @@ sub parse {
         }
     }
     else {
-        warn "RobotRules: Unexpected line: $_\n";
+        warn "RobotRules <$robot_txt_uri>: Unexpected line: $_\n" if $^W;
     }
 }
> So maybe something like:
>
> --- RobotRules.pm.old 2004-04-09 08:37:08.000000000 -0700
> +++ RobotRules.pm 2004-09-16 09:46:03.000000000 -0700
> @@ -70,7 +70,7 @@
>      }
>      elsif (/^\s*Disallow\s*:\s*(.*)/i) {
>          unless (defined $ua) {
> -            warn "RobotRules: Disallow without preceding User-agent\n";
> +            warn "RobotRules: [$robot_txt_uri] Disallow without preceding User-agent\n";
>              $is_anon = 1; # assume that User-agent: * was intended
>          }
>          my $disallow = $1;
> @@ -97,7 +97,7 @@
>          }
>      }
>      else {
> -        warn "RobotRules: Unexpected line: $_\n";
> +        warn "RobotRules: [$robot_txt_uri] Unexpected line: $_\n";
>      }
>  }
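
For anyone reading this thread later: the applied patch does two things. It puts the robots.txt URI into each message, and it only warns when $^W is true (i.e. under perl -w). Below is a rough, self-contained sketch of how that behaves from the caller's side. Note that check_robots_txt is a hypothetical stand-in written for this illustration, not the WWW::RobotRules code, and the example.* URLs are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the warning logic touched by the patch.
# This is NOT the WWW::RobotRules implementation, only the same
# "warn ... if $^W" pattern with the URI included in the message.
sub check_robots_txt {
    my ($robot_txt_uri, $txt) = @_;
    my $seen_ua = 0;
    for (split /\r?\n/, $txt) {
        s/\s*\#.*//;        # strip comments
        next if /^\s*$/;    # skip blank lines
        if (/^\s*User-Agent\s*:/i) {
            $seen_ua = 1;
        }
        elsif (/^\s*Disallow\s*:/i) {
            warn "RobotRules <$robot_txt_uri>: Disallow without preceding User-agent\n"
                if $^W && !$seen_ua;
        }
        else {
            warn "RobotRules <$robot_txt_uri>: Unexpected line: $_\n" if $^W;
        }
    }
    return;
}

# Collect warnings in an array instead of letting them go to STDERR.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, @_ };

my $txt = "Disallow: /private\nBogus line\n";

{
    local $^W = 1;    # same effect as running under perl -w
    check_robots_txt("http://a.example/robots.txt", $txt);
}
{
    local $^W = 0;    # warnings are suppressed for this host
    check_robots_txt("http://b.example/robots.txt", $txt);
}

print @warnings;
```

With several hosts being spidered, each collected message now names the robots.txt it came from, and a spider that does not want the noise can simply leave $^W off.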