Re: RobotRules fails on user-agents with spaces
on 14.10.2005 11:37:14 by gisle
> The problem... if I include a space in my robot's user agent, it
> will fail to recognize robots.txt records targeted to my robot.
Spaces are not allowed in the product name of a user agent. See section
"3.8 Product Tokens" of RFC 2616 [1]. Would it be an option to simply
rename your spider to something that follows the spec?
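
As a rough illustration of what the spec allows, here is a minimal sketch that checks a string against the "product" grammar from RFC 2616 (a token, optionally followed by "/" and a version token). The helper name is_valid_product is hypothetical and not part of libwww-perl, and the third name in the loop is just an example of a space-free alternative.

```perl
use strict;
use warnings;

# Characters allowed in an RFC 2616 "token": any US-ASCII character
# except control characters and separators. The class is kept in a
# single-quoted string so that nothing inside it is interpolated.
my $tchar = q{[!#$%&'*+.^_`|~0-9A-Za-z-]};

# product = token ["/" product-version]; product-version is also a token.
# is_valid_product() is a hypothetical helper, not part of libwww-perl.
sub is_valid_product {
    my ($s) = @_;
    return $s =~ m{\A$tchar+(?:/$tchar+)?\z} ? 1 : 0;
}

for my $ua ('FooBot/1.2',
            'Hispanic Business Inc. Spider/1.0',
            'HispanicBusinessSpider/1.0') {
    printf "%-36s %s\n", $ua,
        is_valid_product($ua) ? 'valid product' : 'not a valid product';
}
```

Under this check the name with spaces is rejected because SP is one of the separators listed in the RFC, while the same name with the spaces removed passes.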
> My robot's user agent:
> Hispanic Business Inc. Spider/1.0
>
> Robots.txt file:
> User-agent: Hispanic Business Inc. Spider
> Disallow:
>
> User-agent: *
> Disallow: /
>
> My robot will incorrectly refuse to spider anything, because
> WWW::RobotRules::agent shortens $self->{'ua'} to "Hispanic".
>
> I propose the attached patch to the RobotRules.pm included in libwww-perl 5.803
I'm not really opposed to this patch if product names with spaces are
actually in common use. Do you have data to suggest that they are?
Regards,
Gisle
[1] http://www.faqs.org/rfcs/rfc2616.html
> --- libwww-perl-5.803/lib/WWW/RobotRules.pm.original 2005-10-13 16:26:27.000000000 -0700
> +++ libwww-perl-5.803/lib/WWW/RobotRules.pm 2005-10-13 16:27:27.000000000 -0700
> @@ -185,8 +185,8 @@
>      #       "FooBot/1.2" => "FooBot"
>      #       "FooBot/1.2 [http://foobot.int; foo@bot.int]" => "FooBot"
>
> -    $name = $1 if $name =~ m/(\S+)/; # get first word
>      $name =~ s!/.*!!; # get rid of version
> +    $name =~ s/\s+$//; # get rid of trailing space
>      unless ($old && $old eq $name) {
>          delete $self->{'loc'}; # all old info is now stale
>          $self->{'ua'} = $name;
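
To make the effect of the patch easier to see, here is a small sketch that lifts the name-shortening step out of the agent method into two standalone subs, one for each side of the diff. The names short_name_original and short_name_patched are hypothetical and exist only for this comparison; the third test string is an assumed example of a name with a comment but no version.

```perl
use strict;
use warnings;

# Shortening as in libwww-perl 5.803: keep the first word, then drop the version.
sub short_name_original {
    my ($name) = @_;
    $name = $1 if $name =~ m/(\S+)/;  # get first word
    $name =~ s!/.*!!;                 # get rid of version
    return $name;
}

# Shortening with the proposed patch: drop the version, then trailing space.
sub short_name_patched {
    my ($name) = @_;
    $name =~ s!/.*!!;                 # get rid of version
    $name =~ s/\s+$//;                # get rid of trailing space
    return $name;
}

for my $ua ('Hispanic Business Inc. Spider/1.0',
            'FooBot/1.2 [http://foobot.int; foo@bot.int]',
            'FooBot [http://foobot.int]') {
    printf "%s\n  original: '%s'\n  patched:  '%s'\n",
        $ua, short_name_original($ua), short_name_patched($ua);
}
```

For the first string the original code yields 'Hispanic' and the patched code yields 'Hispanic Business Inc. Spider'. Both give 'FooBot' for the second. For the third, the original gives 'FooBot' while the patched code gives 'FooBot [http:', because everything from the first slash onward is removed, so a name followed by a comment but no version would probably need extra handling.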