Possible problem with RobotRules?
on 19.12.2004 02:12:28 by j_and_t
I recently came across something that didn't seem right to me. I'm using
"WWW::RobotRules::AnyDBM_File", but the sample script below returns the
same thing.
The URL I tested is:
http://www.midwestoffroad.com/
The robots.txt reads:
User-agent: *
Disallow: admin.php
Disallow: error.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
User-agent: Baidu
Disallow: /
RobotRules returns that the URL is denied by robots.txt, which should not be
the case. A stripped-down script:
use WWW::RobotRules;
use LWP::Simple qw(get);

my $rules = WWW::RobotRules->new('MOMspider/1.0');
my $url = "http://www.midwestoffroad.com/robots.txt";
my $robots_txt = get $url;
$rules->parse($url, $robots_txt) if defined $robots_txt;

if ($rules->allowed('http://www.midwestoffroad.com/')) {
    print qq!Allowed by robots.txt\n\n!;
} else {
    print qq!Denied by robots.txt\n\n!;
}
exit();
Which prints out "Denied by robots.txt".
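For what it's worth, here is a rough sketch of the same check with the
robots.txt pasted inline (trimmed to two Disallow lines) and a blank line
inserted before "User-agent: Baidu"; with the records separated like this,
I would expect it to print "Allowed":

use WWW::RobotRules;

# Same rules as above, but with a blank line separating the two records.
my $rules = WWW::RobotRules->new('MOMspider/1.0');
my $robots_txt = <<'EOT';
User-agent: *
Disallow: /admin/
Disallow: /images/

User-agent: Baidu
Disallow: /
EOT
$rules->parse('http://www.midwestoffroad.com/robots.txt', $robots_txt);

# MOMspider matches only the "*" record, which does not block "/".
print $rules->allowed('http://www.midwestoffroad.com/')
    ? qq!Allowed by robots.txt\n\n!
    : qq!Denied by robots.txt\n\n!;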
Thanks
Re: Possible problem with RobotRules?
on 19.12.2004 03:18:02 by liam
On Sat, 18 Dec 2004, J and T wrote:
> I recently came across something that didn't seem right to me. I'm using
> "WWW::RobotRules::AnyDBM_File", but the sample script below returns the
> same thing.
>
> The URL I tested is:
> http://www.midwestoffroad.com/
>
> The robots.txt reads:
>
> User-agent: *
> Disallow: admin.php
> Disallow: error.php
> Disallow: /admin/
> Disallow: /images/
> Disallow: /includes/
> Disallow: /themes/
> Disallow: /blocks/
> Disallow: /modules/
> Disallow: /language/
> User-agent: Baidu
> Disallow: /
>
> RobotRules returns that the URL is denied by robots.txt, which should not be
> the case.
That's debatable. The robots.txt file is invalid according to the robots
exclusion standard:
The file consists of one or more records separated by one or more
blank lines
[...]
The record starts with one or more User-agent lines, followed by one
or more Disallow lines
So "Disallow: /" is part of the record begun with "User-agent: *". It's
reasonable to ignore the misplaced "User-agent: Baidu" or to treat it as
though it were placed at the start of the record.
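To get what the site presumably intended, the two records would need to be
separated by a blank line, along these lines (keeping the site's own
Disallow lines):

User-agent: *
Disallow: admin.php
Disallow: error.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/

User-agent: Baidu
Disallow: /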
--
Liam Quinn
Re: Possible problem with RobotRules?
on 19.12.2004 09:32:05 by j_and_t
Hi Liam Quinn,
I understand what you're saying, and I would completely agree with you had I
not read something different at w3.org, and had Yahoo! not indexed the example
site below. (Please note the subject: "Possible" problem with RobotRules?)
According to this document:
http://www.w3.org/TR/1998/REC-html40-19980424/appendix/notes.html#h-B.4.1.1
B.4.1 Search robots
The robots.txt file
It states:
Some tips: URI's are case-sensitive, and "/robots.txt" string must be all
lower-case. Blank lines are not permitted.
"Blank lines are not permitted." is stated here and I wouldn't have asked
this question if the W3C was not the one stating this. I personally believe
the W3C is in error, but there are a lot of people who believe the W3C is
God here.
So who do we believe and who is correct? Isn't the W3C the authority on this
stuff? This is why I posted this question as I feel we need some
clarification.
Thanks!
>On Sat, 18 Dec 2004, J and T wrote:
>
> > I recently came across something that didn't seem right to me. I'm using
> > "WWW::RobotRules::AnyDBM_File", but the sample script below returns the
> > same thing.
> >
> > The URL I tested is:
> > http://www.midwestoffroad.com/
> >
> > The robots.txt reads:
> >
> > User-agent: *
> > Disallow: admin.php
> > Disallow: error.php
> > Disallow: /admin/
> > Disallow: /images/
> > Disallow: /includes/
> > Disallow: /themes/
> > Disallow: /blocks/
> > Disallow: /modules/
> > Disallow: /language/
> > User-agent: Baidu
> > Disallow: /
> >
> > RobotRules returns that the URL is denied by robots.txt, which should not
> > be the case.
>
>That's debatable. The robots.txt file is invalid according to the robots
>exclusion standard:
>
> The file consists of one or more records separated by one or more
> blank lines
> [...]
> The record starts with one or more User-agent lines, followed by one
> or more Disallow lines
>
>So "Disallow: /" is part of the record begun with "User-agent: *". It's
>reasonable to ignore the misplaced "User-agent: Baidu" or to treat it as
>though it were placed at the start of the record.
>
>--
>Liam Quinn
>
Re: Possible problem with RobotRules?
on 19.12.2004 19:16:35 by liam
On Sun, 19 Dec 2004, J and T wrote:
> I understand what you're saying, and I would completely agree with you had I
> not read something different at w3.org, and had Yahoo! not indexed the
> example site below. (Please note the subject: "Possible" problem with
> RobotRules?)
>
> According to this document:
>
> http://www.w3.org/TR/1998/REC-html40-19980424/appendix/notes.html#h-B.4.1.1
>
> B.4.1 Search robots
> The robots.txt file
>
> It states:
>
> Some tips: URI's are case-sensitive, and "/robots.txt" string must be all
> lower-case. Blank lines are not permitted.
>
> "Blank lines are not permitted." is stated here and I wouldn't have asked
> this question if the W3C was not the one stating this. I personally believe
> the W3C is in error, but there are a lot of people who believe the W3C is
> God here.
The W3C's error is noted in the errata for the old version of
HTML 4 that you cited, and it's corrected in the latest HTML 4
Recommendation.
http://www.w3.org/MarkUp/html40-updates/REC-html40-19980424-errata.html
The old text read, "Blank lines are not permitted." Blank lines are permitted
in the robots.txt file, just not within a single "record". (Note that the
specification doesn't define "record".)
http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1
Some tips: URI's are case-sensitive, and "/robots.txt" string must be
all lower-case. Blank lines are not permitted within a single record in the
"robots.txt" file.
> Isn't the W3C the authority on this
> stuff?
Not on robots.txt. The W3C's section on robots.txt is buried in an
appendix to the HTML 4 Recommendation and preceded with "The following
notes are informative, not normative."
--
Liam Quinn
mechanize and recorder end up with blank page
on 19.12.2004 19:46:44 by igzebedze
Hi all,
I'm terribly sorry to bug you again with the same problem as a month ago;
the thing is, I don't get it.
I'm trying to script an "add story" action on a PostNuke-managed site.
After login, all I get is a blank page: completely empty, but with the
correct URL. This happens with WWW::Mechanize and with HTTP::Recorder via
HTTP::Proxy, so I guess it's not my code that matters... I am probably
missing a library or something, right?
Anyway, this is my procedure:
$agent->get("http://www.kiberpipa.org/admin");
if ($agent->success) {
print "logging into cyberpipe... \n";
} else {
&report_status("didn't get login screen");
die "post failed:",$agent->response->status_line;
}
$agent->form_number(1);
$agent->field("pass", $cp_pass);
$agent->field("uname", $cp_uname);
$agent->field("url", "http://www.kiberpipa.org/admin/");
$agent->field("module", "NS-User");
$agent->field("op", "login");
$agent->click("prijava");
if ($agent->success) {
&report_status("uspeôº¹na prijava na ".$agent->uri);
} else {
&report_status("didn't login");
die "post failed:",$agent->response->status_line;
}
# this seems all right
$agent->get("http://www.kiberpipa.org/admin/");
if ($agent->success) {
print "going to send story\n";
} else {
&report_status("didn't get admin screen");
die "post failed:",$agent->response->status_line;
}
$agent->get("http://www.kiberpipa.org/admin.php?module=NS-Ad dStory&op=main");
if ($agent->success) {
print "sending story to ".$agent->uri()."\n";
} else {
&report_status("didn't get upload screen");
die "post failed:",$agent->response->status_line;
}
# prints "sending story to.. " and correct url, that is, http://www.kiberpipa.org/admin.php?module=NS-AddStory&op =main
$agent->form_number(3);
$agent->field("subject", $naslov);
$agent->field("hometext", $besedilo);
$agent->field("bodytext", $razsirjeno);
$agent->field("op", "PostAdminStory");
$agent->field("module", "NS-AddStory");
$agent->submit();    # submit the story form
if ($agent->success) {
    &report_status("$naslov successfully added at " . $agent->uri);
} else {
    &report_status("couldn't send the story!");
    die "post failed: ", $agent->response->status_line;
}
# returns: no form numbered 3, which is quite normal, because the page it
# gets is empty.
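In case it helps, this is the rough debugging snippet I would drop in right
after the login click to see whether the blank page is a redirect, a cookie
problem, or genuinely empty content (the /tmp path is just an example):

# Dump what actually came back after $agent->click("prijava").
print "Status: ", $agent->response->status_line, "\n";
print "URI:    ", $agent->uri, "\n";
print "Length: ", length($agent->content), "\n";
print $agent->response->headers->as_string;

# PostNuke logins depend on the session cookie being set.
print $agent->cookie_jar->as_string;

# Save the page so it can be inspected in a browser or editor.
open my $fh, '>', '/tmp/after_login.html' or die "can't write: $!";
print $fh $agent->content;
close $fh;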
Regards, Bostjan