Possible problem with RobotRules?

on 19.12.2004 02:12:28 by j_and_t

I recently came across something that didn't seem right to me. I'm using
"WWW::RobotRules::AnyDBM_File", but the sample script below returns the
same thing.

The URL I tested is:
http://www.midwestoffroad.com/

The robots.txt reads:

User-agent: *
Disallow: admin.php
Disallow: error.php
Disallow: /admin/
Disallow: /images/
Disallow: /includes/
Disallow: /themes/
Disallow: /blocks/
Disallow: /modules/
Disallow: /language/
User-agent: Baidu
Disallow: /

RobotRules reports that the URL is denied by robots.txt, which should not be
the case. A stripped-down script:

use WWW::RobotRules;
use LWP::Simple qw(get);

my $rules = WWW::RobotRules->new('MOMspider/1.0');

my $url = "http://www.midwestoffroad.com/robots.txt";
my $robots_txt = get $url;
$rules->parse($url, $robots_txt) if defined $robots_txt;

if ($rules->allowed('http://www.midwestoffroad.com/')) {
    print qq!Allowed by robots.txt\n\n!;
} else {
    print qq!Denied by robots.txt\n\n!;
}
exit();

This prints "Denied by robots.txt".

Thanks

Re: Possible problem with RobotRules?

on 19.12.2004 03:18:02 by liam

On Sat, 18 Dec 2004, J and T wrote:

> I recently came across something that didn't seem right to me. I'm using
> "WWW::RobotRules::AnyDBM_File", but the sample script below returns the
> same thing.
>
> The URL I tested is:
> http://www.midwestoffroad.com/
>
> The robots.txt reads:
>
> User-agent: *
> Disallow: admin.php
> Disallow: error.php
> Disallow: /admin/
> Disallow: /images/
> Disallow: /includes/
> Disallow: /themes/
> Disallow: /blocks/
> Disallow: /modules/
> Disallow: /language/
> User-agent: Baidu
> Disallow: /
>
> RobotRules reports that the URL is denied by robots.txt, which should not
> be the case.

That's debatable. The robots.txt file is invalid according to the robots
exclusion standard <http://www.robotstxt.org/wc/norobots.html>:

The file consists of one or more records separated by one or more
blank lines
[...]
The record starts with one or more User-agent lines, followed by one
or more Disallow lines

So "Disallow: /" is part of the record begun with "User-agent: *". It's
reasonable to ignore the misplaced "User-agent: Baidu" or to treat it as
though it were placed at the start of the record.
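
A quick way to see what a parser makes of this is to feed WWW::RobotRules
the rules with and without the blank-line record separator. A minimal
sketch (rules abbreviated from the file above; how the unseparated case
parses depends on your version of WWW::RobotRules):

use WWW::RobotRules;

# As served: no blank line before "User-agent: Baidu", so a parser
# following the standard may fold "Disallow: /" into the "*" record.
my $as_served = <<'EOT';
User-agent: *
Disallow: /admin/
User-agent: Baidu
Disallow: /
EOT

# The same rules with the blank line the standard requires between records.
my $separated = <<'EOT';
User-agent: *
Disallow: /admin/

User-agent: Baidu
Disallow: /
EOT

for my $txt ($as_served, $separated) {
    my $rules = WWW::RobotRules->new('MOMspider/1.0');
    $rules->parse('http://www.midwestoffroad.com/robots.txt', $txt);
    print $rules->allowed('http://www.midwestoffroad.com/')
        ? "Allowed\n" : "Denied\n";
}

With the behaviour you're seeing, the first parse should print "Denied"
and the second "Allowed".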

--
Liam Quinn

Re: Possible problem with RobotRules?

on 19.12.2004 09:32:05 by j_and_t

Hi Liam Quinn,

I understand what you're saying, and I would completely agree with you had I
not read something different at w3.org, and had Yahoo! not indexed the example
site below. (Please note the word "Possible" in the subject line.)

According to this document:

http://www.w3.org/TR/1998/REC-html40-19980424/appendix/notes.html#h-B.4.1.1

B.4.1 Search robots
The robots.txt file

It states:

Some tips: URI's are case-sensitive, and "/robots.txt" string must be all
lower-case. Blank lines are not permitted.

"Blank lines are not permitted." is stated here and I wouldn't have asked
this question if the W3C was not the one stating this. I personally believe
the W3C is in error, but there are a lot of people who believe the W3C is
God here.

So who do we believe, and who is correct? Isn't the W3C the authority on this
stuff? That's why I posted this question; I feel we need some clarification.

Thanks!

Re: Possible problem with RobotRules?

on 19.12.2004 19:16:35 by liam

On Sun, 19 Dec 2004, J and T wrote:

> I understand what you're saying, and I would completely agree with you had
> I not read something different at w3.org, and had Yahoo! not indexed the
> example site below. (Please note the word "Possible" in the subject line.)
>
> According to this document:
>
> http://www.w3.org/TR/1998/REC-html40-19980424/appendix/notes.html#h-B.4.1.1
>
> B.4.1 Search robots
> The robots.txt file
>
> It states:
>
> Some tips: URI's are case-sensitive, and "/robots.txt" string must be all
> lower-case. Blank lines are not permitted.
>
> "Blank lines are not permitted." is stated here and I wouldn't have asked
> this question if the W3C was not the one stating this. I personally believe
> the W3C is in error, but there are a lot of people who believe the W3C is
> God here.

The W3C's error is noted in the errata for the old version of
HTML 4 that you cited, and it's corrected in the latest HTML 4
Recommendation.

http://www.w3.org/MarkUp/html40-updates/REC-html40-19980424-errata.html

The specification reads, "Blank lines are not permitted." Blank lines
are permitted in the robots.txt file, just not within a single "record".
Note that the specification doesn't define record.

http://www.w3.org/TR/html4/appendix/notes.html#h-B.4.1.1

Some tips: URI's are case-sensitive, and "/robots.txt" string must be
all lower-case. Blank lines are not permitted within a single record in the
"robots.txt" file.

> Isn't the W3C the authority on this
> stuff?

Not on robots.txt. The W3C's section on robots.txt is buried in an
appendix to the HTML 4 Recommendation and preceded with "The following
notes are informative, not normative."

--
Liam Quinn

mechanize and recorder end up with blank page

on 19.12.2004 19:46:44 by igzebedze

hi all

i'm terribly sorry to bug you again with the same problem as a month ago. the thing is, i don't get it.

i'm trying to script "add story" on a postnuke-managed site.

after login all i get is a blank page: completely empty, but with the correct URL. this happens with mechanize and with HTTP::Recorder via HTTP::Proxy, so i guess it's not my code... i am probably missing a library or something, right?

anyway, this is my procedure:


use WWW::Mechanize;   # set-up assumed; the original snippet starts at the get below

my $agent = WWW::Mechanize->new();

$agent->get("http://www.kiberpipa.org/admin");
if ($agent->success) {
    print "logging into cyberpipe... \n";
} else {
    &report_status("didn't get login screen");
    die "post failed: ", $agent->response->status_line;
}

# fill the login form; "prijava" is the login submit button
$agent->form_number(1);
$agent->field("pass", $cp_pass);
$agent->field("uname", $cp_uname);
$agent->field("url", "http://www.kiberpipa.org/admin/");
$agent->field("module", "NS-User");
$agent->field("op", "login");
$agent->click("prijava");

if ($agent->success) {
    &report_status("successfully logged in to " . $agent->uri);
} else {
    &report_status("didn't login");
    die "post failed: ", $agent->response->status_line;
}

# this seems all right
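
one thing i can check at this point is whether the login actually set a
session cookie (mechanize keeps a cookie jar by default); a quick sketch:

# an empty jar here would explain why later admin pages come back blank
print "cookies after login:\n", $agent->cookie_jar->as_string, "\n";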

$agent->get("http://www.kiberpipa.org/admin/");
if ($agent->success) {
    print "going to send story\n";
} else {
    &report_status("didn't get admin screen");
    die "post failed: ", $agent->response->status_line;
}

$agent->get("http://www.kiberpipa.org/admin.php?module=NS-AddStory&op=main");
if ($agent->success) {
    print "sending story to " . $agent->uri() . "\n";
} else {
    &report_status("didn't get upload screen");
    die "post failed: ", $agent->response->status_line;
}

# prints "sending story to ..." and the correct url, that is,
# http://www.kiberpipa.org/admin.php?module=NS-AddStory&op=main


$agent->form_number(3);
$agent->field("subject", $naslov);
$agent->field("hometext", $besedilo);
$agent->field("bodytext", $razsirjeno);
$agent->field("op", "PostAdminStory");
$agent->field("module", "NS-AddStory");
$agent->submit();   # without a submit/click here, nothing would ever be posted

if ($agent->success) {
    &report_status("$naslov successfully added at " . $agent->uri);
} else {
    &report_status("couldn't send the story!");
    die "post failed: ", $agent->response->status_line;
}

# returns: no form numbered 3, which is quite normal, because the page it
# gets is empty.
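
one way to see what that "empty" page really is (same $agent as above):

# dump what the server actually returned, to tell an empty body apart
# from a redirect or a server-side error
print "status: ", $agent->response->status_line, "\n";
print "bytes:  ", length($agent->content), "\n";
my @forms = $agent->forms;
print "forms:  ", scalar(@forms), "\n";
for (my $r = $agent->response->previous; $r; $r = $r->previous) {
    print "came via: ", $r->request->uri, "\n";
}
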
regards, bostjan