Help with Mechanize

on 15.01.2007 19:17:22 by joneswr

Hello,

I could use some help with Mechanize, and Andy Lester recommended I post
an email on the libwww mailing list. I am trying to do what should be a
simple scrape of the US Patent and Trademark Office website for the
bibliographic info that they post for all patents. Unfortunately, I keep
getting redirected to a page that says

"We are unable to display the requested information. Please note that
all requests must be made using this form."

Do you think I am out of luck, or are there some things I can try? The
form that is used to request the patent info does have the following
JavaScript line:



Basically, I am wondering how the website could know that I am using
Mechanize and not Internet Explorer to enter the info into the fields
and click "submit."


Here is my Perl code. Thanks.


#!/usr/local/bin/perl -w
print "Content-type: text/html\n\n";
use strict;
use WWW::Mechanize;
use Crypt::SSLeay;
my $url = "https://ramps.uspto.gov/eram/";
my $maintenancepatent = "5771669";
my $maintenanceapp = "08672157";
my $outfile = "out.htm";
my $mech = WWW::Mechanize->new( autocheck => 1);
$mech->proxy(['https'], '');
$mech->get($url);
$mech->follow_link(text => "Pay or Look up Patent Maintenance Fees", n => 1);
$mech->form_name('mfInputForm');
$mech->field(patentNum => "$maintenancepatent");
$mech->field(applicationNum => "$maintenanceapp");
$mech->add_header( Referer => $url );
$mech->click_button (number => 2);
open(OUTFILE, ">", $outfile) or die "Cannot open $outfile: $!";
my $output_page = $mech->content();
print OUTFILE "$output_page";
close(OUTFILE);
print "done";

Re: Help with Mechanize

on 16.01.2007 15:49:42 by Peter

On Mon, 15 Jan 2007 12:17:22 -0600, William Jones wrote:
> I could use some help with Mechanize, and Andy Lester recommended I post an
> email on the libwww mailing list. I am trying to do what should be a
> simple scrape of the US Patent and Trademark Office website for the
> bibliographic info that they post for all patents. Unfortunately, I keep
> getting redirected to a page that says
>
> Basically, I am wondering how the website could know that I am using
> Mechanize and not Internet Explorer to enter the info into the fields and
> click "submit."

You could set the user_agent, but see below.
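
For reference, a minimal sketch of what setting the user agent could look
like. It assumes WWW::Mechanize's agent_alias() method and its built-in
'Windows IE 6' alias; nothing in this thread shows that the site actually
checks the User-Agent header, so treat it as something to try rather than
as the fix.

```perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new( autocheck => 1 );

# Present a browser-like User-Agent string instead of the default
# "WWW-Mechanize/x.y" one. 'Windows IE 6' is one of the aliases that
# ship with WWW::Mechanize.
$mech->agent_alias('Windows IE 6');

# Show the string that will be sent with subsequent requests.
print $mech->agent, "\n";
```

No request is made here; the alias only changes the header sent by later
get() and click_button() calls.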

> Here is my Perl code. Thanks.
>
>
> #!/usr/local/bin/perl -w
> print "Content-type: text/html\n\n";
> use strict;
> use WWW::Mechanize;
> use Crypt::SSLeay;
> my $url = "https://ramps.uspto.gov/eram/";
> my $maintenancepatent = "5771669";
> my $maintenanceapp = "08672157";
> my $outfile = "out.htm";
> my $mech = WWW::Mechanize->new( autocheck => 1);
> $mech->proxy(['https'], '');
> $mech->get($url);
> $mech->follow_link(text => "Pay or Look up Patent Maintenance Fees", n => 1);
> $mech->form_name('mfInputForm');
> $mech->field(patentNum => "$maintenancepatent");
> $mech->field(applicationNum => "$maintenanceapp");
> $mech->add_header( Referer => $url );
> $mech->click_button (number => 2);
> open(OUTFILE, ">$outfile");
> my $output_page = $mech->content();
> print OUTFILE "$output_page";
> close(OUTFILE);
> print "done";

I would say one of two things: either (a) you've made more requests than
their terms of service permit and your IP is blacklisted, or (b) you've
got something unnecessary above (my session below leaves out the proxy()
call and the hand-set Referer header), because when I try it with less
code than you've got, it works:

$ perl -MWWW::Mechanize -de '$m = WWW::Mechanize->new; 1'

Loading DB routines from perl5db.pl version 1.28
Editor support available.

Enter h or `h h' for help, or `man perldebug' for more help.

main::(-e:1): $m = WWW::Mechanize->new; 1
DB<1> n
main::(-e:1): $m = WWW::Mechanize->new; 1
DB<1> $m->get("https://ramps.uspto.gov/eram/")

DB<2> $m->follow_link(text => "Pay or Look up Patent Maintenance Fees", n =>1) or die

DB<3> $m->form_name('mfInputForm')

DB<4> $m->field(patentNum => "5771669")

DB<5> $m->field(applicationNum => "08672157")

DB<6> $m->click_button(number=>2)

DB<7> p $m->content(format=>'text')
USPTO - Patent Bibliographic Data (Patent Number: 5771669) Patent
Bibliographic Data01/16/2007 09:46 AMPatent Number:5771669Application
Number:08672157Issue Date:06/30/1998Filing Date:06/27/1996Title:METHOD
AND APPARATUS FOR MOWING IRREGULAR TURF AREAS[...]
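
The debugger session above can be condensed into a script along these
lines. This is a sketch, not a tested replacement: it assumes the site
still behaves as it did in the session, the helper name
fetch_bibliographic_text is made up for illustration, and it deliberately
omits the proxy() and add_header( Referer => ... ) calls from the original
script so that Mechanize handles the Referer header on its own.

```perl
#!/usr/local/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

# Hypothetical helper; the steps mirror the debugger session above.
sub fetch_bibliographic_text {
    my ( $patent, $application ) = @_;

    my $mech = WWW::Mechanize->new( autocheck => 1 );
    $mech->get("https://ramps.uspto.gov/eram/");
    $mech->follow_link( text => "Pay or Look up Patent Maintenance Fees", n => 1 )
        or die "Could not find the maintenance fees link";

    # Fill in the lookup form and press its second button, as in the session.
    $mech->form_name('mfInputForm');
    $mech->field( patentNum      => $patent );
    $mech->field( applicationNum => $application );
    $mech->click_button( number => 2 );

    # Return the page as plain text rather than HTML.
    return $mech->content( format => 'text' );
}

# Only touch the network when both numbers are given on the command line.
if ( @ARGV == 2 ) {
    print fetch_bibliographic_text(@ARGV), "\n";
}
```

Saved under any file name, it would be run with the patent number and the
application number as its two arguments, for example 5771669 and 08672157.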

--
Peter Scott
http://www.perlmedic.com/
http://www.perldebugged.com/