Mechanize - redirect problem
on 22.02.2005 23:26:59 by martin
hi
as i've seen that this list is used for mechanize questions too i'll try
my question...
i try to login to the page http://mymobile.sunrise.ch/ but it seems like
mechanize is not doing the redirect that is on the start site... if i
try with my browser or wget i get redirected to a page like
http://mymobile.sunrise.ch/portal/res/guest;jsessionid=HCCISJ1USYYSVQFIGZAXRAQ?paf_dm=full&paf_gear_id=100001&?successURL=/portal/res/member%3Bjsessionid%3DHCCISJ1USYYSVQFIGZAXRAQ
i tried it with a simple "get" but it doesn't work and i don't see what
the problem could be... any idea what i'm doing wrong?
btw. i'm using mechanize 1.08
regards
KoS
--
Martin Kos +41-76-384-93-33
http://kos.li Say NO to HTML in mail ICQ# 13556143
Proudly running Debian GNU/Linux
Re: Mechanize - redirect problem
on 23.02.2005 10:20:25 by peter.stevens
Hi Martin,
I have written scrapers for a number of different sites, all of which
require sign on, and have not had any problems with redirection not
being performed. Hmm. Haven't done anything with sunrise yet (although I
would like to automatically download my faxes from my sunrise onebox).
One of the best tips I've gotten from this list is to put "use
LWP::Debug qw(+); " in to your code. This turns on a trace so you can
see what is happening.
Have you looked at LWP::UserAgent::max_redirect()? By default (unless
Mechanize changes this), it will only follow 7 redirects. By default it
does not follow redirects in response to a post (which it is also not
supposed to do, according to RFC).
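A minimal sketch of the two checks suggested above. LWP::Debug is deprecated in later libwww-perl releases and no longer produces a trace there, so this sketch assumes a release that provides LWP::UserAgent's add_handler and dumps each message instead. The helper names traced_mech and redirect_chain are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: a Mechanize object that prints every request and
# response (each redirect hop included) as it happens.
sub traced_mech {
    require WWW::Mechanize;    # assumed installed; loaded only when needed
    my $mech = WWW::Mechanize->new( autocheck => 0 );
    $mech->add_handler( request_send  => sub { shift->dump; return } );
    $mech->add_handler( response_done => sub { shift->dump; return } );
    return $mech;
}

# Hypothetical helper: the URLs requested on the way to $response, in order.
# Works on any HTTP::Response-like object.
sub redirect_chain {
    my ($response) = @_;
    return map { $_->request->uri . '' } $response->redirects, $response;
}
```

After something like $mech->get('http://mymobile.sunrise.ch/'), printing redirect_chain( $mech->response ) shows where the request ended up, and $mech->max_redirect returns the redirect limit currently in force.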
Hope this helps & please do let me know how you fix it!
Cheers,
Peter
Re: Mechanize - redirect problem
on 23.02.2005 18:51:09 by martin
hi peter
> I have written scrapers for a number of different sites, all of which
> require sign on, and have not had any problems with redirection not
> being performed. Hmm. Haven't done anything with sunrise yet (although I
> would like to automatically download my faxes from my sunrise onebox).
i have made a script to send SMS over their site (so i don't need to login
myself and i can use my local address book instead of entering all
addresses in their online address book). it worked fine until a week
ago, when it stopped working and i don't see why.
> One of the best tips I've gotten from this list is to put "use
> LWP::Debug qw(+); " in to your code. This turns on a trace so you can
> see what is happening.
hey thanks! that helped a lot... now i see that mechanize gets the first page
GET http://mymobile.sunrise.ch/
and then gets redirected to
GET http://mymobile.sunrise.ch/portal/res/member
and then it gets a cookie
extract_cookies: Set cookie JSESSIONID => ERXFXSMIMG5ZBQFIGZAXRA
and it goes to the right URL
GET http://mymobile.sunrise.ch/portal/res/guest;jsessionid=ERXFXSMIMG5ZBQFIGZAXRAQ?paf_dm=full&paf_gear_id=100001&?successURL=/portal/res/member%3Bjsessionid%3DERXFXSMIMG5ZBQFIGZAXRAQ
but instead of stopping at that URL it requests an additional URL
GET http://mobile.sunrise.ch/atg_500
and this site shows me a sunrise error page.
if i enter the long URL in my browser i see the normal login page that i
should see!
> Have you looked at LWP::UserAgent::max_redirect()? By default (unless
> Mechanize changes this), it will only follow 7 redirects. By default it
> does not follow redirects in response to a post (which it is also not
> supposed to do, according to RFC).
no it's definitely not the max_redirect, as it does too MANY redirects ;-)
> Hope this helps & please do let me know how you fix it!
could you try a simple get on http://mymobile.sunrise.ch/ and see if you
get the login-page instead of http://mobile.sunrise.ch/atg_500. perhaps i
have a problem with my mechanize version (debian unstable)
greets
KoS
btw.: i've just signed up for a /ch/open membership ;-)
Re: Mechanize - redirect problem
on 23.02.2005 19:01:31 by Andy
On Wed, Feb 23, 2005 at 06:51:09PM +0100, Martin Kos (martin@kos.li) wrote:
> > One of the best tips I've gotten from this list is to put "use
> > LWP::Debug qw(+); " in to your code. This turns on a trace so you can
Can one of you guys please write up a paragraph on that LWP::Debug trick
so that I can drop it in the FAQ? I didn't even know about it.
Thanks,
xoxo,
Andy
--
Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance
Re: Mechanize - Scraping Tips
on 23.02.2005 22:52:16 by peter.stevens
Hi Andy,
Here are some of the problems that I have had which were real
brain-teasers along with some thoughts on how to solve them, including
the LWP::Debug trick. Feel free to republish, but please cite the source.
Cheers,
Peter
Q: How do I figure out why $mech->get($url) doesn't work, times out, or
has other strange problems?
A: There are many reasons why a get can fail. The server can take you to
someplace you didn't expect. It can generate redirects which are not
properly handled. You can get time outs. Servers are down more often
than you think! Etc. A couple of places to start:
1. Check $mech->status() after each call
2. Check the URL with $mech->uri() to see where you ended up
3. If things are really strange, turn on debugging with
       use LWP::Debug qw(+);
   Just put this in the main program. This causes LWP to print out a
   trace of the HTTP traffic between client and server and can be
   used to figure out what is happening at the protocol level.
It is also useful to set many traps to verify that processing is
proceeding as expected. A $mech program should always have an "I didn't
expect to get here" or "I don't recognize the page that I am processing"
case and bail out.
Since errors can be transient, by the time you notice that the error has
occurred, it might not be possible to reproduce it manually. So for
automated processing it is useful to email yourself the following
information:
* $CLASS ( or other indication of where processing is taking place )
* An Error Message
* $mech->uri
* $mech->content
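The checks and the report fields listed above could be wrapped up roughly as follows. This is only a sketch: failure_report and get_expected are invented names, the pattern for recognising the page is something you supply, and it assumes the Mechanize object was created with autocheck => 0 so that a failed get() does not die on its own.

```perl
use strict;
use warnings;

# Hypothetical helper: gather the debugging details into one text block
# that can be mailed or logged.
sub failure_report {
    my ( $mech, $where, $message ) = @_;
    return join "\n",
        "Where:   $where",
        "Error:   $message",
        'Status:  ' . $mech->status,
        'URI:     ' . $mech->uri,
        'Content:',
        $mech->content,
        '';
}

# Hypothetical helper: fetch $url and bail out unless the request succeeded
# and the page matches the pattern we expect to find on it.
sub get_expected {
    my ( $mech, $url, $expected_re ) = @_;
    $mech->get($url);
    die failure_report( $mech, 'get_expected', 'request failed' )
        unless $mech->success;
    die failure_report( $mech, 'get_expected', "page does not match $expected_re" )
        unless $mech->content =~ $expected_re;
    return $mech->content;
}
```

The die message carries everything needed to look at the failure later, even if the error cannot be reproduced by hand.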
Q: I want to post a form. I filled it out and submitted it with
$mech->submit(). But the server ignored everything and resent an empty
form. What happened?
A: The post is handled by application software. It is common for PHP
programmers to use the same file both to display a form and to process
the arguments returned. So the first task of the application programmer
is to decide whether there are arguments to process. The program can
check whether a particular parameter has been set, whether a hidden
parameter has been set, or whether the submit button has been clicked.
(There are probably other ways that I haven't thought of).
In any case, if your form is not setting the parameter (e.g. the submit
button) which the web application is keying on (and as an outsider there
is no way to know what it is keying on), it will not notice that the
form has been submitted. So - try using $mech->click() instead of
submit() or vice-versa.
Q: I seem to have logged in successfully to the server, but when I try
to access protected content I get HTTP 500 (Internal Server Error) or other
strange errors. Why?
A: Some web sites use distributed databases for their processing. It can
take a few seconds for the login/session information to percolate
through to all the servers. For human users with their slow reaction
times, this is not a problem, but a Perl script can outrun the server.
So try adding a sleep(5) between logging in and actually doing anything
(the optimal delay must be determined experimentally).
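A small variation on the fixed sleep(5): retry the first protected request a few times, pausing between attempts. This is a sketch, not part of Mechanize; retry_with_delay is an invented name, and the number of tries and the delay are guesses to be tuned per site.

```perl
use strict;
use warnings;

# Hypothetical helper: call $attempt until it returns a true value, sleeping
# $delay seconds between attempts. Returns that value, or undef once $tries
# attempts have failed.
sub retry_with_delay {
    my ( $attempt, $tries, $delay ) = @_;
    for my $n ( 1 .. $tries ) {
        my $result = $attempt->();
        return $result if $result;
        sleep $delay if $n < $tries;
    }
    return;
}
```

With a Mechanize object created with autocheck => 0, the callback could be sub { $mech->get($url); $mech->success ? $mech->content : undef }, where $url is the protected page.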
Q: I am trying to scrape a page that uses Javascript. The links are
Javascript functions instead of URLs. What can I do?
A: Since Javascript is completely visible to the client, it cannot be
used to prevent a scraper from following links. But it can make life
difficult, and until someone writes a Javascript interpreter for Perl or
a mechanize clone to control Firefox, there will be no general solution.
But if you want to scrape specific pages, then a solution is always
possible.
One typical use of Javascript is to perform argument checking before
posting to the server. The URL you want is probably just buried in the
Javascript function. Do a regular expression match on $mech->content()
to find the link that you want and $mech->get it directly (this assumes
that you know what you are looking for in advance).
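As a sketch of the regular-expression approach: the helper below pulls the first quoted argument out of every call to a given Javascript function. The function name openDoc used in the note is hypothetical; on a real page you would look at the source to see which function carries the URL.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first quoted argument of every call to
# the Javascript function $func found in $html.
sub js_call_args {
    my ( $html, $func ) = @_;
    my @args;
    while ( $html =~ /\b\Q$func\E\s*\(\s*(["'])(.*?)\1/g ) {
        push @args, $2;
    }
    return @args;
}
```

For a link such as <a href="javascript:openDoc('/docs/report.pdf')">, js_call_args( $mech->content, 'openDoc' ) yields /docs/report.pdf, which can then be passed to $mech->get.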
In more difficult cases, the Javascript is used for URL mangling to
satisfy the needs of some middleware. In this case you need to figure
out what the Javascript is doing (why are these URLs always really
long?). There is probably some function with one or more arguments which
calculates the new URL. Step 1: using your favorite browser, get the
before and after URLs and save them to files. Edit each file, converting
the argument separators ('?', '&' or ';') into newlines. Now it is
easy to use diff or comm to find out what Javascript did to the URL.
Step 2: find the function call which created the URL - you will need to
parse and interpret its argument list. Using the Javascript Debugger
Extension for Firefox may help with the analysis. At this point, it is
fairly trivial to write your own function which emulates the Javascript
for the pages you want to process.
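Step 1 can also be done without files and diff. The sketch below splits both URLs at the same separators and reports the parts that differ; url_parts and url_diff are invented names, and the comparison ignores the order of the parts.

```perl
use strict;
use warnings;

# Split a URL at the argument separators ('?', '&' and ';'),
# dropping empty pieces.
sub url_parts {
    my ($url) = @_;
    return grep { length } split /[?&;]/, $url;
}

# Hypothetical helper: list the parts found only in the "before" URL and
# the parts found only in the "after" URL.
sub url_diff {
    my ( $before, $after ) = @_;
    my %in_before = map { $_ => 1 } url_parts($before);
    my %in_after  = map { $_ => 1 } url_parts($after);
    return {
        removed => [ grep { !$in_after{$_} } url_parts($before) ],
        added   => [ grep { !$in_before{$_} } url_parts($after) ],
    };
}
```

Whatever ends up under added is what the Javascript contributed and what your own function has to reproduce.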
--
----------------------------------------------------------------------
Peter Stevens Phone: +41 43 535 8517
www.MinuteWatcher.com Fax: +41 44 544 8392
Re: Mechanize - Scraping Tips
on 23.02.2005 23:07:23 by martin
hi andy
i have had a problem with javascript today... the javascript added some
hidden fields that were not in the html and i had to find a way to
add them with mechanize. i found the following code in the ML archive and
it just worked fine! so you could add this example....
$agent->form(3);    # select form number 3 on the page
my $form = $agent->current_form();

# add the radio button that the javascript would have created, then select it
$form->push_input( "radio", { name => "searchMode", value => "propertyid" } );
$agent->field( "searchMode", "propertyid" );

# add the text field that the javascript would have created, then fill it in
$form->push_input( "text", { name => "propertyId", value => "" } );
$agent->field( "propertyId", $prop_cd );
original post:
http://groups.google.ch/groups?th=fc49d619bb58cb4&seekm=Pine.LNX.4.44.0406101340440.5792-100000%40fnord.io.com
regards
KoS
ps.: problem with the sunrise login-page still not solved.... i'll see
if the problem could be solved by a sleep(x) ....
Re: Mechanize - redirect problem
on 24.02.2005 18:39:51 by peter.stevens
Hi Martin,
Just looked at your script description again - It sounds quite useful!
Have you thought of posting it on Swissforge? It would be a great "Code
tidbit".
Peter
Re: Mechanize - redirect problem
on 24.02.2005 20:17:24 by martin
hi peter
> Just looked at your script description again - It sounds quite useful!
the script started out as a simple mail2sms gateway so i
could get an sms whenever i receive a mail (after spam scanning and mail
filtering ;-)). in the meantime i've extended it so that i can choose
which "provider" i'll use for sending the sms, and added some trim and
multi-sms functions.
> Have you thought of posting it on Swissforge? It would be a great "Code
> tidbit".
swissforge? haven't seen this one ;-) ... why a swiss sourceforge?
i don't know if it is a good idea to post the script on swissforge. not
that i don't want to give the script to other people, but the problem is
that i use it for sending sms over the sunrise/orange/eth web portals and
the people over there wouldn't be happy if they saw that i circumvent
their interfaces.... and if too many people do the same they'll start to
change the interface so that it gets harder (or impossible) to use it
with a "simple" script.... what do you think?
greets
KoS
Re: Mechanize - redirect problem
on 24.02.2005 23:30:10 by jjl
On Tue, 22 Feb 2005, Martin Kos wrote:
[...]
> i try to login to the page http://mymobile.sunrise.ch/ but it seems like
> mechanize is not doing the redirect that is on the start site... if i
> try with my browser or wget i get redirect to a page like
> http://mymobile.sunrise.ch/portal/res/guest;jsessionid=HCCISJ1USYYSVQFIGZAXRAQ?paf_dm=full&paf_gear_id=100001&?successURL=/portal/res/member%3Bjsessionid%3DHCCISJ1USYYSVQFIGZAXRAQ
>
> i tried it with a simple "get" but it doesn't work and i don't see what
> the problem could be... any idea what i'm doing wrong?
It wants this header (or similar, but this is a minimal one):
Accept: text/html
Maybe mechanize should send an Accept header by default?
BTW, Martin: I debugged this by just looking at what Firefox sends. Get
livehttpheaders.
John
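For the Perl side of the list, a minimal sketch of adding that header with WWW::Mechanize's add_header method. The wrapper name send_accept_header is made up; the only real work is the single add_header call, and whether a given site needs more than text/html is something to find out by trying.

```perl
use strict;
use warnings;

# Hypothetical helper: make $mech send an Accept header with every request.
# Defaults to the minimal value reported to work above.
sub send_accept_header {
    my ( $mech, $accept ) = @_;
    $accept = 'text/html' unless defined $accept;
    $mech->add_header( Accept => $accept );
    return $mech;
}
```

Typical use would be my $mech = send_accept_header( WWW::Mechanize->new ); followed by $mech->get('http://mymobile.sunrise.ch/');.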
Re: Mechanize - redirect problem
on 25.02.2005 00:05:32 by martin
hi john
> It wants this header (or similar, but this is a minimal one):
> Accept: text/html
i have added this header and it just works!!! thanks a LOT!
> Maybe mechanize should send an Accept header by default?
i think that would be a good idea for the text/html type.
> BTW, Martin: I debugged this by just looking at what Firefox sends. Get
> livehttpheaders.
very handy firefox plugin! i didn't know it before.
how did you "see" that mechanize was missing the accept header and that
the server "needs" it? was it only a guess because firefox sends it?
regards
KoS
Re: Mechanize - redirect problem
on 25.02.2005 10:25:35 by peter.stevens
Hey John,
Nice detective work! I could imagine that both the solution and the
method for finding it will help explain a lot of the "It works with xxx
but not with mech" problems.
Andy - this really belongs in your tips & tricks page. BTW - where is
the page on the net? Do you mean the FAQ?
Cheers,
Peter
Re: Mechanize - redirect problem
on 25.02.2005 16:21:22 by Andy
On Fri, Feb 25, 2005 at 10:25:35AM +0100, Peter Stevens (peter.stevens@ch-open.ch) wrote:
> Andy - this really belongs in your tips & tricks page. BTW - where is
> the page on the net? Do you mean the FAQ?
I never said tips & tricks. There's a Cookbook.pod and a FAQ.pod, both
shipping with Mech.
Re: Mechanize - redirect problem
on 25.02.2005 17:46:44 by peter.stevens
Hi Andy,
Oops, sorry, I misquoted you. Whether FAQ, Cookbook or something new, I
think these kinds of practical hints on how to use mech would be
tremendously valuable (also the reason I sent you some extra questions
and their answers in response to your request).
BTW - I don't think that cookbooks are really helpful. By the time they
get to the users who need them, the websites have changed enough that
the recipes don't work anymore. This was a big problem with the
O'Reilly book. I bought it, but the examples I looked at didn't work.
Cheers,
Peter
Re: Mechanize - redirect problem
on 25.02.2005 18:19:39 by Andy
On Fri, Feb 25, 2005 at 05:46:44PM +0100, Peter Stevens (peter.stevens@ch-open.ch) wrote:
> Oops, sorry, I misquoted you. Whether FAQ, Cookbook or something new, I
> think this kind of practical hints on how to use mech would be
> tremendously valuable (also the reason I sent you some extra questions
> and their answers in response to your request).
I agree. Send me something and I'll drop it in.
xoa
Re: Mechanize - redirect problem
on 01.03.2005 13:40:00 by jjl
On Fri, 25 Feb 2005, Martin Kos wrote:
> hi john
>
> > It wants this header (or similar, but this is a minimal one):
> > Accept: text/html
> i have added this header and it just works!!! thanks a LOT!
>
> > Maybe mechanize should send an Accept header by default?
> i think that would be a good idea for the text/html type.
>
> > BTW, Martin: I debugged this by just looking at what Firefox sends. Get
> > livehttpheaders.
> very handy firefox plugin! i didn't know it before.
> how did you "see" that mechanize was missing the accept header and that
> the server "needs" it? was it only a guess because firefox sends it?
1. Blindly copied firefox headers that I noticed mechanize (in fact,
Python httplib/urllib2/mechanize) didn't send, or had obviously different
values (the latter, in the case of Accept).
2. Saw that it now worked.
3. Deleted hdrs until it stopped working again :-)
John
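The three steps above can be automated along these lines. This is a sketch in Perl rather than the Python John used; minimal_headers is an invented name, and the $works callback (which has to make the real request and decide whether the response is the expected page) is left to the caller.

```perl
use strict;
use warnings;

# Hypothetical helper: starting from a set of headers known to work, drop
# each header in turn and leave it out for good if the request still works
# without it. $works takes a hash reference of headers and returns true
# when the server responds as expected.
sub minimal_headers {
    my ( $headers, $works ) = @_;
    my %keep = %$headers;
    for my $name ( sort keys %$headers ) {
        my %trial = %keep;
        delete $trial{$name};
        %keep = %trial if $works->( \%trial );
    }
    return \%keep;
}
```

For the sunrise case, $works could create a fresh Mechanize object, call add_header for each pair, get the start page, and return false if the final $mech->uri is the atg_500 error page.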