WWW::Mechanize : Is immediate caching of images possible?

am 04.01.2008 23:12:30 von hikari.no.hohenheim

Traditionally when using WWW::Mechanize to dl images I first fetch the
root page:
my $mech = WWW::Mechanize->new();
$mech->get($url);

then proceed to find all images and 'get' them one by one: (forgive
the crude code)

my @links = $mech->find_all_images();
foreach my $link (@links) {
    my $imageurl = $link->url_abs();
    $imageurl =~ m/([^\/]+)$/;
    $mech->get($imageurl, ':content_file' => $1);
}

My current problem with this is that I'm trying to dl an image
generated with information from the session of the original
get($url). It's not a static *.jpg or something simple; it's a black
box that displays an image relevant to the session. Meaning, when I
fetch the image (http://www.domain.com/image/ which is embedded in the
page) as shown above, it's a new request and I get a completely random
image.

Is there a way to cache the images that are loaded during the initial
get($url) so that the image matches the content of the page
retrieved? Or even to capture the session information transmitted to
the black box, domain.com/image/, so I can clone the information and
submit it with the get($imageurl)?

Ideally I would effectively like a routine like
$mech->getComplete($url, $directory), which would save the source,
images, etc. associated with the page. Analogous to Firefox's
Save Page As -> "Web Page, complete".
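
In other words, something like this hypothetical sketch (getComplete,
the directory layout, and the static-image assumption are all mine; note
it still fetches each image with a second request, which is exactly my
problem):

use WWW::Mechanize;
use File::Spec;

# Hypothetical getComplete(): save the page source plus its images
# into $directory. Only works when the images are static resources.
sub getComplete {
    my ($mech, $url, $directory) = @_;
    $mech->get($url);
    open my $fh, '>', File::Spec->catfile($directory, 'index.html')
        or die "can't write index.html: $!";
    print {$fh} $mech->content;
    close $fh;
    for my $img ($mech->find_all_images()) {
        my $imageurl = $img->url_abs();
        next unless $imageurl =~ m{([^/]+)$};   # skip URLs without a filename part
        $mech->get($imageurl,
                   ':content_file' => File::Spec->catfile($directory, $1));
        $mech->back();                          # return to the saved page
    }
}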

Thanks all. I think I'm getting pretty proficient with WWW::Mechanize
but don't be afraid to respond like I am an idiot so that we know your
answer doesn't go over my head.

Hikari

Re: WWW::Mechanize : Is immediate caching of images possible?

am 04.01.2008 23:45:14 von Joost Diepenmaat

hikari.no.hohenheim@gmail.com writes:

> Traditionally when using WWW::Mechanize to dl images I first fetch the
> root page:
> my $mech = WWW::Mechanize->new();
> $mech->get($url);
>
> then proceed to find all images and 'get' them one by one: (forgive
> the crude code)
>
> my @links = $mech->find_all_images();
> foreach my $link (@links) {
>     my $imageurl = $link->url_abs();
>     $imageurl =~ m/([^\/]+)$/;
>     $mech->get($imageurl, ':content_file' => $1);
> }

Ok...

> My current problem with this is that I'm trying to dl an image
> generated with information from the session of the original
> get($url). It's not a static *.jpg or something simple; it's a black
> box that displays an image relevant to the session. Meaning, when I
> fetch the image (http://www.domain.com/image/ which is embedded in the
> page) as shown above, it's a new request and I get a completely random
> image.

Since an HTTP request only gets one resource at a time, how does the server
associate them?

> Is there a way to cache the images that are loaded during the initial
> get($url) so that the image matches the content of the page
> retrieved?

The content of the page has no bearing on the content of the image. At
least not from the client's point of view.

> Or even to capture the session information transmitted to
> the black box, domain.com/image/, so I can clone the information and
> submit it with the get($imageurl)?

I don't know what you mean. Session /IDs/ are usually handled either via
cookies or via additional URL parameters/substrings. Either case should
already work automatically with WWW::Mechanize. There is no way at all
that you can get at session information from a client unless the server
provides special mechanisms to do so, which they normally don't, since
one of the reasons to use sessions in the first place is to separate the
client from the session data.
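
For instance, a cookie-based session needs no extra work at all (a
minimal sketch; www.example.com is a placeholder):

use WWW::Mechanize;

# Mechanize keeps an in-memory cookie jar by default, so a session
# cookie set by the first get() is sent with every later request.
my $mech = WWW::Mechanize->new();
$mech->get('http://www.example.com/index.php');   # server sets the cookie
$mech->get('http://www.example.com/image/');      # cookie goes along automatically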

> Ideally I would effectively like a routine like
> $mech->getComplete($url, $directory), which would save the source,
> images, etc. associated with the page. Analogous to Firefox's
> Save Page As -> "Web Page, complete".

Your script above already does more or less exactly that. But see below.

> Thanks all. I think I'm getting pretty proficient with WWW::Mechanize
> but don't be afraid to respond like I am an idiot so that we know your
> answer doesn't go over my head.

It looks to me like the server generating the image does some kind of
hackish thing to associate the image with the original request. Since
WWW::Mechanize already handles session cookies automatically, I'd guess
you may need to get the image while making sure that your Referer header
is set to the originating page. Everything else I can think of should
already work automatically. Usually, just using the back() method would
suffice.

Something like this:

$mech->get($page_url);
foreach my $img ($mech->find_all_images()) {
    $mech->get($img->url_abs());
    # do stuff with the image in $mech->content
    $mech->back();   # reset the originating page to $page_url
}

HTH,
Joost.

Re: WWW::Mechanize : Is immediate caching of images possible?

am 05.01.2008 00:24:36 von John Bokma

Joost Diepenmaat wrote:

> I don't know what you mean. Session /IDs/ are usually handled either
> via cookies or via additional URL parameters

I guess you mean query string

> /substrings.

I guess you mean path segments (path_info)

> Either case

There is a 3rd one: (hidden) fields in a form
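
Those travel automatically too, as long as you submit the form through
Mechanize. A sketch ($page_url and the field names are made up):

# submit_form() sends every field in the form, hidden ones included;
# only the fields you name are changed.
$mech->get($page_url);
$mech->submit_form(
    form_number => 1,
    fields      => { user_visible_field => 'some value' },
);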

--
John

Arachnids near Coyolillo - part 1
http://johnbokma.com/mexit/2006/05/04/arachnids-coyolillo-1.html

Re: WWW::Mechanize : Is immediate caching of images possible?

am 05.01.2008 02:00:58 von Tad J McClellan

hikari.no.hohenheim@gmail.com wrote:


> (forgive
> the crude code)


It is a bit too crude to let pass without comment...


> my @links = $mech->find_all_images();
> foreach my $link (@links) {
>     my $imageurl = $link->url_abs();
>     $imageurl =~ m/([^\/]+)$/;
>     $mech->get($imageurl, ':content_file' => $1);


You should never use the dollar-digit variables unless you have
first ensured that the match *succeeded*.


> fetch the image (http://www.domain.com/image/


Your match will fail for that value of $imageurl for instance...
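
A guarded version of the loop (a sketch):

for my $link ($mech->find_all_images()) {
    my $imageurl = $link->url_abs();
    if ($imageurl =~ m{([^/]+)$}) {
        # the match succeeded, so $1 is safe to use here
        $mech->get($imageurl, ':content_file' => $1);
    }
    else {
        warn "no filename part in '$imageurl', skipping\n";
    }
}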


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Re: WWW::Mechanize : Is immediate caching of images possible?

am 05.01.2008 05:08:11 von hikari.no.hohenheim

On Jan 4, 5:45 pm, Joost Diepenmaat wrote:
> Since an HTTP request only gets one resource at a time, how does the server
> associate them?
>
> The content of the page has no bearing on the content of the image. At
> least not from the client's point of view.


Thanks, Joost, and everyone.

I've figured out a bit more; essentially the root page
(www.domain.com/index.php) has an image tag:

<img src="http://www.domain.com/image/">

Turns out this is also a PHP script (*/image/index.php). So it's the
PHP that returns the image and coordinates the content of the page
with the image chosen. If I open the main page in a browser, the image
matches the content of the page. But when I $mech->get() the main page
and then subsequently $mech->get() the */image/ (or */image/index.php),
I get a different image.

I have little to no experience with PHP. I know it's done server side
so I can't view the source, but I don't know what information is
different or changes when I $mech->get() the page and the image
separately.

I'm probably out of my league, but if you have any thoughts I'd love
to hear them.

-Hikari

Re: WWW::Mechanize : Is immediate caching of images possible?

am 05.01.2008 14:26:01 von hjp-usenet2

On 2008-01-05 04:08, hikari.no.hohenheim@gmail.com wrote:
> I've figured out a bit more; essentially the root page
> (www.domain.com/index.php) has an image tag:
>
> <img src="http://www.domain.com/image/">
>
> Turns out this is also a PHP script (*/image/index.php). So it's the
> PHP that returns the image and coordinates the content of the page
> with the image chosen. If I open the main page in a browser, the image
> matches the content of the page. But when I $mech->get() the main page
> and then subsequently $mech->get() the */image/ (or */image/index.php),
> I get a different image.
>
> I have little to no experience with PHP. I know it's done server side
> so I can't view the source, but I don't know what information is
> different or changes when I $mech->get() the page and the image
> separately.

All the information must be explicitly sent by the client (be it the
browser or your script). So you can use a packet sniffer
(e.g., Wireshark) to compare the requests and see what your script does
differently than the browser.
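
On the script side you can also make LWP show you exactly what it
sends (a sketch; add_handler() needs a reasonably recent LWP):

# Dump every outgoing request and incoming response header that
# WWW::Mechanize (via LWP::UserAgent) produces.
$mech->add_handler(request_send  => sub { shift->dump; return });
$mech->add_handler(response_done => sub { shift->dump(maxlength => 256); return });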

One possibility is that the HTML page contains some JavaScript which
modifies the request - in this case you will have to analyze the
JavaScript and rewrite it in Perl.

hp

Re: WWW::Mechanize : Is immediate caching of images possible?

am 05.01.2008 14:57:47 von Joost Diepenmaat

John Bokma writes:

> Joost Diepenmaat wrote:
>
>> I don't know what you mean. Session /IDs/ are usually handled either
>> via cookies or via additional URL parameters
>
> I guess you mean query string
>
>> /substrings.
>
> I guess you mean path segements (path_info)

Actually, I meant to say that the server can encode the session ID in
the URL in any way it wants. Using path_info or query strings can
be convenient, but you can't be sure how a server interprets a URL
at all.

>> Either case
>
> There is a 3rd one: (hidden) fields in a form

True, but probably not used here since the thread is about <img> tags.

Joost.

Re: WWW::Mechanize : Is immediate caching of images possible?

am 05.01.2008 15:54:54 von John Bokma

Joost Diepenmaat wrote:

> John Bokma writes:
>
>> Joost Diepenmaat wrote:
>>
>>> I don't know what you mean. Session /IDs/ are usually handled either
>>> via cookies or via additional URL parameters
>>
>> I guess you mean query string
>>
>>> /substrings.
>>
>> I guess you mean path segments (path_info)
>
> Actually, I meant to say that the server can encode the session ID in
> the URL in any way it wants. Using path_info or query strings can
> be convenient, but you can't be sure how a server interprets a URL
> at all.

Session IDs in URLs are obvious IMO.

>>> Either case
>>
>> There is a 3rd one: (hidden) fields in a form
>
> True, but probably not used here since the thread is about <img> tags.

A POST request can return an image, of course. But I just mentioned it
to be complete.

--
John

Arachnids near Coyolillo - part 1
http://johnbokma.com/mexit/2006/05/04/arachnids-coyolillo-1.html

Re: WWW::Mechanize : Is immediate caching of images possible?

am 05.01.2008 23:09:45 von Tad J McClellan

hikari.no.hohenheim@gmail.com wrote:

> I'm probably out of my league, but if you have any thoughts I'd love
> to hear them.


Have you tried using the Web Scraping Proxy (wsp.pl) to capture
the requests that the browser is making?
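
Point both the browser and the script at it; for the script that is
just (a sketch; host and port depend on how you start the proxy):

# Route Mechanize's traffic through the same logging proxy the
# browser uses, so the two request streams can be compared.
$mech->proxy(['http'], 'http://localhost:8080/');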


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Re: WWW::Mechanize : Is immediate caching of images possible?

am 06.01.2008 03:35:03 von hikari.no.hohenheim

On Jan 5, 5:09 pm, Tad J McClellan wrote:
> Have you tried using the Web Scraping Proxy (wsp.pl) to capture
> the requests that the browser is making?

On Jan 5, 8:26 am, "Peter J. Holzer" wrote:
> All the information must be explicitly sent by the client (be it the
> browser or your script). So you can use a packet sniffer
> (e.g., Wireshark) to compare the requests and see what your script does
> differently than the browser.

I'm checking out these options now. I don't have much experience with
either, but we'll see where it takes me.

The consensus seems to be that there is no easy way to simulate a
browser in the sense of having all the incoming content be the result
of the initial request, since by default Mechanize gets the HTML source
from the first request and the image or other content as the result of
a second request.

One way or another I'll have to figure out what information is hidden
from my eyes, and add it to the $mech->get() for the image with
$mech->add_header or something. Sound about right? Or would anyone
care to articulate this more precisely?

Thanks again!

-Hikari

Re: WWW::Mechanize : Is immediate caching of images possible?

am 06.01.2008 04:25:17 von Ben Morrow

Quoth hikari.no.hohenheim@gmail.com:
> On Jan 5, 5:09 pm, Tad J McClellan wrote:
> > Have you tried using the Web Scraping Proxy (wsp.pl) to capture
> > the requests that the browser is making?
>
> On Jan 5, 8:26 am, "Peter J. Holzer" wrote:
> > All the information must be explicitly sent by the client (be it the
> > browser or your script). So you can use a packet sniffer
> > (e.g., Wireshark) to compare the requests and see what your script does
> > differently than the browser.
>
> I'm checking out these options now. I don't have much experience with
> either, but we'll see where it takes me.
>
> The consensus seems to be that there is no easy way to simulate a
> browser in the sense of having all the incoming content be the result
> of the initial request, since by default Mechanize gets the HTML source
> from the first request and the image or other content as the result of
> a second request.

This is exactly the same as what the browser does. There is no way[0]
for a single request to return two objects, so the browser makes two
requests as well. All you need to do is find out what the second request
should be; Tad's suggestion of using the WSP is probably the easiest way
to do that.

[0] multipart/* aside, as that clearly doesn't apply here.

> One way or another I'll have to figure out what information is hidden
> from my eyes, and add it to the $mech->get() for the image with
> $mech->add_header or something. Sound about right? Or would anyone
> care to articulate this more precisely?

That sounds about right. Have you tried using $mech->back yet?
WWW::Mech tries quite hard to do all the hard work for you, but it can't
if you don't give it all the information.

Ben

Re: WWW::Mechanize : Is immediate caching of images possible?

am 06.01.2008 14:39:58 von Joost Diepenmaat

hikari.no.hohenheim@gmail.com writes:

> On Jan 5, 5:09 pm, Tad J McClellan wrote:
>> Have you tried using the Web Scraping Proxy (wsp.pl) to capture
>> the requests that the browser is making?
>
> On Jan 5, 8:26 am, "Peter J. Holzer" wrote:
>> All the information must be explicitly sent by the client (be it the
>> browser or your script). So you can use a packet sniffer
>> (e.g., Wireshark) to compare the requests and see what your script does
>> differently than the browser.
>
> I'm checking out these options now. I don't have much experience with
> either, but we'll see where it takes me.
>
> The consensus seems to be that there is no easy way to simulate a
> browser in the sense of having all the incoming content be the result
> of the initial request, since by default Mechanize gets the HTML source
> from the first request and the image or other content as the result of
> a second request.

No, well, at least that's not how I see it: it doesn't matter what is
doing the requests for the images and page, your script or a browser;
it will always first get the page, and then get the images, each as
its own request. *Because you can't do it any other way in HTTP+HTML.*
If your code does not give the same result, then either the server is
doing something tricky to associate the requests (which it probably
shouldn't do), or your code doesn't do what you think it does.

Tracing the requests/responses from the browser and your script may
give you insight into what's going wrong.

> One way or another I'll have to figure out what information is hidden
> from my eyes, and add it to the $mech->get() for the image with
> $mech->add_header or something. Sound about right? Or would anyone
> care to articulate this more precisely?

Yes, that sounds about right. You may also be interested in
HTTP::Recorder or plain HTTP::Proxy on CPAN.
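
A minimal logging proxy with HTTP::Proxy might look like this (a
sketch; the port is arbitrary):

#!/usr/bin/perl
use strict;
use warnings;
use HTTP::Proxy;
use HTTP::Proxy::HeaderFilter::simple;

# Log the request line and headers of everything passing through.
my $proxy = HTTP::Proxy->new(port => 3128);
$proxy->push_filter(
    request => HTTP::Proxy::HeaderFilter::simple->new(
        sub {
            my ($self, $headers, $request) = @_;
            print $request->method, ' ', $request->uri, "\n",
                  $headers->as_string, "\n";
        }
    ),
);
$proxy->start;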

Joost.

Re: WWW::Mechanize : Is immediate caching of images possible?

am 06.01.2008 18:23:14 von hikari.no.hohenheim

On Jan 6, 8:39 am, Joost Diepenmaat wrote:
> or your code doesn't do what you think it does.
>
> Tracing the requests/responses from the browser and your script may
> give you insight into what's going wrong.

It ended up being a bit of both, but I have a working copy now. wsp.pl
was excellent for diagnosing because it let me see all the cookies,
session IDs, and Referer values I needed to double-check.

And embarrassingly, I found that the way I was passing my variables
was somehow a bit sloppy and, though it worked with everything else,
it was losing some cookie/session ID information. So I hacked it
with a global variable last night (no need for a lecture) and it
worked, so I'm editing and cleaning up the way I pass the information
in my subroutines now.
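
(The cleanup amounts to passing the one $mech object into every
subroutine instead of stashing things in globals; something like this
sketch, where fetch_image is made up:)

# One $mech object = one session: cookies, history, and Referer
# survive across calls as long as everything uses the same object.
sub fetch_image {
    my ($mech, $imageurl, $file) = @_;
    $mech->get($imageurl, ':content_file' => $file);
}

fetch_image($mech, $imageurl, 'image.jpg');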

All around a very valuable learning experience in a couple of
different categories. Thanks everyone!

(for the record, the end result was effectively)

$mech->get($first);
$mech->follow_link(url => $second);   # this preserved the Referer
$mech->get($imageurl);

Obviously there are a lot more complexities wrapped around it in my
implementation, which is why I screwed it up, but looking at that I
feel pretty stupid right about now :-)

-Hikari