LWP module - parse one line at a time (only download part of a page)

on 20.01.2006 19:50:56 by Alf McLaughlin

Hello-
My apologies if this is an old topic, but I did a lot of searching
first and couldn't quite find the best answer. Here is my problem
(very briefly):

I want to download a fairly large amount of data from a webpage
(~10MB), but the stuff I'm really interested in is always toward the
top of the page (however, I don't know exactly where). Since I'm only
interested in two or three lines, I don't want to download the whole
page. I would like to download until I see what I want (such as my
$current_line =~ /WHAT I WANT/) and then kill the download.

The problem isn't that 10MB is such a big deal, but I have to fetch
about 5000 different webpages like this. Any advice would be
greatly appreciated.

Thanks,
Alf

Re: LWP module - parse one line at a time (only download part of a page)

on 20.01.2006 21:56:12 by nobull67

Alf McLaughlin wrote:

> I want to download a fairly large amount of data from a webpage
> (~10MB), but the stuff I'm really interested in is always toward the
> top of the page (however, I don't know exactly where). Since I'm only
> interested in two or three lines, I don't want to download the whole
> page. I would like to download until I see what I want (such as my
> $current_line =~ /WHAT I WANT/) and then kill the download.

Read the description of the get() method of LWP::UserAgent.

In particular note the existence of the callback and the bit where it
says "The callback can abort the request by invoking die()."
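
A minimal sketch of that approach, assuming the text to find can be
described by a regex. The name fetch_until_match, the 4096-byte read
size hint and the command-line URL are made up for illustration. The
sub appends each chunk to a buffer, tests the buffer, and calls die()
inside the callback once the pattern shows up, which is what makes LWP
stop reading:

```perl
use strict;
use warnings;

# Hypothetical helper: fetch $url with the caller's LWP::UserAgent and stop
# as soon as the text received so far matches $pattern (a qr// object).
# Returns the matched text, or undef if the pattern never appeared.
sub fetch_until_match {
    my ($ua, $url, $pattern) = @_;
    my $seen = '';    # everything received so far
    my $found;

    $ua->get(
        $url,
        ':content_cb' => sub {
            my ($chunk) = @_;    # LWP also passes the response and protocol objects
            $seen .= $chunk;     # accumulate, so a match may span two chunks
            if ($seen =~ /($pattern)/) {
                $found = $1;
                die "match found\n";    # aborts the rest of the transfer
            }
        },
        ':read_size_hint' => 4096,      # only a hint; chunk sizes may differ
    );
    return $found;
}

# Usage: perl fetch_until_match.pl URL
if (@ARGV) {
    require LWP::UserAgent;
    my $ua  = LWP::UserAgent->new;
    my $hit = fetch_until_match($ua, $ARGV[0], qr/WHAT I WANT/);
    print defined $hit ? "found: $hit\n" : "not found\n";
}
```

The die() message ends up in the "X-Died" header of the returned
response, so a caller can tell an early abort from a normal finish.
Keeping the whole buffer is fine here only because the interesting
text is near the top of the page.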

Re: LWP module - parse one line at a time (only download part of a page)

on 20.01.2006 22:05:35 by Paul Lalli

Alf McLaughlin wrote:
> I want to download a fairly large amount of data from a webpage
> (~10MB), but the stuff I'm really interested in is always toward the
> top of the page (however, I don't know exactly where). Since I'm only
> interested in two or three lines, I don't want to download the whole
> page. I would like to download until I see what I want (such as my
> $current_line =~ /WHAT I WANT/) and then kill the download.

I've never done this, but I wonder if these two references might point
you in the right direction:

http://www.w3.org/Protocols/rfc2616/rfc2616-sec14.html#sec14.35
http://search.cpan.org/~gaas/libwww-perl-5.805/lib/LWP/UserAgent.pm#REQUEST_METHODS
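
The first link is the HTTP Range header. A rough sketch of how the two
might be combined, assuming the server honours byte ranges (it is free
to ignore them and send the whole page). The helper name
first_bytes_range and the 64KB figure are only illustrative:

```perl
use strict;
use warnings;

# Hypothetical helper: build a Range header value asking for the
# first $n bytes of a resource (byte positions are zero-based).
sub first_bytes_range {
    my ($n) = @_;
    die "byte count must be a positive integer\n"
        unless defined $n && $n =~ /^[1-9]\d*$/;
    return 'bytes=0-' . ($n - 1);
}

# Usage: perl range_fetch.pl URL
if (@ARGV) {
    require LWP::UserAgent;
    my $ua = LWP::UserAgent->new;

    # Extra key/value pairs passed to get() are sent as request headers.
    my $response = $ua->get($ARGV[0], Range => first_bytes_range(65536));

    # 206 means the server sent only the requested range;
    # 200 means it ignored the header and sent everything.
    printf "status %s, %d bytes received\n",
        $response->code, length($response->content);
    print "found\n" if $response->content =~ /WHAT I WANT/;
}
```

Checking for status 206 is what tells you whether the server actually
cut the response short.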

Paul Lalli

Re: LWP module - parse one line at a time (only download part of a page)

on 21.01.2006 01:16:39 by xhoster

"nobull@mail.com" wrote:
> Alf McLaughlin wrote:
>
> > I want to download a fairly large amount of data from a webpage
> > (~10MB), but the stuff I'm really interested in is always toward the
> > top of the page (however, I don't know exactly where). Since I'm only
> > interested in two or three lines, I don't want to download the whole
> > page. I would like to download until I see what I want (such as my
> > $current_line =~ /WHAT I WANT/) and then kill the download.
>
> Read the description of the get() method of LWP::UserAgent.

I think you mean request() rather than get().

> In particular note the existence of the callback and the bit where it
> says "The callback can abort the request by invoking die()."

This method is the direct answer to the OP's question, but he will have to
be careful to account for the chance that his desired string will span a
chunk boundary.
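
One way to deal with that, sketched under the assumption that the
desired string always sits on a single line: keep the unfinished tail
of each chunk and only test complete lines. make_line_feeder is a
made-up name, not part of LWP:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a callback suitable for ':content_cb'.
# It reassembles lines across chunk boundaries and calls
# $on_line->($line) once per complete line, without the line ending.
sub make_line_feeder {
    my ($on_line) = @_;
    my $partial = '';    # tail of earlier chunks that has no newline yet
    return sub {
        my ($chunk) = @_;
        $partial .= $chunk;
        # Peel off every complete line; what is left is carried over.
        while ($partial =~ s/^(.*?)\r?\n//) {
            my $line = $1;
            $on_line->($line);
        }
        return;
    };
}

# Usage: perl line_feeder.pl URL
if (@ARGV) {
    require LWP::UserAgent;
    my $feeder = make_line_feeder(sub {
        my ($line) = @_;
        if ($line =~ /WHAT I WANT/) {
            print "$line\n";
            die "done\n";    # propagates out of the callback and aborts the download
        }
    });
    LWP::UserAgent->new->get($ARGV[0], ':content_cb' => $feeder);
}
```

A last line with no trailing newline is never handed to the handler in
this sketch, which should not matter when the target is near the top
of the page.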

I think a simpler but less rigorous option would be to set the
$ua->max_size to his best guess of an upper limit on how far into the
response the desired string can be. But there is always the danger that
the upper limit turns out to be set too low, and you miss things that the
callback method would find. Of course, there is the corresponding hazard
that the guess will be set too high, and he will still be reading far more
data than necessary.
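
A sketch of the max_size route, with a guessed limit of 64KB that is
purely an assumption. fetch_head is a made-up name. LWP adds a
"Client-Aborted" header when it cuts a response off at the limit, so
the caller can at least tell when the guess may have been too low:

```perl
use strict;
use warnings;

# Hypothetical helper: fetch at most roughly $limit bytes of $url.
# Returns the content received and a flag saying whether LWP stopped
# early because the size limit was reached.
# Note: this changes max_size on the caller's user agent object.
sub fetch_head {
    my ($ua, $url, $limit) = @_;
    $ua->max_size($limit);
    my $response  = $ua->get($url);
    my $truncated = defined $response->header('Client-Aborted') ? 1 : 0;
    return ($response->content, $truncated);
}

# Usage: perl fetch_head.pl URL
if (@ARGV) {
    require LWP::UserAgent;
    my ($text, $truncated) =
        fetch_head(LWP::UserAgent->new, $ARGV[0], 64 * 1024);
    if ($text =~ /WHAT I WANT/) {
        print "found\n";
    }
    elsif ($truncated) {
        print "not in the first 64KB; the limit may be too low\n";
    }
    else {
        print "not found\n";
    }
}
```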

Xho
