site spider using mech

am 03.08.2005 19:30:56 von lists

Hey List-

using Perl 5.8 and the most recent mech release.

I wrote a site spider (internal site) that searches through all the HTML links
and looks at each page for various pieces of information.

Problem I'm running into is links that link to large PDF or PPT files and them
clogging up the works. I'm trying to figure out how to just download the
headers so I can determine if the file is HTML or not (via $mech->is_html() )
and if it isn't just skip it.

It seems the $mech->get($url) method still loads the whole file before I can
just look at the headers. The Mech docs say that the get function is
overloaded from the UserAgent function. Now, reading the docs on that it has
a size limiter. How can I limit the size of the download? I don't really
care as long as I just grab the headers.

I was also trying to move in the direction of just creating a new UserAgent
object and using the HTTP::Request function to grab just the headers.

Anyone have any better ideas?

Thanks.

Henrik
--
Henrik Hudson
lists@rhavenn.net

RTFM: Not just an acronym, it's the LAW!

Re: site spider using mech

am 03.08.2005 20:27:52 von jik

Use $mech->head instead of $mech->get and look at the content type of
the resulting response to find out whether it's something you want to
download.