Large-scale spidering

Large-scale spidering

on 14.04.2006 04:16:02 by marvin

Greets,

I've written an industrial-strength search engine library for Perl
(KinoSearch), and now I have clients who want me to work on a large-
scale spidering app for them. Sort of like Nutch for Perl (http://lucene.apache.org/nutch). Putch. :)

What efforts have already been undertaken in this area? A survey of
existing CPAN releases that I should study would be great. I've
written a small-scale spider using LWP::RobotUA. I've scanned over
the WWW::Mechanize docs, but don't yet grasp its full capabilities.
What else?
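
For the curious, that small-scale spider is not much more than the
following loop (a stripped-down sketch: the agent name, contact
address, and start URL are placeholders, and error handling and queue
persistence are left out):

#!/usr/bin/perl
use strict;
use warnings;

use LWP::RobotUA;
use HTML::LinkExtor;
use URI;

# LWP::RobotUA obeys robots.txt and rate-limits itself per host.
my $ua = LWP::RobotUA->new( 'MySpider/0.1', 'me@example.com' );
$ua->delay( 1 / 60 );    # delay() takes minutes, so this is one second

my @queue = ('http://www.example.com/');
my %seen;

while ( my $url = shift @queue ) {
    next if $seen{$url}++;
    my $response = $ua->get($url);
    next unless $response->is_success
        and $response->content_type eq 'text/html';

    # ... hand $response->content to the indexer here ...

    # Pull out links and enqueue them.
    my $extor = HTML::LinkExtor->new( undef, $response->base );
    $extor->parse( $response->content );
    $extor->eof;
    for my $link ( $extor->links ) {
        my ( $tag, %attr ) = @$link;
        next unless $tag eq 'a' and $attr{href};
        push @queue, URI->new( $attr{href} )->canonical->as_string;
    }
}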

Thanks,

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/

Re: Large-scale spidering

on 14.04.2006 13:26:56 by jcook713

Plucene, maybe? It's up on CPAN.

Justin

Marvin Humphrey wrote:

> Greets,
>
> I've written an industrial-strength search engine library for Perl
> (KinoSearch), and now I have clients who want me to work on a large-
> scale spidering app for them. Sort of like Nutch for Perl (http://lucene.apache.org/nutch). Putch. :)
>
> What efforts have already been undertaken in this area? A survey of
> existing CPAN releases that I should study would be great. I've
> written a small-scale spider using LWP::RobotUA. I've scanned over
> the WWW::Mechanize docs, but don't yet grasp its full capabilities.
> What else?
>
> Thanks,
>
> Marvin Humphrey
> Rectangular Research
> http://www.rectangular.com/
>
>

Re: Large-scale spidering

on 15.04.2006 19:04:02 by marvin

On Apr 14, 2006, at 4:26 AM, J Cook wrote:

> Plucene, maybe? It's up on CPAN.

I'm intimately acquainted with Plucene. I actually spent a week or
two hacking on it last August before deciding that its performance
issues could not be resolved without a complete overhaul which would
break the API.

http://www.rectangular.com/kinosearch/benchmarks.html

KinoSearch, like Plucene, is a text search engine library. In order
to write an industrial-strength spider a la Nutch, you need a lot
more than that: HTML::Parser, HTML::LinkExtor, LWP::RobotUA... I've
now discovered WWW::RobotRules::AnyDBM_File, which is going to be
very helpful. But there are a lot of other problems to be solved.
Check-summing page content to eliminate duplicate documents available
via multiple URLs. Managing crawl depth so that a spider doesn't
venture too deep into one domain and forget about all the others.
Eventually, if you want to get fancy, link analysis and pagerank.
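
Roughly the shape I have in mind, as a sketch (the modules are the
real CPAN ones; the agent name, contact address, DBM file, and depth
cutoff are placeholders, and link extraction and queueing are elided):

use strict;
use warnings;

use LWP::RobotUA;
use WWW::RobotRules::AnyDBM_File;
use Digest::MD5 qw(md5_hex);

# Persistent robots.txt cache, so rules survive between spider runs.
my $rules = WWW::RobotRules::AnyDBM_File->new( 'MySpider/0.1', 'robots.db' );
my $ua    = LWP::RobotUA->new( 'MySpider/0.1', 'me@example.com', $rules );

my %content_seen;    # MD5 of page body => URL it was first seen at
my %depth;           # URL              => distance from its seed

sub harvest {
    my ($url) = @_;
    my $d = $depth{$url} || 0;
    return if $d > 10;    # depth cutoff, so one domain can't eat the crawl

    my $response = $ua->get($url);
    return unless $response->is_success;

    # Checksum the body so the same document reached via several URLs
    # only gets indexed once.
    my $digest = md5_hex( $response->content );
    return if exists $content_seen{$digest};
    $content_seen{$digest} = $url;

    # ... index the content, extract links with HTML::LinkExtor,
    #     and enqueue each child with $depth{$child} = $d + 1 ...
}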

LWP::Parallel::RobotUA looks interesting. There's a bunch of stuff
under Bundle::LinkController, but it hasn't been updated in a while.
What else?
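
If LWP::Parallel::RobotUA pans out, the fetch loop would look
something like this. It's untested and written from memory of the
module's docs, so the register/wait usage and the limits I've picked
should be double-checked:

use strict;
use warnings;

use LWP::Parallel::RobotUA;
use HTTP::Request;

my $pua = LWP::Parallel::RobotUA->new( 'MySpider/0.1', 'me@example.com' );
$pua->max_hosts(10);    # talk to up to 10 hosts at once
$pua->max_req(3);       # at most 3 outstanding requests per host

# Register a batch of URLs, then let the UA schedule them politely.
my @batch = ('http://www.example.com/');    # would come from the crawl queue
for my $url (@batch) {
    my $error = $pua->register( HTTP::Request->new( GET => $url ) );
    warn "couldn't register $url\n" if $error;
}

# Block until everything is fetched (or 30 seconds of inactivity).
my $entries = $pua->wait(30);
for my $key ( keys %$entries ) {
    my $response = $entries->{$key}->response;
    # ... checksum, index, and extract links as in the serial version ...
}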

Marvin Humphrey
Rectangular Research
http://www.rectangular.com/