web crawling for books

on 25.11.2007 10:58:25 by alexxx.magni

I have a large list of my library's books,
and I would like to set up a Perl spider that searches the web for each
author/title pair and returns useful info I didn't put into
the records (editor, year, topic, ISBN, ...).
I have already written the basic structure of the spider, but I'm not sure
which site is best suited to such a search (bearing in mind that its
robots.txt must allow me access).
Which site would you suggest for such a task?

Thank you!


Alessandro Magni
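
The robots.txt constraint mentioned above can be checked offline before any page is fetched. The following is only a minimal sketch, written in Python for brevity (on the Perl side, the WWW::RobotRules module on CPAN covers the same ground). The robots.txt content, the example.org URLs and the "library-spider" agent name are all made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /search
Allow: /books/
"""


def build_checker(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule checker."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed(parser: RobotFileParser, url: str, agent: str = "library-spider") -> bool:
    """Return True if the given user agent may fetch the URL."""
    return parser.can_fetch(agent, url)


checker = build_checker(SAMPLE_ROBOTS_TXT)

if __name__ == "__main__":
    # Hypothetical URLs on a placeholder host.
    for url in ("http://example.org/books/123", "http://example.org/search?q=perl"):
        print(url, "->", "allowed" if allowed(checker, url) else "blocked")
```

In a real spider the rules would be loaded from the candidate site's own robots.txt rather than from an inline string, and a site whose search pages are disallowed would be ruled out as a source.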

Re: web crawling for books

on 25.11.2007 14:56:47 by Sherm Pendley

"alexxx.magni@gmail.com" writes:

> I have a large list of my library's books,
> and I would like to set up a Perl spider that searches the web for each
> author/title pair and returns useful info I didn't put into
> the records (editor, year, topic, ISBN, ...).
> I have already written the basic structure of the spider, but I'm not sure
> which site is best suited to such a search (bearing in mind that its
> robots.txt must allow me access).
> Which site would you suggest for such a task?

I'd suggest using Amazon's Web Services API for that. It'll give you back
structured data, which will be far easier to deal with than spidering sites
and scraping HTML.

Have a look at the Net::Amazon module on CPAN.
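
To illustrate why structured data is easier to handle than scraped HTML, here is a rough sketch in Python that fills the blank fields of a library record from an XML response. The XML layout, the field names and the placeholder values are invented for this example; they are not Amazon's actual response schema, and complete_record is a hypothetical helper, not part of Net::Amazon.

```python
import xml.etree.ElementTree as ET

# Hypothetical structured response with placeholder values (not a real API schema).
SAMPLE_RESPONSE = """\
<Item>
  <Title>Example Title</Title>
  <Author>Example Author</Author>
  <Publisher>Example Press</Publisher>
  <Year>1999</Year>
  <ISBN>0000000000</ISBN>
</Item>
"""


def complete_record(record: dict, response_xml: str) -> dict:
    """Return a copy of record with missing or empty fields filled from the response."""
    item = ET.fromstring(response_xml)
    merged = dict(record)
    for element in item:
        key = element.tag.lower()
        # Only fill fields the catalogue entry lacks; never overwrite existing data.
        if not merged.get(key):
            merged[key] = (element.text or "").strip()
    return merged


if __name__ == "__main__":
    my_record = {"title": "My Own Title Spelling", "author": "Example Author", "year": ""}
    print(complete_record(my_record, SAMPLE_RESPONSE))
```

Each field arrives under its own name, so no page-specific HTML parsing is needed, and the lookup does not break when a site changes its page layout.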

sherm--

--
WV News, Blogging, and Discussion: http://wv-www.com
Cocoa programming in Perl: http://camelbones.sourceforge.net

Re: web crawling for books

on 25.11.2007 14:59:41 by Spiros Denaxas

On Nov 25, 9:58 am, "alexxx.ma...@gmail.com" wrote:
> I have a large list of my library's books,
> and I would like to set up a Perl spider that searches the web for each
> author/title pair and returns useful info I didn't put into
> the records (editor, year, topic, ISBN, ...).
> I have already written the basic structure of the spider, but I'm not sure
> which site is best suited to such a search (bearing in mind that its
> robots.txt must allow me access).
> Which site would you suggest for such a task?
>
> Thank you!
>
> Alessandro Magni

Hi,

Speaking from experience, I think you will get higher-quality, more
relevant results from APIs than from scraping sites. For example, check
out the Amazon Web Services API at
http://www.amazon.com/AWS-home-page-Money/b?ie=UTF8&node=3435361
You could also potentially use http://books.google.com/.

Spiros

Re: web crawling for books

on 28.11.2007 20:49:39 by Adam Funk

On 2007-11-25, alexxx.magni@gmail.com wrote:

> I have a large list of my library's books,
> and I would like to set up a Perl spider that searches the web for each
> author/title pair and returns useful info I didn't put into
> the records (editor, year, topic, ISBN, ...).
> I have already written the basic structure of the spider, but I'm not sure
> which site is best suited to such a search (bearing in mind that its
> robots.txt must allow me access).
> Which site would you suggest for such a task?

You might want to look at Alexandria, which already does quite a bit
of this. It's written in Ruby, but the source code might give you
some ideas.

http://alexandria.rubyforge.org/