web crawling for books
on 25.11.2007 10:58:25 by alexxx.magni
I have a large list of my library's books,
and I would like to set up a Perl spider that searches the web for each
author/title and returns useful information I didn't put into
the records (editor, year, topic, ISBN, ...).
I have already written the spider's basic structure, but I'm not sure
which site is best suited to such a search (bearing in mind that its
robots.txt must allow me access).
Which site would you suggest for such a task?
Thank you!
Alessandro Magni
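
The robots.txt constraint mentioned above can be checked programmatically before any page is fetched. What follows is only a minimal sketch in Python, using the standard library's urllib.robotparser; in Perl, the WWW::RobotRules module from libwww-perl offers a comparable check. The robots.txt content, the "BookSpider" user-agent name, and the example.org URLs are all hypothetical and are inlined so that the sketch runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch runs offline.
# A real spider would download this from the candidate site instead.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a rule set that can be queried."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(parser: RobotFileParser, user_agent: str, urls: list) -> list:
    """Keep only the URLs that the rules allow this user agent to fetch."""
    return [url for url in urls if parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    candidates = [
        "http://example.org/books/some-title",   # hypothetical catalogue page
        "http://example.org/private/records",    # hypothetical disallowed page
    ]
    for url in allowed_urls(rules, "BookSpider", candidates):
        print("may fetch:", url)
```

Running the same filter against each candidate site's real robots.txt would show quickly which sites permit the spider at all.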
Re: web crawling for books
on 25.11.2007 14:56:47 by Sherm Pendley
"alexxx.magni@gmail.com" writes:
> I have a large list of my library's books,
> and I would like to setup a Perl spider, going on the web for each
> author/title information, and returning useful info I didnt put into
> the records (editor, year, topic, isbn, ...).
> I already wrote down the basic spider's structure, but I'm not sure
> which site is more apt to such a search (considering also that its
> robots.txt should allow me access).
> Which site would you suggest for such a task?
I'd suggest using Amazon's Web Services API for that. It'll give you back
structured data, which will be far easier to deal with than spidering sites
and scraping HTML.
Have a look at the Net::Amazon module on CPAN.
sherm--
--
WV News, Blogging, and Discussion: http://wv-www.com
Cocoa programming in Perl: http://camelbones.sourceforge.net
Re: web crawling for books
on 25.11.2007 14:59:41 by Spiros Denaxas
On Nov 25, 9:58 am, "alexxx.ma...@gmail.com" wrote:
> I have a large list of my library's books,
> and I would like to setup a Perl spider, going on the web for each
> author/title information, and returning useful info I didnt put into
> the records (editor, year, topic, isbn, ...).
> I already wrote down the basic spider's structure, but I'm not sure
> which site is more apt to such a search (considering also that its
> robots.txt should allow me access).
> Which site would you suggest for such a task?
>
> Thank you!
>
> Alessandro Magni
Hi,
Speaking from experience, I think you will get higher-quality,
more relevant results by using APIs instead of just
scraping sites. For example, check out the Amazon Web Services API at
http://www.amazon.com/AWS-home-page-Money/b?ie=UTF8&node=3435361
You could also potentially use http://books.google.com/.
Spiros
Re: web crawling for books
on 28.11.2007 20:49:39 by Adam Funk
On 2007-11-25, alexxx.magni@gmail.com wrote:
> I have a large list of my library's books,
> and I would like to setup a Perl spider, going on the web for each
> author/title information, and returning useful info I didnt put into
> the records (editor, year, topic, isbn, ...).
> I already wrote down the basic spider's structure, but I'm not sure
> which site is more apt to such a search (considering also that its
> robots.txt should allow me access).
> Which site would you suggest for such a task?
You might want to look at Alexandria, which already does quite a bit
of this. It's written in Ruby, but the source code might give you
some ideas.
http://alexandria.rubyforge.org/