scraping amazon
on 16.06.2005 02:22:50 by cawildma
I need some help.
I am trying to scrape a page on Amazon that lists all the sellers selling a
given book.
For some reason, a request to this page (see below) with LWP reports
success, but the response never has any content. Many other pages work: I
can get back the product detail page, the customer review page, etc.
Amazon's URLs are a maze. This URL alone can be written five ways that I
have found so far. Maybe that is because they don't want people grabbing
this page. I do not understand how my browser can get it when LWP can't,
though. It's still just HTML.
I realize these links redirect, but links other than the store listing page
redirect as well, and they still return the HTML.
Please let me know if you have a clue as to what is going on.
So, for example:
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

# This link doesn't return content to LWP but does work in a browser.
my $res = $ua->get('http://amazon.com/o/tg/stores/offering/list/-/0596004478');

# Check the outcome.
if ($res->is_success) {
    print $res->content;
}
else {
    print "Error: " . $res->status_line . "\n";
}
Chris Wildman
cawildma@ucsd.edu
Re: scraping amazon
on 16.06.2005 03:30:17 by apv
There are quite a few Amazon modules (I wrote my own, so I can't vouch for
any on CPAN), and if you get a dev kit you can hack away through
approved and fixed APIs.
http://search.cpan.org/search?query=amazon
http://www.amazon.com/gp/browse.html/?node=3435361
-Ashley
On Wednesday, June 15, 2005, at 06:59 PM, David Robins wrote:
> On Wednesday June 15, 2005 17:22, Chris Wildman wrote:
>> I am trying to scrape a page on amazon that lists all the sellers
>> selling a given book.
>
> Note that scraping pages may be against their terms of service
> (stupid, I know, but it could be); if you're just grabbing data for
> personal use they probably won't care.
>
>> For some reason making a request to this page(see below) with LWP
>> returns successful but there is never any content. Many other pages
>> work. I can get back the product detail page or the customer review
>> page, etc.
>
> You might want to set the User-Agent to something browser-y. If that
> doesn't work, diff the headers your browser sends and the ones LWP
> sends and alter until you find the differentiator.
>
> --
> Dave
> Isa. 40:31
Re: scraping amazon
on 16.06.2005 03:59:19 by dbrobins
On Wednesday June 15, 2005 17:22, Chris Wildman wrote:
> I am trying to scrape a page on amazon that lists all the sellers selling a
> given book.
Note that scraping pages may be against their terms of service (stupid, I
know, but it could be); if you're just grabbing data for personal use they
probably won't care.
> For some reason making a request to this page(see below) with LWP returns
> successful but there is never any content. Many other pages work. I can get
> back the product detail page or the customer review page, etc.
You might want to set the User-Agent to something browser-y. If that doesn't
work, diff the headers your browser sends and the ones LWP sends and alter
until you find the differentiator.
--
Dave
Isa. 40:31
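To make the header comparison concrete, here is a rough sketch of one way to see what LWP would send, without contacting the server. The helper name outgoing_headers and the 'Mozilla/5.0' agent string are only illustrative assumptions, not anything from the posts above; prepare_request is the LWP::UserAgent method that fills in the default headers. The output can then be compared by hand against whatever your browser sends.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

# Return the headers LWP would send for a GET of $url, as a string.
# Nothing is fetched; the request is only prepared.
# $agent is optional: a User-Agent string to try (hypothetical value below).
sub outgoing_headers {
    my ($url, $agent) = @_;
    my $ua = LWP::UserAgent->new;
    $ua->agent($agent) if defined $agent;

    my $req = HTTP::Request->new(GET => $url);
    $ua->prepare_request($req);    # copies the agent's default headers in
    return $req->headers->as_string;
}

my $url = 'http://amazon.com/o/tg/stores/offering/list/-/0596004478';
print "--- LWP default ---\n",  outgoing_headers($url);
print "--- browser-like ---\n", outgoing_headers($url, 'Mozilla/5.0');
```

Other headers a browser sends (Accept, Accept-Language and so on) could be added with $ua->default_header(...) one at a time until the response changes.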
Re: scraping amazon
on 28.06.2005 00:46:00 by steves06
David Robins wrote:
> You might want to set the User-Agent to something browser-y. If that doesn't
> work, diff the headers your browser sends and the ones LWP sends and alter
> until you find the differentiator.
I get content when I set my user agent to Mozilla, so that's
probably it. When I set it to a value we use internally
for automated GET clients, Amazon gives me a 204 (No Content) response.
--
Steve Sapovits steves06@comcast.net
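Since 204 is a 2xx status, is_success is true for it even though there is no body, which would fit the "successful but never any content" symptom in the original post. A minimal sketch of a check that reports that case explicitly follows; describe_response is a made-up helper name, and the response is built by hand so that nothing is fetched.

```perl
use strict;
use warnings;
use HTTP::Response;

# Classify a response so that "successful but empty" is reported
# separately from a real error and from a normal page.
sub describe_response {
    my ($res) = @_;
    return 'error: ' . $res->status_line unless $res->is_success;
    return 'empty: ' . $res->status_line
        if $res->code == 204 || !length($res->content);
    return 'ok: ' . length($res->content) . ' bytes';
}

# Hand-built 204 response, for illustration only.
my $empty = HTTP::Response->new(204, 'No Content');
print describe_response($empty), "\n";

# Against a live server it might be used like this (needs LWP::UserAgent;
# the agent string is just an example value):
#   my $ua = LWP::UserAgent->new(agent => 'Mozilla/5.0');
#   print describe_response($ua->get($url)), "\n";
```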