scraping amazon

on 16.06.2005 02:22:50 by cawildma

Need some help.

I am trying to scrape a page on Amazon that lists all the sellers selling a
given book.

For some reason, a request to this page (see below) with LWP returns
success, but there is never any content. Many other pages work: I can get
back the product detail page, the customer review page, etc.

Amazon's URLs are a maze. This URL alone can be written 5 ways that I have
found so far. Maybe that is because they don't want people grabbing this
page. I do not understand how my browser can get it and LWP can't, though.
It's still just HTML.

I realize these links redirect. But links other than the store listing page
redirect as well, and they still return the HTML.

Please let me know if you have a clue as to what is going on.

So, for example:

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
$ua->env_proxy;    # pick up proxy settings from the environment

# This link doesn't return content to LWP but does work in a browser.
my $res = $ua->get('http://amazon.com/o/tg/stores/offering/list/-/0596004478');

# Check the outcome.
if ($res->is_success) {
    print $res->content;
}
else {
    print "Error: " . $res->status_line . "\n";
}

Chris Wildman
cawildma@ucsd.edu

Re: scraping amazon

on 16.06.2005 03:30:17 by apv

There are quite a few Amazon modules (I wrote my own so can't vouch for
any on CPAN) and if you get a dev kit you can hack away through
approved and fixed APIs.

http://search.cpan.org/search?query=amazon
http://www.amazon.com/gp/browse.html/?node=3435361

-Ashley

On Wednesday, June 15, 2005, at 06:59 PM, David Robins wrote:

> On Wednesday June 15, 2005 17:22, Chris Wildman wrote:
>> I am trying to scrape a page on amazon that lists all the sellers
>> selling a given book.
>
> Note that scraping pages may be against their terms of service
> (stupid, I know, but it could be); if you're just grabbing data for
> personal use they probably won't care.
>
>> For some reason making a request to this page(see below) with LWP
>> returns successful but there is never any content. Many other pages
>> work. I can get back the product detail page or the customer review
>> page, etc.
>
> You might want to set the User-Agent to something browser-y. If that
> doesn't work, diff the headers your browser sends and the ones LWP
> sends and alter until you find the differentiator.
>
> --
> Dave
> Isa. 40:31

Re: scraping amazon

on 16.06.2005 03:59:19 by dbrobins

On Wednesday June 15, 2005 17:22, Chris Wildman wrote:
> I am trying to scrape a page on amazon that lists all the sellers selling a
> given book.

Note that scraping pages may be against their terms of service (stupid, I
know, but it could be); if you're just grabbing data for personal use they
probably won't care.

> For some reason making a request to this page(see below) with LWP returns
> successful but there is never any content. Many other pages work. I can get
> back the product detail page or the customer review page, etc.

You might want to set the User-Agent to something browser-y. If that doesn't
work, diff the headers your browser sends and the ones LWP sends and alter
until you find the differentiator.

--
Dave
Isa. 40:31
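
The User-Agent suggestion above could be tried roughly as follows. This is
only a sketch: the browser-like string in $browser_agent is an illustrative
assumption, not a value anyone in the thread confirmed. It uses
prepare_request to show the headers LWP would send without contacting
Amazon, so they can be compared against what a browser sends.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request;

my $url = 'http://amazon.com/o/tg/stores/offering/list/-/0596004478';

# Assumption: any browser-like string; this exact value is only illustrative.
my $browser_agent = 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.8) Gecko/20050511';

my $ua = LWP::UserAgent->new;
$ua->env_proxy;

# Remember what LWP identifies itself as by default (a libwww-perl string).
my $default_agent = $ua->agent;

# Replace the default User-Agent with the browser-like one.
$ua->agent($browser_agent);

# prepare_request fills in LWP's default headers but sends nothing,
# so the result can be diffed against the headers a browser sends.
my $req = $ua->prepare_request(HTTP::Request->new(GET => $url));

print "Default User-Agent: $default_agent\n";
print "Headers LWP would send:\n", $req->headers->as_string;

# To actually fetch the page, pass the prepared request on:
# my $res = $ua->request($req);
```

If the browser-like User-Agent alone does not help, further headers can be
added one at a time with $ua->default_header(...) until the response changes.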

Re: scraping amazon

on 28.06.2005 00:46:00 by steves06

David Robins wrote:

> You might want to set the User-Agent to something browser-y. If that doesn't
> work, diff the headers your browser sends and the ones LWP sends and alter
> until you find the differentiator.

I get content when I set my user agent to Mozilla, so that's
probably it. When I set it to a value we use internally
for automated GET clients, Amazon gives me a 204 (No Content)
response, which LWP counts as a success with an empty body.

--
Steve Sapovits steves06@comcast.net
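
A 204 would also explain the original symptom: it is a 2xx status, so
is_success is true even though there is no body. A minimal sketch of a
stricter check follows. The helper name has_body is hypothetical, and the
two responses are built locally for illustration rather than fetched from
Amazon.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hypothetical helper: true only for a successful response that is not a
# 204 and actually carries a non-empty body.
sub has_body {
    my ($res) = @_;
    return $res->is_success && $res->code != 204 && length $res->content;
}

# Simulated responses, constructed locally so nothing is sent over the network.
my $empty = HTTP::Response->new(204, 'No Content');
my $full  = HTTP::Response->new(200, 'OK', [], '<html>...</html>');

for my $res ($empty, $full) {
    printf "%s: is_success=%d has_body=%d\n",
        $res->status_line,
        $res->is_success ? 1 : 0,
        has_body($res)   ? 1 : 0;
}
```

In the original script, replacing the bare is_success test with a check like
this would make the empty response show up as a failure instead of printing
nothing.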