WWW::Mechanize -- bug in find_all_images

on 16.06.2007 23:13:43 by peter.stevens

Hi Andy,

This little script:

use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get ( "http://news.google.com" );
my @tImages = $mech->find_all_images( url_regex => qr/imgurl=/ );

Produces the following output:

Use of uninitialized value in pattern match (m//) at .../WWW/Mechanize.pm line 1053.
Use of uninitialized value in pattern match (m//) at .../WWW/Mechanize.pm line 1053.
Use of uninitialized value in pattern match (m//) at .../WWW/Mechanize.pm line 1053.
...
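
The warning can be reproduced without any network access. The following is only a minimal sketch, not code from WWW::Mechanize: it assumes nothing more than an undefined scalar on the left-hand side of a pattern match, and the variable names are made up for illustration.

```perl
use strict;
use warnings;

# Collect warnings instead of printing them, so they can be inspected.
my @warnings;
local $SIG{__WARN__} = sub { push @warnings, $_[0] };

my $url;    # undef: stands in for an image URL that was never set

# Matching undef against a regex is legal, but warns under "use warnings".
my $matched = $url =~ qr/imgurl=/ ? 1 : 0;

print "matched: $matched\n";
print "warning: $_" for @warnings;
```

One warning is collected per match attempt, which is why the real script prints the same line once for every affected image.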


This patch to v1.30 fixes the problem:

--- Mechanize.pm-1.30 2007-06-16 22:42:27.000000000 +0200
+++ Mechanize.pm 2007-06-16 22:59:21.000000000 +0200
@@ -1049,10 +1049,11 @@
# No conditions, anything matches
return 1 unless keys %$p;

- return if defined $p->{url} && !($image->url eq $p->{url} );
- return if defined $p->{url_regex} && !($image->url =~ $p->{url_regex} );
- return if defined $p->{url_abs} && !($image->url_abs eq $p->{url_abs} );
- return if defined $p->{url_abs_regex} && !($image->url_abs =~ $p->{url_abs_regex} );
+ return if defined $p->{url} && !($image->url && $image->url eq $p->{url} ); #[1]
+ return if defined $p->{url_regex} && !($image->url && $image->url =~ $p->{url_regex} );
+ return if defined $p->{url_abs} && !($image->url_abs && $image->url_abs eq $p->{url_abs} );
+ return if defined $p->{url_abs_regex} && !($image->url_abs && $image->url_abs =~ $p->{url_abs_regex} );
+
return if defined $p->{alt} && !(defined($image->alt) && $image->alt eq $p->{alt} );
return if defined $p->{alt_regex} && !(defined($image->alt) && $image->alt =~ $p->{alt_regex} );
return if defined $p->{tag} && !($image->tag && $image->tag eq $p->{tag} );


I'm not sure if all 4 lines really need the change - the second line
would fix my problem - but I put them in to be safe :-)
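
For illustration, the guard can be tried outside of Mechanize. This is a rough sketch under stated assumptions: `My::FakeImage` and `image_matches` are hypothetical names invented here, only the `url` accessor is modelled, and the sketch tests with `defined` instead of the plain truth test used in the patch, so that a URL of "0" would not be treated as missing.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an image object; only url() is modelled.
package My::FakeImage;
sub new { my ( $class, %args ) = @_; return bless {%args}, $class }
sub url { return $_[0]{url} }

package main;

# Returns true when the image satisfies the url / url_regex criteria in $p.
sub image_matches {
    my ( $image, $p ) = @_;

    # No conditions, anything matches
    return 1 unless keys %$p;

    my $url = $image->url;
    return 0 if defined $p->{url}       && !( defined $url && $url eq $p->{url} );
    return 0 if defined $p->{url_regex} && !( defined $url && $url =~ $p->{url_regex} );
    return 1;
}

my @images = (
    My::FakeImage->new( url => '/imgres?imgurl=a.png' ),
    My::FakeImage->new(),                      # url is undef
    My::FakeImage->new( url => '/logo.gif' ),
);

my @hits = grep { image_matches( $_, { url_regex => qr/imgurl=/ } ) } @images;
print scalar(@hits), " match(es)\n";
```

With the guard in place, the image without a URL is simply rejected and no warning is emitted.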

Cheers,

Peter



Re: WWW::Mechanize -- bug in find_all_images

on 16.06.2007 23:17:05 by Andy

On Jun 16, 2007, at 4:13 PM, Peter Stevens wrote:

> + return if defined $p->{url} && !($image->url && $image->url eq $p->{url} ); #[1]
> + return if defined $p->{url_regex} && !($image->url && $image->url =~ $p->{url_regex} );
> + return if defined $p->{url_abs} && !($image->url_abs && $image->url_abs eq $p->{url_abs} );
> + return if defined $p->{url_abs_regex} && !($image->url_abs && $image->url_abs =~ $p->{url_abs_regex} );

But why would $image->url come back as undef? That should be the
real thing to check.

--
Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance

Re: WWW::Mechanize -- bug in find_all_images

on 17.06.2007 05:27:01 by peter.stevens

Andy Lester wrote:
> But why would $image->url come back as undef? That should be the real
> thing to check.
>
Quite simple really. There are images which are not contained in an
... block. Again from news.google.com, here are two examples:




I like the first example because it is a pure placeholder. No real image
at all :-)

Cheers,
Peter


Re: WWW::Mechanize -- bug in find_all_images

on 17.06.2007 06:07:09 by Andy

On Jun 16, 2007, at 10:27 PM, Peter Stevens wrote:

> Quite simple really. There are images which are not contained in an
> ... block. Again from news.google.com, here are two examples:
>
>
>

It's not that they're not in tags. It's that the first one
doesn't have an src. That's bizarre. I'm not sure it's a behavior
I'm too worried about.

--
Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance

Re: WWW::Mechanize -- bug in find_all_images

on 17.06.2007 07:35:04 by peter.stevens

> It's not that they're not in tags. It's that the first one
> doesn't have an src. That's bizarre. I'm not sure it's a behavior
> I'm too worried about.

Sorry, you're right.

news.google.com uses src-less imgs 16 times, and that is exactly how many
warnings mech reports.
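
Until a fix is released, a caller could also filter out images without a URL before matching. The sketch below is only an assumption-laden illustration: `My::StubImage` and `partition_by_url` are hypothetical names, and the stub objects stand in for whatever `$mech->images` would return on a real page.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the image objects a real page would yield.
package My::StubImage;
sub new { my ( $class, $url ) = @_; return bless { url => $url }, $class }
sub url { return $_[0]{url} }

package main;

# Split images into those with a URL and those without (no src attribute).
sub partition_by_url {
    my @images = @_;
    my ( @with, @without );
    for my $img (@images) {
        if ( defined $img->url ) { push @with, $img }
        else                     { push @without, $img }
    }
    return ( \@with, \@without );
}

my @images = map { My::StubImage->new($_) }
    ( 'a.png', undef, 'imgres?imgurl=b.png', undef );

my ( $with, $without ) = partition_by_url(@images);

# Safe: every url in @$with is defined, so the match cannot warn.
my @matches = grep { $_->url =~ qr/imgurl=/ } @$with;

printf "%d without src, %d matching\n", scalar @$without, scalar @matches;
```

Counting the images in the second list is also a quick way to check that the number of src-less images equals the number of warnings.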

The patch keeps my log files much smaller. :-)

I do hope you will add it to the standard release.

Thanks

Peter
