Help parsing an HTML Doc using HTML::TreeBuilder

on 15.10.2004 17:29:49 by hillr

Hi All,

I am having a hard time extracting data from an HTML file I am
downloading from the web. What I want to do is extract the name and
jersey number from a soccer web page.

Here is what I have tried:

#!/usr/bin/perl
use strict;
use warnings;
use LWP::UserAgent;
use HTML::TreeBuilder;
my $ua = LWP::UserAgent->new;

my $url = 'http://www.coastsoccer.com:443/2004/P4619.HTM';
my $req = HTTP::Request->new(GET => "$url");

# send request
my $res = $ua->request($req);

# check the outcome
unless($res->is_success) {
    warn "Couldn't get $url: ", $res->status_line, "\n";
    return;
}

my $tree = HTML::TreeBuilder->new_from_content($res->content);
$tree->eof;

my $realtable = $tree->look_down(
    '_tag', 'table',
    sub {
        my $table = $_[0]->look_down('_tag','table');
        return 1 if $table->attr('cellpadding') =~ m{/6/};
        return 0; # otherwise bad
    }
);

And the error is:
F:\scripts>testxxx.pl
Use of uninitialized value in pattern match (m//) at F:\scripts\testxxx.pl line 27.

It seems that it does not recognize the cellpadding attr (which should be 6).
Can anyone tell me what I am doing wrong?

Thanks

Ron Hill

Re: Help parsing an HTML Doc using HTML::TreeBuilder

on 15.10.2004 22:08:16 by Robert

Hill, Ronald wrote:
[snip]
> my $realtable = $tree->look_down(
> '_tag', 'table',
> sub {
> my $table = $_[0]->look_down('_tag','table');
> return 1 if $table->attr('cellpadding') =~ m{/6/};
> return 0; # otherwise bad
> }
> );
[snip]
Looks like your $_[0]->look_down('_tag', 'table'); isn't finding anything.

BTW, do you mean to be looking down a second time? look_down, as I
understand it, will recurse through the child nodes until it finds the
first instance of what you are looking for.
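
The match in the quoted callback may also be worth a look. The message is a warning about an undefined value in the pattern match, not a fatal "Can't call method" error, so attr('cellpadding') is what comes back undefined, which is what attr() returns for a table that has no such attribute. Separately, with braces as delimiters, m{/6/} searches for the literal text "/6/". Here is a minimal sketch of both points using only core Perl; the helper name cellpadding_is_6 is made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: true when a cellpadding value is exactly 6.
# $value may be undef, which is what attr() gives for a missing attribute.
sub cellpadding_is_6 {
    my ($value) = @_;
    return 0 unless defined $value;    # avoids the "uninitialized" warning
    return $value =~ m/^\s*6\s*$/ ? 1 : 0;
}

# With braces as delimiters, the slashes are part of the pattern.
my $brace_form = ( '6' =~ m{/6/} ) ? 1 : 0;    # looks for the text "/6/"
my $plain_form = ( '6' =~ m{6} )   ? 1 : 0;    # looks for the text "6"

print "m{/6/} against '6': $brace_form\n";
print "m{6} against '6': $plain_form\n";
print "undef value: ", cellpadding_is_6(undef), "\n";
print "value '6': ", cellpadding_is_6('6'), "\n";
```

Inside the callback, a guard of the same shape (return 0 unless the attribute is defined) would keep tables without cellpadding from ever reaching the pattern match.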

--
In Reach Technology: http://www.inreachtech.net/

Robert G. Werner
robert@inreachtech.net

Tel: 559.304.5122

Windows NT -- it'll drive you buggy!

-- Gareth Barnard

"binary files" in mechanize

on 15.10.2004 22:16:48 by gedanken

I'm having a devil of a time debugging this.

I have two nearly identical requests for an up-to-now working module. Some
requests come back just fine. Others come back as a stream of binary
junk. The file is the correct size, however, and since the requests are
nearly identical, I suspect that the response I get is technically the
correct page, just in a format that is unusable. A sample would be:

$response = bless( {
'_protocol' => 'HTTP/1.0',
'_content' =>
'^_~K^H^@^@^@^@^@^@^Cí}ÛrÛH²à³õ^UÕ~X^X~QZI^T^Aêj~IìP[R[glË#Ó ÝÓãã
P~@D~Q~D^H^Bt^A^PÍ~^vÄþÏ~ÂþÌ~I}:~ß·}Ù̺^@~E^[IÓn~_~^3æL[@Uå¥ ²²²²^Ru9ûnw~WüH}Êì~H
:¤7\'?ÓÞ«)~Eç]2~J¢éã½½ÙlÖ~XÑ^~H©~M~0Ù#»»~]~M~M³Q4ñ:^[gO»Ï~_u 6^H9{õäöúe·CÈ^Fy°^Y~
YÌÇt~


Again, this is a 'known working' site. The vast majority of city codes I
use end up returning wonderful pages of proper HTML. It's just a few
anomalous city codes that return this mess.

If I grep it looking for phrases I expect to see on that page, were it
text, grep tells me that it IS a match but warns me that the file in
question is a binary file. So, for example:

#>grep "currcode=EUR" 0006.res
Binary file 0006.res matches


any ideas?


--
gedanken
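
The sample above appears to start with the bytes 0x1f 0x8b 0x08 (shown as ^_ ~K ^H), which is the gzip signature, so one plausible reading is that the server sent those pages gzip-compressed. That is a guess from the dump, not something confirmed by the response headers. Here is a minimal sketch, using only core modules, of detecting the signature and decompressing; the helper name maybe_gunzip and the sample markup are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use IO::Compress::Gzip     qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

# Hypothetical helper: return the body unchanged unless it starts with
# the gzip magic bytes 0x1f 0x8b, in which case return it decompressed.
sub maybe_gunzip {
    my ($body) = @_;
    return $body unless defined $body && $body =~ /\A\x1f\x8b/;
    my $plain;
    gunzip( \$body => \$plain )
        or die "gunzip failed: $GunzipError\n";
    return $plain;
}

# Stand-in for a compressed response body (made up for this sketch).
my $html = '<a href="rates?currcode=EUR">EUR</a>';
my $compressed;
gzip( \$html => \$compressed )
    or die "gzip failed: $GzipError\n";

print maybe_gunzip($compressed), "\n";    # compressed input is decoded
print maybe_gunzip($html), "\n";          # plain input passes through
```

If that is what is happening, the response's Content-Encoding header should say gzip. libwww-perl versions that provide decoded_content on the response object can apply the encoding for you, so calling that instead of content may be enough.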

Re: Help parsing an HTML Doc using HTML::TreeBuilder

on 19.10.2004 22:51:20 by emc

>...
> my $realtable = $tree->look_down(
> '_tag', 'table',
> sub {
> my $table = $_[0]->look_down('_tag','table');
> return 1 if $table->attr('cellpadding') =~ m{/6/};
> return 0; # otherwise bad
> }
> );

Try this:
my $realtable = $tree->look_down(_tag => 'table', cellpadding => 6);
or this:
my $realtable = $tree->look_down(_tag => 'table', cellpadding => qr/6/);
instead.

and for verification:
if ($realtable) {
    print $realtable->as_text, "\n", $realtable->as_HTML, "\n";
}
else {
    print "realtable was not there\n";
}
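
To try the attribute-criteria form without hitting the network, here is a small self-contained sketch. The markup and player names are invented and only assume that the roster lives in a nested table with cellpadding="6"; the real page may be laid out differently.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

# Made-up markup: an outer layout table without cellpadding and an
# inner roster table with cellpadding="6".
my $html = <<'HTML';
<html><body>
<table border="0"><tr><td>
  <table cellpadding="6">
    <tr><td>10</td><td>Player One</td></tr>
    <tr><td>7</td><td>Player Two</td></tr>
  </table>
</td></tr></table>
</body></html>
HTML

my $tree = HTML::TreeBuilder->new_from_content($html);

# Attribute criteria do not match elements that lack the attribute, so
# the outer table never reaches a comparison with an undefined value.
my $realtable = $tree->look_down( _tag => 'table', cellpadding => 6 );

my @rows;
if ($realtable) {
    # In list context look_down returns every matching element.
    for my $tr ( $realtable->look_down( _tag => 'tr' ) ) {
        my @cells = map { $_->as_trimmed_text } $tr->look_down( _tag => 'td' );
        push @rows, \@cells;
        print join( ' | ', @cells ), "\n";
    }
}
else {
    print "realtable was not there\n";
}

$tree->delete;    # free the tree when done
```

Each entry in @rows then holds the jersey number and the name for one row, assuming the columns come in that order.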