html parsing

on 02.05.2005 17:29:54 by malcolm.mill

Hi,
I'm trying to extract information from html like this...

http://www.rafb.net/paste/results/Ze4RTm27.html

I've tried modifying examples from the man pages for HTML::TokeParser
and HTML::TreeBuilder without much success.

I just want to identify such blocks of html by the attributes in the
child nodes; extract the text node under the first '',
extract the text node under the second '' as well as the href
attribute in the enclosed '' node,
store the output in a hash which I can pass to other functions or
print to a csv file.

If anyone can suggest anything while I read the docs and relevant
hacks in "Spidering Hacks" more carefully it would be appreciated.

Regards,
Malcolm.

RE: html parsing

on 02.05.2005 17:36:55 by Andrew.Johnson

You should consult O'Reilly's Perl and LWP for a good explanation of how to
use HTML::TokeParser. Here's some code that I wrote.

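The important part is that $token->[0] holds the token type ('S' for a
start tag, 'E' for an end tag, 'T' for text).
For a text token, $token->[1] holds the text; for a start or end tag it
holds the tag name.
For a start tag, $token->[4] holds the original source text of the tag.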
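
To make the token layout concrete, here is a minimal sketch that walks a small HTML string and describes each token. The sub name describe_tokens and the sample markup are made up for illustration; only the array positions come from the HTML::TokeParser documentation.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Return a one-line description of each start, end and text token
# found in an HTML string. (describe_tokens is a hypothetical helper.)
sub describe_tokens {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't parse HTML";
    my @out;
    while (my $token = $stream->get_token) {
        my $type = $token->[0];
        if ($type eq 'S') {
            # Start tag: ['S', $tag, \%attr, \@attrseq, $source_text]
            push @out, "start <$token->[1]> source=$token->[4]";
        }
        elsif ($type eq 'E') {
            # End tag: ['E', $tag, $source_text]
            push @out, "end </$token->[1]>";
        }
        elsif ($type eq 'T') {
            # Text: ['T', $text, $is_data]
            push @out, "text '$token->[1]'";
        }
    }
    return @out;
}

print "$_\n" for describe_tokens('<a href="/x">Link</a>');
```

Comment and declaration tokens ('C' and 'D') are skipped here; they carry their text in $token->[1] as well.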

Andrew Johnson
Marketing Writer
Elias/Savion Advertising
Phone: 412.642.7700 Fax 412.642.2277
www.elias-savion.com
andrew.johnson@elias-savion.com

use strict;
use warnings;
use LWP::UserAgent;
use HTML::TokeParser;

# Not shown in the original post; assumed to be a plain LWP::UserAgent.
my $browser = LWP::UserAgent->new;

# Return a header value made safe for a CSV field, or 'N/A' if it is missing.
sub csv_header
{
    my ($response, $name) = @_;
    my $header = $response->header($name);
    return 'N/A' if !$header;
    $header =~ s/,/;/g;
    $header =~ s/\n/ /g;
    return $header;
}

# Read one URL per line from the file named in $_[0], fetch each page,
# and write one row per URL to data.csv.
sub Report
{
    open(my $articles, '<', $_[0]) || die "Couldn't open $_[0]: $!";
    open(my $data, '>', 'data.csv') || die "Couldn't open data.csv: $!";
    while (my $url = <$articles>)
    {
        chomp $url;
        my $count     = 0;    # words in text tokens that mention a keyword
        my $numtokens = 0;    # text tokens that mention a keyword
        my $response  = $browser->get($url);
        die "Error getting: ", $response->status_line,
            $response->headers_as_string
            if $response->is_error;
        my $content = $response->content;
        my $stream  = HTML::TokeParser->new(\$content)
            || die "Couldn't read HTML $content BLAH BLAH LAH";

        print $data csv_header($response, 'X-META-PUB-DATE'), ",";
        print $data "BusinessWeek,";
        print $data csv_header($response, 'X-META-AUTHOR'), ",";
        print $data csv_header($response, 'X-META-HEADLINE'), ",";

        my %keyfinds;
        while (my $token = $stream->get_token)
        {
            next unless $token->[0] eq 'T';    # text tokens only
            next unless $token->[1] =~ /(BUSH|CLINTON)/;
            $keyfinds{$1} += 1;
            $numtokens++;
            # Count the non-empty whitespace-separated words in this token.
            $count += grep { $_ ne '' } split(/\s/, $token->[1]);
        }
        my $value = $count * 3885 / 20;
        print $data "$count,$numtokens,977128,$value";

        # Find the keyword that matched most often.
        my $highest = 0;
        my $highstring;
        foreach my $key (keys %keyfinds)
        {
            if ($keyfinds{$key} > $highest)
            {
                $highest    = $keyfinds{$key};
                $highstring = $key;
            }
        }
        if ($highstring)
        {
            print $data ",$highstring,$highest,$url\n";
        }
        else
        {
            print $data ",,,$url\n";
        }
    }
    close $data;
    close $articles;
}
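
For the original question (text from one cell, text and href from the next, stored in a hash), a shorter route may be get_tag and get_trimmed_text. The pasted HTML is not reproduced in this thread, so the markup below is only a guess: the td elements and the class names "name" and "link" are hypothetical, as is the sub name extract_rows.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Collect one hash per pair of cells. The class names "name" and "link"
# are assumptions; substitute whatever attributes identify the real cells.
sub extract_rows {
    my ($html) = @_;
    my $stream = HTML::TokeParser->new(\$html)
        or die "Couldn't parse HTML";
    my (@rows, %row);
    # get_tag returns [$tag, \%attr, \@attrseq, $source_text] for start tags.
    while (my $tag = $stream->get_tag('td')) {
        my $class = $tag->[1]{class} || '';
        if ($class eq 'name') {
            # Text node under the first cell.
            %row = (name => $stream->get_trimmed_text('/td'));
        }
        elsif ($class eq 'link') {
            # href and text of the link enclosed in the second cell.
            my $anchor = $stream->get_tag('a') or last;
            $row{href} = $anchor->[1]{href};
            $row{text} = $stream->get_trimmed_text('/a');
            push @rows, {%row};
        }
    }
    return @rows;
}

my $html = '<table><tr><td class="name">Widget</td>'
         . '<td class="link"><a href="http://example.com/w">Details</a></td>'
         . '</tr></table>';
for my $row (extract_rows($html)) {
    print join(',', @{$row}{qw(name text href)}), "\n";
}
```

The join above does no quoting, so a comma inside a field would break the row; a CSV module such as Text::CSV would be the safer choice for real data.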

-----Original Message-----
From: Malcolm Mill [mailto:malcolm.mill@gmail.com]
Sent: Monday, May 02, 2005 11:30 AM
To: libwww@perl.org
Subject: html parsing