Problem with body text extraction with HTML::Parser
on 13.12.2005 14:28:42 by Perl_user
Hi,
I have been using HTML::Parser to extract the textual data from an HTML
document.
I am using the following code:
my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&a_start_handler, "self,tagname"],
    report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
);
$p->parse_file($file || die) || die $!;

sub a_start_handler
{
    my($self, $tag) = @_;
    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&text);
    $self->handler(end => \&a_end_handler, "self,tagname,text");
}

sub text
{
    my($self, $tag) = @_;
    my $text=@{$self->handler("text")};
}

sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}
This code reports the title and headings from the page. It works, but I have a
problem getting the body text separately, as it isn't contained inside
HTML tags that I can report.
Web-site...
------------
First Type Header
Second Type Header
This is the main body of the text. It will be concidered as the article.
Blah Blah blah
------------
Any ideas appreciated.
Output with reported tags [title h1 h2 h3 h4 h5 h6]
--
title
This is the title of the webpage.
h1
First type
h2
Second Type Header
h3
Third Type Header
--
Output with reported tag
--
This is the title of the webpage. It is a mess.
First type header
Second Type Header
Third Type Header
This is the main body of the text. It will be concidered as the article.
--
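To make the goal concrete, here is a minimal sketch of one possible way to keep the title and heading text apart from the remaining body text. It is not the code above: it drops report_tags so that text events outside the reported elements are still seen, and it only relies on the api_version 3 start_h, end_h and text_h handlers with the "tagname" and "dtext" argspecs. The helper name split_headings_and_body is hypothetical, and treating everything outside title, h1..h6, script and style as body text is an assumption about what counts as "the article".

```perl
use strict;
use warnings;
use HTML::Parser;

# Hypothetical helper (not part of HTML::Parser): returns a reference to a
# list of [tag, text] pairs for title/h1..h6, plus one string holding all
# other text found in the document.
sub split_headings_and_body {
    my ($html) = @_;
    my %is_heading = map { $_ => 1 } qw(title h1 h2 h3 h4 h5 h6);
    my %is_skipped = map { $_ => 1 } qw(script style);
    my (@headings, @body);
    my $current;     # name of the heading element we are inside, if any
    my $skip = 0;    # non-zero while inside <script> or <style>

    my $p = HTML::Parser->new(
        api_version => 3,
        start_h => [ sub {
            my ($tag) = @_;
            if    ($is_heading{$tag}) { $current = $tag; push @headings, [$tag, '']; }
            elsif ($is_skipped{$tag}) { $skip++; }
        }, 'tagname' ],
        end_h => [ sub {
            my ($tag) = @_;
            if    (defined $current && $tag eq $current) { undef $current; }
            elsif ($is_skipped{$tag} && $skip)           { $skip--; }
        }, 'tagname' ],
        text_h => [ sub {
            my ($text) = @_;
            return if $skip;                       # ignore script/style content
            if (defined $current) { $headings[-1][1] .= $text; }
            else                  { push @body, $text; }
        }, 'dtext' ],
    );
    $p->parse($html);
    $p->eof;

    # Collapse runs of whitespace and trim both ends.
    my $tidy = sub { my $s = shift; $s =~ s/\s+/ /g; $s =~ s/^ | $//g; return $s; };
    $_->[1] = $tidy->($_->[1]) for @headings;
    return (\@headings, $tidy->(join ' ', @body));
}
```

Calling my ($headings, $body) = split_headings_and_body($html) would then give the tag/text pairs for the first kind of output and the article text as a separate string. Pages that put navigation or footer text outside the headings would need extra filtering, which this sketch does not attempt.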
with regards,
Perlusr