Problem with body text extraction with HTML::Parser

on 13.12.2005 14:28:42 by Perl_user

Hi,

I have been using HTML::Parser to extract the textual data from an HTML
document.

I am using the following code:

my $p = HTML::Parser->new(api_version => 3,
    start_h => [\&a_start_handler, "self,tagname"],
    report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
);
$p->parse_file($file || die) || die $!;

sub a_start_handler
{
    my($self, $tag) = @_;
    $self->handler(text => [], '@{dtext}' );
    $self->handler(start => \&text);
    $self->handler(end => \&a_end_handler, "self,tagname,text");
}

sub text
{
    my($self, $tag) = @_;
    my $text=@{$self->handler("text")};
}

sub a_end_handler
{
    my($self, $tag) = @_;
    my $text = join("", @{$self->handler("text")});
    $self->handler("text", undef);
    $self->handler("start", \&a_start_handler);
    $self->handler("end", undef);
}

which reports the title and headers from the page. This works, but I have a
problem getting the body text (separately), as it isn't contained inside
HTML tags that I can report.


Web-site...
------------


This is the title of the webpage.


First Type Header


Second Type Header


This is the main body of the text. It will be considered as the article.
Blah Blah blah


------------

Any ideas appreciated.


Output with reported tags [title h1 h2 h3 h4 h5 h6]
--
title
This is the title of the webpage.
h1
First type
h2
Second Type Header
h3
Third Type Header
--

Output with reported tag
--
This is the title of the webpage. It is a mess.
First type header
Second Type Header
Third Type Header
This is the main body of the text. It will be considered as the article.
--

with regards,
Perlusr

Re: Problem with body text extraction with HTML::Parser

on 01.01.2006 02:18:06 by Jim Keenan

Perl_user wrote:
> Hi,
>
> I have been using HTML::Parser to extract the textual data from an HTML
> document
>
> I am using the following code:
>
> my $p = HTML::Parser->new(api_version => 3,
> start_h => [\&a_start_handler, "self,tagname"],
> report_tags => [qw(title h1 h2 h3 h4 h5 h6)],
> );
> $p->parse_file($file || die) || die $!;

Is this the entirety of your script? What comes next?

Re: Problem with body text extraction with HTML::Parser

on 01.01.2006 15:21:14 by Jim Keenan

Perl_user wrote:
> Hi,
>
> I have been using HTML::Parser to extract the textual data from an HTML
> document
>
> I am using the following code:
>
[snip]

>
> which reports the title and headers from the page. This works, but I have a
> problem getting the body text (separately), as it isn't contained inside
> HTML tags that I can report.
>
>
The code shown seems largely based on one of the examples provided in
the CPAN HTML::Parser documentation
(http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/hanchors). If you
look at one of the other sample scripts in the same location
(http://search.cpan.org/src/GAAS/HTML-Parser-3.48/eg/htext), you should
be able to work up a solution.
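
As a rough illustration of that direction (a minimal sketch, not the htext
script itself): one possible approach is to install a text handler for the
whole document and use the start and end handlers only to remember whether
the parser is currently inside one of the reported tags. The helper name
extract_text and the variables %is_reported and $current are made up for
this example, and it assumes the reported tags are not nested inside each
other.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical helper: splits the text of an HTML string into
# (\@reported, \@body), where @reported holds [tagname, text] pairs.
sub extract_text {
    my ($html) = @_;
    my %is_reported = map { $_ => 1 } qw(title h1 h2 h3 h4 h5 h6);
    my $current;               # reported tag we are currently inside, if any
    my (@reported, @body);

    my $p = HTML::Parser->new(
        api_version => 3,
        # Remember when a reported tag opens ...
        start_h => [ sub {
            my ($tag) = @_;
            $current = $tag if $is_reported{$tag};
        }, "tagname" ],
        # ... and forget it again when that same tag closes.
        end_h => [ sub {
            my ($tag) = @_;
            undef $current if defined $current && $tag eq $current;
        }, "tagname" ],
        # Every piece of text goes either to the reported list or to the body.
        text_h => [ sub {
            my ($text) = @_;
            $text =~ s/\s+/ /g;   # collapse runs of whitespace
            $text =~ s/^ //;
            $text =~ s/ $//;
            return unless length $text;
            if (defined $current) { push @reported, [ $current, $text ]; }
            else                  { push @body, $text; }
        }, "dtext" ],
    );
    $p->unbroken_text(1);                    # do not split text runs
    $p->ignore_elements(qw(script style));   # skip non-visible content
    $p->parse($html);
    $p->eof;
    return (\@reported, \@body);
}
```

Calling extract_text on the contents of the page should then give the
title and header text in the first list and everything else in the second,
which can be joined with spaces to form the article text.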

Jim Keenan