Possible bug in HTML::Parser

on 16.11.2005 02:05:19 by mark

Hello.

I am using the HTML::Parser module to parse a list of bookmarks
exported from the Firefox browser. Firefox exports bookmarks to an
HTML file containing nested definition lists.

I have discovered that when the parser encounters a bookmark
whose name ends in a closing parenthesis, the closing parenthesis
is stripped. (Bookmark names are coded as definition terms, using
the <DT> tag.)

A sample of the code being parsed looks like this:

ID="rdf:#$.GjDP">Google (search engine)

The decoded text passed to the handler by HTML::Parser
would be "Google (search engine".

Any ideas whether this is a bug in HTML::Parser, or should I
take another look at my code?

Thanks
-Mark

Re: Possible bug in HTML::Parser

on 16.11.2005 09:37:40 by Bart Lateur

Mark wrote:

>

>ID="rdf:#$.GjDP">Google (search engine)
>
>The decoded text passed to the handler by HTML::Parser
>would be "Google (search engine".

I've tried it with HTML::TokeParser::Simple, which is built on top of
HTML::Parser, and it comes out well:

$html = << '--';

ID="rdf:#$.GjDP">Google (search engine)
--
use HTML::TokeParser::Simple;
my $p = HTML::TokeParser::Simple->new( \$html );

while ( my $token = $p->get_token ) {
    print $token->as_is;
}

This prints:

ID="rdf:#$.GjDP">
Google (search engine)

>Any ideas whether this is a bug in HTML::Parser, or should I
>take another look at my code?

My guess is that you are only getting part of the text. You have to be
patient, because there is no guarantee at all that all of the text will
come out in one chunk. So probably the next time the text handler gets
called, the rest will come out... or at least part of it.
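
To illustrate, here is a minimal sketch of one way to cope with chunked
text: collect the pieces in a buffer and only use them once the end tag
arrives. The input string, the href value, and the names $buffer and
@names are made up for this example; they are not from Mark's code. The
input is fed in two pieces on purpose, so the text may reach the handler
in more than one call.

```perl
use strict;
use warnings;
use HTML::Parser ();

# Hypothetical input, split in the middle of the link text.
my @pieces = ('<a href="#">Google (search', ' engine)</a>');

my $buffer = '';   # text chunks collected for the current <a> element
my @names;         # completed link names

my $p = HTML::Parser->new(api_version => 3);

# Start of a link: reset the buffer.
$p->handler(start => sub { $buffer = '' if $_[0] eq 'a' }, "tagname");

# Text may arrive in several chunks, so append rather than assign.
$p->handler(text => sub { $buffer .= $_[0] }, "dtext");

# End of a link: the buffer now holds the whole name.
$p->handler(end => sub { push @names, $buffer if $_[0] eq 'a' }, "tagname");

$p->parse($_) for @pieces;
$p->eof;

print "$_\n" for @names;   # prints: Google (search engine)
```

If you would rather not buffer by hand, calling $p->unbroken_text(1)
should make the parser hold back text until the next non-text event, so
each run of text is reported in one piece.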

--
Bart.

Re: Possible bug in HTML::Parser

on 17.11.2005 18:03:59 by mark

"Bart Lateur" wrote:
>> Mark wrote:
>>
>>

>> ID="rdf:#$.GjDP">Google (search engine)
>>
>> The decoded text passed to the handler by HTML::Parser
>> would be "Google (search engine".
>
> I've tried it with HTML::TokeParser::Simple, which is built on top of
> HTML::Parser, and it comes out well:
>

Ok, I've replicated your example using HTML::TokeParser::Simple.
But I would sure hate to scrap the hours I just spent learning
HTML::Parser, and re-write with TokeParser. After all, TokeParser
was supposedly written to save people from having to learn
HTML::Parser!

Can anyone here identify the problem with HTML::Parser, or
perhaps my (mis)use of this module? If TokeParser is based on
HTML::Parser, then it seems odd that it does not encounter
the same problem (unless it works around it somehow).

Thanks
-Mark

Re: Possible bug in HTML::Parser

on 17.11.2005 18:57:45 by mark

On the other hand, the following test works fine.
So I guess I need to take a closer look at my code.

use strict;
use HTML::Parser ();

my $txt = << 'EOTEXT';

EOTEXT

my $p = HTML::Parser->new(api_version => 3);
$p->handler(text => \&text_handler, "dtext");
$p->parse($txt);

sub text_handler {
    print shift, "\n";
}

SOLVED: Possible bug in HTML::Parser

on 17.11.2005 19:12:49 by mark

And the culprit was... an incorrect regular expression in my code.

Feh.

Thanks in advance for not banning me.

-Mark

Re: Possible bug in HTML::Parser

on 17.11.2005 20:48:56 by Bart Lateur

Mark wrote:

>On the other hand, the following test works fine.
>So I guess I need to take a closer look at my code.
>
>
>use strict;
>use HTML::Parser ();
>
>my $txt = << 'EOTEXT';
>
>EOTEXT
>
>my $p = HTML::Parser->new(api_version => 3);
>$p->handler(text => \&text_handler, "dtext");
>$p->parse($txt);
>
>sub text_handler
>{print shift, "\n";}

Are you sure the original problem doesn't produce:

Google (search engine
)

?
That is, with the text handler being called twice?

--
Bart.