RT bug filed for HTML::Parser's HTML::TokeParser

on 25.04.2006 07:12:02 by ryantate

Hello,

I have filed an RT bug on HTML::Parser due to an apparent bug in
HTML::TokeParser:

http://rt.cpan.org/Public/Bug/Display.html?id=18904

This is the email address listed in the distribution for bug reporting.
My apologies if this message is duplicative.

I have posted the message below in case you prefer to read it over
email.

Cheers
RT

Hello,

I enjoy HTML::TokeParser but just today noticed a flaw. When parsing
self-closing tags like <br /> (which is technically XHTML), the parser
fails to properly identify the tag when there is no internal space.

So, for <br /> the tag is correctly identified as "br" in the second
element of the token array returned by ->get_token. But the <br/> tag (no
internal space) is identified as "br/" in the second element of the token
array returned by ->get_token.

Note that self-closing tags are not required to have an internal space
in the XHTML spec, see

http://www.w3.org/TR/xhtml1/#h-4.6

Here is a test case which demonstrates the problem:

use strict;
use HTML::TokeParser;

my $htmlf = "line 1 is here<br>Now line 2<br />Now line 3<br/>Now
line 4";

my $parsed = HTML::TokeParser->new(\$htmlf);

while (my $token = $parsed->get_token) {
    if ($token->[0] eq 'S') {
        # start token: ["S", $tag, $attr, $attrseq, $text]
        print "start tag: " . $token->[1] . "(full text: '" . $token->[4] .
            "')\n";
    }
    elsif ($token->[0] eq 'E') {
        # end token: ["E", $tag, $text]
        print "end tag: " . $token->[1] . "(full text: '" . $token->[2] .
            "')\n";
    }
}

This outputs:

start tag: br(full text: '<br>')
start tag: br(full text: '<br />')
start tag: br/(full text: '<br/>')

This mis-identification of the tag name can cause problems when I'm
trying to filter for certain "allowed tags", for example in a message
board post, and I have named "br" as an allowed tag. Now I must also
identify "br/" as an allowed tag.

Hope this makes sense!

-Ryan Tate
ryantate@ryantate.com

Re: RT bug filed for HTML::Parser's HTML::TokeParser

on 25.04.2006 23:19:13 by gisle

"Ryan Tate" writes:

> I enjoy HTML::TokeParser but just today noticed a flaw. When parsing
> self-closing tags like <br /> (which is technically XHTML), the parser
> fails to properly identify the tag when there is no internal space.
>
> So, for <br /> the tag is correctly identified as "br" in the second
> element of the token array returned by ->get_token. But the <br/> tag (no
> internal space) is identified as "br/" in the second element of the token
> array returned by ->get_token.

The behaviour you want can be enabled by $p->empty_element_tags(1).
If only I could remember why I thought making this the default would
be a bad idea :(

--Gisle
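
For readers who want to try the suggestion above, here is a minimal sketch of how the option might be applied to a test string like the one in the report. The helper name start_tag_names and the sample input are made up for illustration; only HTML::TokeParser, get_token and empty_element_tags come from the thread.

```perl
use strict;
use warnings;
use HTML::TokeParser;

# Hypothetical helper: collect the names of all start tags in $html.
# $empty_element_tags is handed to the parser's empty_element_tags()
# option before any token is read.
sub start_tag_names {
    my ($html, $empty_element_tags) = @_;

    my $p = HTML::TokeParser->new(\$html)
        or die "could not create parser";
    $p->empty_element_tags($empty_element_tags);

    my @names;
    while (my $token = $p->get_token) {
        # Start tokens look like ["S", $tag, $attr, $attrseq, $text].
        push @names, $token->[1] if $token->[0] eq 'S';
    }
    return @names;
}

# Sample input with the three spellings discussed in the report.
my $html = "line 1<br>line 2<br />line 3<br/>line 4";

print "default:            ", join(", ", start_tag_names($html, 0)), "\n";
print "empty_element_tags: ", join(", ", start_tag_names($html, 1)), "\n";
```

With the option off, the last tag should come back as "br/" as described in the report; with it on, all three should be reported as "br". Note that with the option enabled the parser also generates an end-tag token for each empty element, so code that counts or balances end tags may need to account for that.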