RT bug filed for HTML::Parser's HTML::TokeParser
on 25.04.2006 07:12:02 by ryantate
Hello,
I have filed an RT bug on HTML::Parser due to an apparent bug in
HTML::TokeParser:
http://rt.cpan.org/Public/Bug/Display.html?id=18904
This is the email address listed in the distribution for bug reporting.
My apologies if this message is duplicative.
I have posted the message below in case you prefer to read it over
email.
Cheers
RT
Hello,
I enjoy HTML::TokeParser but just today noticed a flaw. When parsing
self-closing tags like <br /> (which is technically XHTML), the parser
fails to properly identify the tag when there is no internal space.
So, for <br /> the tag is correctly identified as "br" in the second
element of the token array returned by ->get_token. But the <br/> tag (no
internal space) is identified as "br/" in the second element of the token
array returned by ->get_token.
Note that self-closing tags are not required to have an internal space
in the XHTML spec, see
http://www.w3.org/TR/xhtml1/#h-4.6
Here is a test case which demonstrates the problem:
use strict;
use warnings;
use HTML::TokeParser;

# Three line breaks: plain, self-closing with a space, self-closing without.
my $htmlf = "line 1 is here<br>
Now line 2<br />
Now line 3<br/>
Now
line 4";
my $parsed = HTML::TokeParser->new(\$htmlf);
while (my $token = $parsed->get_token) {
    if ($token->[0] eq 'S') {
        # Start token: ["S", $tag, $attr, $attrseq, $text]
        print "start tag: " . $token->[1] . "(full text: '" . $token->[4] .
            "')\n";
    }
    elsif ($token->[0] eq 'E') {
        # End token: ["E", $tag, $text]
        print "end tag: " . $token->[1] . "(full text: '" . $token->[2] .
            "')\n";
    }
}
This outputs:
start tag: br(full text: '<br>')
start tag: br(full text: '<br />')
start tag: br/(full text: '<br/>')
This mis-identification of the tag name can cause problems when I'm
trying to filter for certain "allowed tags", for example in a message
board post, and I have named "br" as an allowed tag. Now I must also
identify "br/" as an allowed tag.
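
One possible workaround until the parser is fixed, as a minimal sketch: strip any trailing "/" from the reported tag name before checking it against the allowed list. The names %allowed, normalize_tag_name and is_allowed_tag are hypothetical helpers made up for this illustration and are not part of HTML::TokeParser; the tag name would be the second element of the token array, passed in by the caller.

```perl
use strict;
use warnings;

# Hypothetical whitelist; the tag names are illustrative only.
my %allowed = map { $_ => 1 } qw(br b i);

# Strip trailing slashes so that "br/" is treated the same as "br".
sub normalize_tag_name {
    my ($name) = @_;
    (my $clean = lc $name) =~ s{/+\z}{};
    return $clean;
}

# Return 1 if the (normalized) tag name is on the whitelist, 0 otherwise.
sub is_allowed_tag {
    my ($name) = @_;
    return exists $allowed{ normalize_tag_name($name) } ? 1 : 0;
}

for my $tag (qw(br br/ script)) {
    my $verdict = is_allowed_tag($tag) ? "allowed" : "rejected";
    print "$verdict: $tag\n";
}
```

With this in place, both "br" and "br/" pass the filter while "script" is still rejected, so the allowed list itself does not need a second entry for each tag.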
Hope this makes sense!
-Ryan Tate
ryantate@ryantate.com