Segfault using HTML::Parser and URI::URL

on 08.11.2004 16:58:41 by britzt

Hi there,

the following produces a segfault using the latest version of libwww.

It seems that HTML::Parser is marking non-UTF8 strings as UTF8 strings.

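use strict;
use warnings;
use HTML::TokeParser;
use LWP::Simple;
use URI::URL;

# $docurl was not defined in the original post; it is assumed here to be
# the URL of the fetched page, used as the base for relative links.
my $docurl = "http://www.aries.lu/site.php?section=movies";
my $data   = get($docurl);
die "could not fetch $docurl\n" unless defined $data;

my $tp = HTML::TokeParser->new(\$data);

while (my $token = $tp->get_token)
{
    my $ttype = shift @{ $token };

    if ($ttype eq "S") # start tag?
    {
        my ($tag, $attr, $attrseq, $rawtxt) = @{ $token };

        $tag = lc($tag);

        if ($tag eq "a")
        {
            my $a_href = $attr->{'href'};
            my $a_encl = $tp->get_trimmed_text("/$tag");
            next unless defined $a_href;
            print "$a_href\n";
            # resolve relative links against the base URL
            $a_href = url($a_href, $docurl)->abs if ($a_href ne "");
        }
    }
}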

or to see it:

#!/usr/bin/perl
use warnings;
use strict;
use Devel::Peek;
use HTML::Parser;
my $html = qq{<a title="&rsquo;\260">}; # tag name assumed; the markup was lost in the archive
my $p = HTML::Parser->new(api_version=>3,start_h=>[sub{Dump(shift->{title})}, "attr"]);
$p->parse($html);

Thibaut

Re: Segfault using HTML::Parser and URI::URL

on 09.11.2004 15:22:23 by gisle

Thibaut Britz writes:

> the following produces a segfault using the latest version of libwww.

I see segfaults with ActivePerl 810 but not with our latest builds.
What version of perl are you using? The segfault appears to be a bug
in perl, and I would like to find out whether it has really been
fixed.

> It seems that HTML::Parser is marking non-UTF8 strings as UTF8 strings.

Did you enable the Unicode support when you installed HTML-Parser? It
seems like this would be the only time this happens, but I want to be
sure.

> or to see it:
>
> #!/usr/bin/perl
> use warnings;
> use strict;
> use Devel::Peek;
> use HTML::Parser;
> my $html = qq{<a title="&rsquo;\260">}; # tag name assumed; the markup was lost in the archive
> my $p = HTML::Parser->new(api_version=>3,start_h=>[sub{Dump(shift->{title})}, "attr"]);
> $p->parse($html);

What output do you get?

Re: Segfault using HTML::Parser and URI::URL

on 09.11.2004 16:23:40 by britzt

Hi,

The first script segfaults and here is the output of the 2nd script:

SV = PV(0x813b4a0) at 0x813b360
REFCNT = 1
FLAGS = (POK,pPOK,UTF8)
PV = 0x8140dc8 "\342\200\231\260"\0Malformed UTF-8 character
(unexpected continuation byte 0xb0, with no preceding start byte) in
subroutine entry at test.pl line 7.
[UTF8 "\x{2019}\x{0}"]
CUR = 4
LEN = 9

I'm using perl 5.8.5 on a Linux machine, and yes, I do have the support
for decoding Unicode entities turned on. With it turned off, the
segfault doesn't happen.

(output if turned off:
SV = PV(0x813b4a0) at 0x813b360
REFCNT = 1
FLAGS = (POK,pPOK)
PV = 0x8140dc8 "\342\200\231\260"\0
CUR = 4
LEN = 9
)

Thibaut
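
The two dumps above contain the same four bytes and differ only in the UTF8 flag; with the flag set, the trailing \260 is a continuation byte with no start byte. A minimal sketch of how such an inconsistent scalar could be detected from Perl, assuming the core utf8::is_utf8 and utf8::valid functions. Encode::_utf8_on is used here only to simulate the faulty flag, and describe is a hypothetical helper name:

```perl
use strict;
use warnings;
use Encode ();

# Bytes from the Dump output: U+2019 encoded as UTF-8, then a lone 0xB0.
my $bytes = "\342\200\231\260";

# Simulate the faulty scalar: switch the UTF8 flag on without validating.
my $flagged = $bytes;
Encode::_utf8_on($flagged);

# Hypothetical helper: report the flag and whether the scalar is consistent.
sub describe {
    my ($s) = @_;
    return sprintf "flag=%s consistent=%s",
        (utf8::is_utf8($s) ? "on"  : "off"),
        (utf8::valid($s)   ? "yes" : "no");
}

print describe($bytes),   "\n";   # flag=off consistent=yes
print describe($flagged), "\n";   # flag=on consistent=no
```

utf8::valid only inspects the buffer, so the check itself does not run the malformed string through the regexp engine.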


Re: Segfault using HTML::Parser and URI::URL

on 10.11.2004 14:40:37 by gisle

The following patch should make sure that HTML::Parser does not
produce badly encoded SVs. That avoids the problem demonstrated, but I
still need to track down why perl itself segfaulted because of this.

Regards,
Gisle

Index: util.c
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/util.c,v
retrieving revision 2.20
retrieving revision 2.21
diff -u -p -r2.20 -r2.21
--- util.c 8 Nov 2004 14:14:35 -0000 2.20
+++ util.c 10 Nov 2004 13:32:56 -0000 2.21
@@ -209,23 +209,21 @@ decode_entities(pTHX_ SV* sv, HV* entity
     }

     if (!SvUTF8(sv) && repl_utf8) {
-        STRLEN len = t - SvPVX(sv);
-        if (len) {
-            /* need to upgrade the part that we have looked though */
-            STRLEN old_len = len;
-            char *ustr = bytes_to_utf8(SvPVX(sv), &len);
-            STRLEN grow = len - old_len;
-            if (grow) {
-                /* XXX It might already be enough gap, so we don't need this,
-                   but it should not hurt either.
-                */
-                grow_gap(aTHX_ sv, grow, &t, &s, &end);
-                Copy(ustr, SvPVX(sv), len, char);
-                t = SvPVX(sv) + len;
-            }
-            Safefree(ustr);
-        }
+        /* need to upgrade sv before we continue */
+        STRLEN before_gap_len = t - SvPVX(sv);
+        char *before_gap = bytes_to_utf8(SvPVX(sv), &before_gap_len);
+        STRLEN after_gap_len = end - s;
+        char *after_gap = bytes_to_utf8(s, &after_gap_len);
+
+        sv_setpvn(sv, before_gap, before_gap_len);
+        sv_catpvn(sv, after_gap, after_gap_len);
         SvUTF8_on(sv);
+
+        Safefree(before_gap);
+        Safefree(after_gap);
+
+        s = t = SvPVX(sv) + before_gap_len;
+        end = SvPVX(sv) + before_gap_len + after_gap_len;
     }
     else if (SvUTF8(sv) && !repl_utf8) {
         repl = bytes_to_utf8(repl, &repl_len);
Index: t/uentities.t
===================================================================
RCS file: /cvsroot/libwww-perl/html-parser/t/uentities.t,v
retrieving revision 1.8
retrieving revision 1.9
diff -u -p -r1.8 -r1.9
--- t/uentities.t 8 Nov 2004 14:14:42 -0000 1.8
+++ t/uentities.t 10 Nov 2004 13:33:03 -0000 1.9
@@ -14,7 +14,7 @@ unless (&HTML::Entities::UNICODE_SUPPORT
     exit;
 }

-print "1..13\n";
+print "1..14\n";

 print "not " unless decode_entities("&euro") eq "\x{20AC}";
 print "ok 1\n";
@@ -90,3 +90,6 @@ print "ok 12\n";

 print "not " unless decode_entities("�") eq chr(0xFFFD);
 print "ok 13\n";
+
+print "not " unless decode_entities("\260&rsquo;\260") eq "\x{b0}\x{2019}\x{b0}";
+print "ok 14\n";
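
The idea behind the util.c change can be illustrated at the Perl level. This is only a sketch of what the diff appears to do, not the module's code: latin1_to_utf8_bytes is a hypothetical stand-in for the C function bytes_to_utf8(), and the "old" result models the removed lines, which upgraded only the part of the string before the gap before the UTF8 flag was set.

```perl
use strict;
use warnings;

# Hypothetical stand-in for perl's C-level bytes_to_utf8(): take Latin-1
# bytes and return their UTF-8 encoding as a plain byte string.
sub latin1_to_utf8_bytes {
    my ($bytes) = @_;
    my $copy = $bytes;
    utf8::encode($copy);
    return $copy;
}

my $before = "\260";            # part already scanned (before the gap)
my $after  = "\260";            # part not yet scanned (after the gap)
my $repl   = "\342\200\231";    # UTF-8 bytes of U+2019, the decoded &rsquo;

# Old behaviour, as the removed lines suggest: only the scanned part is upgraded.
my $old = latin1_to_utf8_bytes($before) . $repl . $after;

# Patched behaviour: both parts are upgraded before the flag is turned on.
my $new = latin1_to_utf8_bytes($before) . $repl . latin1_to_utf8_bytes($after);

# utf8::decode() returns false and leaves the string untouched when the
# bytes are not well-formed UTF-8.
my $old_ok = utf8::decode($old) ? 1 : 0;
my $new_ok = utf8::decode($new) ? 1 : 0;

print "old well-formed: $old_ok\n";   # old well-formed: 0
print "new well-formed: $new_ok\n";   # new well-formed: 1
print "new matches test 14\n" if $new eq "\x{b0}\x{2019}\x{b0}";
```

The last line compares against the same expected string as the new test 14 in t/uentities.t.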

Re: Segfault using HTML::Parser and URI::URL

on 10.11.2004 18:56:23 by gisle

Gisle Aas writes:

> The following patch should make sure that HTML::Parser does not
> produce badly encoded SVs. That avoids the problem demonstrated, but I
> still need to track down why perl itself segfaulted because of this.

Perl crashed because the regexp engine did not deal properly with bad
UTF8. This will be fixed in perl-5.8.6 by this patch:

http://public.activestate.com/cgi-bin/perlbrowse?patch=23261

Regards,
Gisle
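
For code that has to run on a perl without that fix and against an unpatched HTML::Parser, one possible guard is to repair attribute values before handing them to regex-based modules such as URI. This is a sketch that is not taken from the thread: repair_bad_utf8 is a hypothetical helper, and replacing malformed bytes with U+FFFD is an assumed policy that relies on Encode's default lenient decoding.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: return a string whose UTF8 flag matches its content.
sub repair_bad_utf8 {
    my ($str) = @_;
    return $str if !defined $str || utf8::valid($str);
    Encode::_utf8_off($str);                # treat the buffer as raw bytes again
    return Encode::decode('UTF-8', $str);   # malformed bytes become U+FFFD
}

# Simulate the value from the bug report: valid U+2019, then a stray 0xB0.
my $bad = "\342\200\231\260";
Encode::_utf8_on($bad);

my $good = repair_bad_utf8($bad);
printf "U+%04X\n", ord($_) for split //, $good;
```

Whether substituting U+FFFD is acceptable depends on the application; decoding the bytes as ISO-8859-1 instead would keep every byte but change the meaning of the valid UTF-8 sequence.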