HTML::Parser modifies unicode characters

on 11.09.2004 21:37:49 by kaminsky

Hi,

It appears that HTML::Parser modifies some unicode characters while
parsing. The following program gives an example:

#########

#!/usr/bin/perl
use HTML::Parser;
use utf8;
open TEST, '>:utf8', 'word.txt';
my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
$p->parse("zespołów\n");
close TEST;

#########

After running it, 'word.txt' contains "zespołów" with the funny l and
the funny o following it transformed to something else. What am I doing
wrong?
I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.

Thanks,
Moshe



Re: HTML::Parser modifies unicode characters

on 12.09.2004 00:53:30 by Dom

Moshe Kaminsky wrote:
> It appears that HTML::Parser modifies some unicode characters while
> parsing. The following program gives an example:
>
> #########
>
> #!/usr/bin/perl
> use HTML::Parser;
> use utf8;
> open TEST, '>:utf8', 'word.txt';
> my $p = new HTML::Parser text_h => [sub {print TEST shift}, 'text'];
> $p->parse("zespołów\n");
> close TEST;
>
> #########
>
> After running it, 'word.txt' contains "zespołów" with the funny l and
> the funny o following it transformed to something else. What am I doing
> wrong?
> I'm running: perl 5.8.5, HTML::Parser version 3.36 on linux.

It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
nasty tendency to do this. :(

Thankfully the workaround is fairly simple. Add "use Encode" to the top
of the script, and change the callback slightly:

sub { print TEST decode_utf8(shift) }

seems to work ok here.
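
A minimal sketch of what a lost UTF-8 flag does to the output, using only core modules. HTML::Parser itself is not involved here: the encode_utf8() call is an assumption standing in for the flag-less text that the handler receives, and written_octets() is a hypothetical helper that writes through a ':utf8' layer into memory.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                  # string literals below are UTF-8 source text
use Encode qw(encode_utf8 decode_utf8);

my $chars = "zespołów";                    # 8 characters, UTF-8 flag on
my $bytes = encode_utf8($chars);           # 10 octets, flag off (assumed handler argument)

# Hypothetical helper: write a string through a :utf8 layer into memory
# and return the octets that would have ended up in the file.
sub written_octets {
    my ($string) = @_;
    my $buffer = '';
    open my $fh, '>:utf8', \$buffer or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close $fh;
    return $buffer;
}

my $garbled = written_octets($bytes);               # each high octet is encoded a second time
my $fixed   = written_octets(decode_utf8($bytes));  # decoded first, so encoded exactly once

printf "garbled: %d octets, fixed: %d octets\n", length($garbled), length($fixed);
```

The flag-less string is treated as Latin-1 characters by the ':utf8' layer, so its four high octets are each expanded again; decoding first avoids that.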

-Dom

Re: HTML::Parser modifies unicode characters

on 12.09.2004 07:33:09 by kaminsky

* Dominic Mitchell [12/09/04 01:53]:
> It looks like HTML::Parser is losing the UTF-8 flag. XS modules have a
> nasty tendency to do this. :(
>
> Thankfully the workaround is fairly simple. Add "use Encode" to the top
> of the script, and change the callback slightly:
>
> sub { print TEST decode_utf8(shift) }
>
> seems to work ok here.

Thanks! That actually works. However, my real situation is that I'm
using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and
HTML::Parser. So to fix the problem, it appears that the only way is to
modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser
are aware of this problem, and if so, why don't they do this
automatically (or at least add an option to do it automatically) before
giving the text to the handler?

Anyway, thanks again.
Moshe


--
I love deadlines. I like the whooshing sound they make as they fly by.
-- Douglas Adams

Moshe Kaminsky
Home: 08-9456841



Re: HTML::Parser modifies unicode characters

on 13.09.2004 10:53:32 by Dom

Moshe Kaminsky wrote:

> Thanks! That actually works. However, my real situation is that I'm
> using HTML::FormatText, which uses, eventually, HTML::TreeBuilder and
> HTML::Parser. So to fix the problem, it appears that the only way is to
> modify HTML::TreeBuilder? I wonder if the maintainers of HTML::Parser
> are aware of this problem, and if so, why don't they do this
> automatically (or at least add an option to do it automatically) before
> giving the text to the handler?

Hmmm, it's a known problem:

http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS

It doesn't look unsolvable, but it's slightly beyond my XS skills. The
key is indicating the character encoding of what you're parsing, but
that's sometimes difficult to determine in advance (think HTML meta tags).

As to how to fix it via HTML::FormatText, I'm not sure. You'd need to
read through the code to find out what it's doing and fix it at an
appropriate point. But perhaps there is another way. Instead of
writing out to a file, can you write to an in-memory string? If so,
that string would hold UTF-8 octets without the UTF-8 flag set. So you
could fix that by running decode_utf8() over the string before writing
it to a file. Or simply write the file out without any encoding layer,
which would do no transformation of the UTF-8 bytes.
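
The two options can be compared side by side in a rough sketch like the one below. It assumes the accumulated result really is a string of UTF-8 octets without the flag (simulated here with encode_utf8(), not produced by HTML::FormatText), and it writes to in-memory filehandles so the outputs can be compared.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8 decode_utf8);

# Stand-in for the formatter output: UTF-8 octets without the flag
# (an assumption made for illustration).
my $result = encode_utf8("zespołów\n");

# Option 1: decode the whole string once, then write through a :utf8 layer.
my $via_decode = '';
open my $out1, '>:utf8', \$via_decode or die "open failed: $!";
print {$out1} decode_utf8($result);
close $out1;

# Option 2: write the octets untouched through a :raw layer.
my $via_raw = '';
open my $out2, '>:raw', \$via_raw or die "open failed: $!";
print {$out2} $result;
close $out2;

print $via_decode eq $via_raw ? "identical output\n" : "outputs differ\n";
```

Both routes should leave the same octets in the output; which one fits better depends on whether the rest of the program wants character strings or raw bytes.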

-Dom

--
| Semantico: creators of major online resources |
| URL: http://www.semantico.com/ |
| Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232 |
| Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |

Re: HTML::Parser modifies unicode characters

on 13.09.2004 12:05:52 by kaminsky

* Dominic Mitchell [13/09/04 12:05]:
> Hmmm, it's a known problem:
>
> http://search.cpan.org/~gaas/HTML-Parser-3.36/Parser.pm#BUGS

Thanks. I must say, though, that the explanation there is quite vague. I
don't see myself deducing your solution from this statement.

>
> It doesn't look unsolvable, but it's slightly beyond my XS skills.
> The key is indicating the character encoding of what you're parsing,
> but that's sometimes difficult to determine in advance (think HTML
> meta tags).

I know nothing about XS, unfortunately, but the way I imagine it is that
at some point, HTML::Parser calls the method given by text_h, passing
the text to it. So instead of just passing the text, I suggest that it
should pass decode_utf8 applied to the text. Alternatively, call a fixed
(usual perl) sub 'foo', giving it the value of text_h and the text, and
foo will apply decode_utf8 to the text and then pass the result to
text_h.
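
The wrapper idea can be sketched in plain Perl, outside the parser. The name decoding_handler() is hypothetical, and the sketch assumes that any argument arriving without the UTF-8 flag is UTF-8 encoded, which will not hold for every document.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8 decode_utf8);

# Hypothetical helper: wrap a text handler so that it always receives
# decoded text. Assumes flag-less arguments are UTF-8 octets.
sub decoding_handler {
    my ($handler) = @_;
    return sub {
        my ($text, @rest) = @_;
        $text = decode_utf8($text) unless utf8::is_utf8($text);
        return $handler->($text, @rest);
    };
}

my @seen;
my $wrapped = decoding_handler(sub { push @seen, shift });

$wrapped->(encode_utf8("zespołów"));   # octets, as described in this thread
$wrapped->("zespołów");                # already characters, passed through unchanged

print scalar(@seen), " chunks, lengths: ", join(", ", map { length } @seen), "\n";
```

With HTML::Parser the wrapped sub would go where the callback goes in the original script, e.g. text_h => [decoding_handler(\&my_text), 'text'], where my_text is a placeholder name. This only helps when the handler is under your control, which is not the case when the parser is driven by HTML::TreeBuilder.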
>
> As to how to fix it via HTML::FormatText, I'm not sure. You'd need to
> read through the code to find out what it's doing and fix at an
> appropriate point.

I did it. It is in fact in HTML::TreeBuilder. The thing is that I'm
giving this code to people, so now I need to tell people to do this
change as well (and they might not have the right permission, might not
know perl, may have a different version of HTML::TreeBuilder ...)

> But perhaps there is another way. Instead of writing out to a file,
> can you write to an in-memory string? If so, then that string would be
> in UTF-8-without-the-UTF-8 flag set. So you could fix that by doing
> "decode_utf8()" over that string before writing it to a file. Or
> simply write that file out without any encoding which would do no
> transformation of the UTF-8 bytes.

In the real life example I'm not writing to a file at all, I just did it
in the example to make it easy to verify. But the usage is hidden inside
HTML::FormatText, which gives me a text formatting of the whole html
page. And if I try to use decode_utf8 on this result, I get other
gibberish (presumably because that result already is a perl string).

Thanks for the help.
Moshe




Re: HTML::Parser modifies unicode characters

on 13.09.2004 12:52:16 by Dom

Moshe Kaminsky wrote:
> Thanks. I must say, though, that the explanation there is quite vague. I
> don't see myself deducing your solution from this statement.

It's more just guesswork, based on experience with Perl's Unicode. Most
problems come down to something or other losing the UTF-8 flag on a
scalar and are solved with the Encode module. Encode::is_utf8() is a
handy tool for checking whether this is happening.
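
A small sketch of that kind of check, using the built-in utf8::is_utf8() (available since perl 5.8.1). The two strings are illustrative stand-ins, not output captured from HTML::Parser.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode_utf8);

my $chars = "zespołów";            # character string from a literal under "use utf8"
my $bytes = encode_utf8($chars);   # the same text as UTF-8 octets

# Report the flag state and length of each string.
for my $pair ([chars => $chars], [bytes => $bytes]) {
    my ($name, $value) = @$pair;
    printf "%-5s flag=%s length=%d\n",
        $name, (utf8::is_utf8($value) ? 'on' : 'off'), length($value);
}
```

The flag only describes how perl stores the scalar internally, so a check like this is a debugging aid for spotting where text stops being a character string, not a way to detect the encoding of arbitrary data.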

>>It doesn't look unsolvable, but it's slightly beyond my XS skills.
>>The key is indicating the character encoding of what you're parsing,
>>but that's sometimes difficult to determine in advance (think HTML
>>meta tags).
>
>
> I know nothing about XS, unfortunately, but the way I imagine it is that
> at some point, HTML::Parser calls the method given by text_h, passing
> the text to it. So instead of just passing the text, I suggest that it
> should pass decode_utf8 applied to the text. Alternatively, call a fixed
> (usual perl) sub 'foo', giving it the value of text_h and the text, and
> foo will apply decode_utf8 to the text and then pass the result to
> text_h.

The trouble is that there's no guarantee that in the general case, the
input will always be UTF-8. At some point in all this, the input
character encoding needs to be specified. Only from that can the
appropriate action be taken.

-Dom


Re: HTML::Parser modifies unicode characters

on 13.09.2004 13:15:28 by kaminsky

* Dominic Mitchell [13/09/04 14:01]:
> [snip]
> >I know nothing about XS, unfortunately, but the way I imagine it is
> >that at some point, HTML::Parser calls the method given by text_h,
> >passing the text to it. So instead of just passing the text, I
> >suggest that it should pass decode_utf8 applied to the text.
> >Alternatively, call a fixed (usual perl) sub 'foo', giving it the
> >value of text_h and the text, and foo will apply decode_utf8 to the
> >text and then pass the result to text_h.
>
> The trouble is that there's no guarantee that in the general case, the
> input will always be UTF-8. At some point in all this, the input
> character encoding needs to be specified. Only from that can the
> appropriate action be taken.
>

Well, my opinion, at least, is that HTML::Parser should insist on having
(in the terminology of the Encode docs) a perl string as input, rather
than octets. If your html happens to be a sequence of octets in some
encoding, you can convert it to a perl string using Encode prior to
passing it to HTML::Parser.
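
The conversion step can be sketched as follows. The ISO-8859-2 encoding and the sample markup are assumptions chosen for illustration; in practice the encoding has to come from somewhere (HTTP headers, a meta tag, or prior knowledge).

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode);

# Octets as they might arrive from disk or the network, here in ISO-8859-2
# (0xB3 is "ł" and 0xF3 is "ó" in that encoding).
my $octets = "<p>zespo\xB3\xF3w</p>";

# Convert to a Perl character string before any parsing takes place.
my $html = decode('iso-8859-2', $octets);

print "decoded correctly\n" if $html eq "<p>zespołów</p>";
```

Whether the parser then hands character strings back to its handlers depends on the HTML::Parser version; this thread reports that 3.36 does not.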

Thanks,
Moshe

