HTML::Tree, Unicode, and utf-8

on 17.11.2005 19:55:04 by metaperl

I would like to know how to place Unicode character sequences in an HTML
file whose charset is utf-8. The plain perl program below works fine for
this purpose, but I don't know what to do to get the HTML::TreeBuilder
version to work.

Also: I am not sure if this will remain the official support channel for
HTML::Tree now that it has changed hands, so I am cc'ing the new maintainer
as well.

# Working Program

use strict;
#use utf8;

my $string = "m\x{c3}\x{b8}\x{c3}\x{b8}se";

open O, '>moose.html' or die $!;

print O <<"EOHTML";





$string


EOHTML

# Fails to preserve unicode characters

use strict;
use HTML::TreeBuilder;


my $string = "m\x{c3}\x{b8}\x{c3}\x{b8}se";

open O, '>tbmoose.html' or die $!;

my $tree = HTML::TreeBuilder->new_from_content(<<"EOHTML");








EOHTML

my $body = $tree->look_down('_tag' => 'body');
$body->push_content($string);

print O $tree->as_HTML;

Re: HTML::Tree, Unicode, and utf-8

on 17.11.2005 20:08:34 by metaperl

I love it when I answer my own questions 5 minutes later. I had been
sweating over this all night into this morning. Now I finally figured it
out: create a super-literal (to use Sean Burke's terminology):

use strict;
use HTML::TreeBuilder;

use utf8;


my $string = "m\x{c3}\x{b8}\x{c3}\x{b8}se";
my $outfile = 'treebuild3.html';

open O, ">$outfile" or die $!;


my $tree = HTML::TreeBuilder->new_from_content(<<"EOHTML");








EOHTML

# THE WINNING LINE!
my $literal = HTML::Element->new('~literal', text => $string);

my $body = $tree->look_down('_tag' => 'body');
$body->push_content($literal);

print O $tree->as_HTML;


On 11/17/05, Terrence Brannon wrote:
>
> I would like to know how to place Unicode character sequences in an HTML
> file whose charset is utf-8. The plain perl program below works fine for
> this purpose, but I don't know what to do to get the HTML::TreeBuilder
> version to work.

Re: HTML::Tree, Unicode, and utf-8

on 17.11.2005 20:52:04 by kaminsky

I used to have problems with this, which appeared to have been solved in
a later version of HTML::Tree. Try using encode_utf8($string) (from the
Encode module) instead of $string.

Moshe

* Terrence Brannon [17/11/05 21:06]:
> I would like to know how to place Unicode character sequences in an HTML
> file whose charset is utf-8. The plain perl program below works fine for
> this purpose, but I don't know what to do to get the HTML::TreeBuilder
> version to work.

Re: HTML::Tree, Unicode, and utf-8

on 18.11.2005 10:28:34 by paul.bijnens

Terrence Brannon wrote:
> I would like to know how to place Unicode character sequences in an HTML
> file whose charset is utf-8. The plain perl program below works fine for
> this purpose, but I don't know what to do to get the HTML::TreeBuilder
> version to work.
>
> Also: I am not sure if this will remain the official support channel for
> HTML::Tree now that it has changed hands, so I am cc'ing the new maintainer
> as well.
>
> # Working Program
>
> use strict;
> #use utf8;
>
> my $string = "m\x{c3}\x{b8}\x{c3}\x{b8}se";
>
> open O, '>moose.html' or die $!;
>
> print O <<"EOHTML";
>
>
>
>
>
> $string
>
>
> EOHTML

This works because perl treats every string here as a plain byte
sequence; there is no notion of utf8 anywhere in this program. You just
constructed a file with the correct byte sequence in it. The
interpretation of that byte sequence as utf8 happens in the browser.
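
A minimal sketch of that point, using only the core Encode module (the
variable names are illustrative and not taken from the programs above):
the literal holds seven bytes, and it only becomes five characters once
something decodes it as UTF-8.

```perl
use strict;
use warnings;
use Encode qw(decode);

# The same literal as in the original program: seven raw bytes.
my $bytes = "m\x{c3}\x{b8}\x{c3}\x{b8}se";
my $byte_count = length($bytes);

# Only after decoding does perl see the two "ø" characters.
my $chars = decode('UTF-8', $bytes);
my $char_count = length($chars);

printf "bytes: %d, characters: %d\n", $byte_count, $char_count;
# prints "bytes: 7, characters: 5"
```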


>
> # Fails to preserve unicode characters
>
> use strict;
> use HTML::TreeBuilder;
>
>
> my $string = "m\x{c3}\x{b8}\x{c3}\x{b8}se";
>
> open O, '>tbmoose.html' or die $!;
>
> my $tree = HTML::TreeBuilder->new_from_content(<<"EOHTML");
>
>
>
>
>
>
>
>
> EOHTML
>
> my $body = $tree->look_down('_tag' => 'body');
> $body->push_content($string);
>
> print O $tree->as_HTML;
>

However, TreeBuilder does have knowledge of encodings, and expects the
strings you insert to be in the charset indicated, UTF-8 in your case.
The string that you supplied, however, does NOT have the utf8 flag set,
so TreeBuilder needs to upgrade the simple byte string to a Unicode
string (implicit upgrading assumes the bytes were in ISO-8859-1, i.e.
Latin-1).
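
To see what the Latin-1 assumption does to the data, here is a small
sketch using only the core Encode module. It is an illustration of the
assumption, not a trace of what HTML::TreeBuilder does internally, and
the variable names are made up for the example. Encoding the undecoded
byte string treats each of the four high bytes as its own Latin-1
character, so the result grows from 7 to 11 bytes.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

my $bytes = "m\x{c3}\x{b8}\x{c3}\x{b8}se";

# Wrong: each byte is taken as a Latin-1 character and encoded again.
my $double = encode('UTF-8', $bytes);

# Right: decode first, so perl knows there are two "ø" characters.
my $round_trip = encode('UTF-8', decode('UTF-8', $bytes));

printf "double-encoded: %d bytes\n", length($double);      # 11
printf "round trip:     %d bytes\n", length($round_trip);  # 7
print  "round trip matches the original bytes\n" if $round_trip eq $bytes;
```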

Setting the utf8 flag on $string is enough to make the program work.
Replace "my $string =..." with:

use Encode;
my $string = decode_utf8("m\x{c3}\x{b8}\x{c3}\x{b8}se");

or just force the utf8 flag (deprecated):

use Encode;
my $string = "m\x{c3}\x{b8}\x{c3}\x{b8}se";
Encode::_utf8_on($string);

You'll need to tell perl what encoding a string is in. If you don't,
perl assumes it is a byte string. Implicit upgrading from byte strings
to Unicode strings assumes that they were encoded in ISO 8859-1 (Latin-1).
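
Putting the decode step together with TreeBuilder might look like the
sketch below. Several things here are assumptions rather than part of
the original programs: the HTML skeleton (the markup in the posted
programs did not survive), the choice to pass '<>&' to as_HTML so that
only those three characters are escaped as entities, and the explicit
encode step on output.

```perl
use strict;
use warnings;
use Encode qw(decode encode);
use HTML::TreeBuilder;

# Assumed page skeleton; the original markup is not available.
my $tree = HTML::TreeBuilder->new_from_content(<<'EOHTML');
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
</body>
</html>
EOHTML

# Decode the bytes so perl holds real characters, then insert them.
my $text = decode('UTF-8', "m\x{c3}\x{b8}\x{c3}\x{b8}se");
my $body = $tree->look_down(_tag => 'body');
$body->push_content($text);

# Escape only <, > and &, then encode the whole page once on the way out.
my $html   = $tree->as_HTML('<>&');
my $octets = encode('UTF-8', $html);
$tree->delete;

print $octets;
```

Calling as_HTML with no arguments should also give a page that displays
correctly, since the decoded characters are then written as entities
rather than raw UTF-8.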

In a world without character sets (like perl before Unicode), all
strings are just byte sequences, and perl handles those very well,
as it has from the early days. When you need to handle strings
as text, with accented letters etc., you need to tell perl what
encoding a byte string is in, and then perl internally upgrades the
string to a more complex structure suitable for handling text instead
of byte strings. That text structure happens to represent text in utf8,
and in reality it is not very complicated and still efficient. But it
is different from a simple byte string.
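
A short sketch of the two internal forms, using the built-in utf8::
functions (available without "use utf8"); the variable names are
illustrative. The same five-character text can be stored either way,
and perl treats both as equal, which is why the flag alone tells you
nothing about whether a string has been decoded.

```perl
use strict;
use warnings;

# Five characters; "ø" is written as the single code point U+00F8.
my $native   = "m\x{f8}\x{f8}se";
my $upgraded = $native;
utf8::upgrade($upgraded);    # switch to the internal utf8 representation

my $same_text     = $native eq $upgraded ? 1 : 0;
my $native_flag   = utf8::is_utf8($native)   ? 1 : 0;
my $upgraded_flag = utf8::is_utf8($upgraded) ? 1 : 0;

print  "same text: $same_text\n";                                  # 1
print  "utf8 flag: native=$native_flag upgraded=$upgraded_flag\n"; # 0 and 1
printf "length:    %d and %d\n", length($native), length($upgraded); # 5 and 5
```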

Don't assume the utf8 flag is "on" just because a string is valid utf8.
That's why I used decode_utf8() to decode the byte sequence as utf8.
I've seen people make the same error when they write a utf8 string
to a file, later read it back, and wonder why perl doesn't
recognize it as utf8. You still need to tell perl it is utf8,
e.g. by setting "binmode(FILE, ':utf8')".
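
The file round trip can be sketched as follows. This example swaps in
the stricter ':encoding(UTF-8)' layer in place of ':utf8', and uses
File::Temp plus lexical filehandles purely for the demonstration; none
of those names come from the programs in this thread.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

my $text = "m\x{f8}\x{f8}se";    # five characters

# Write through an encoding layer: perl emits UTF-8 bytes.
my ($out, $path) = tempfile(UNLINK => 1);
binmode($out, ':encoding(UTF-8)');
print {$out} $text;
close $out or die "close failed: $!";

# Read back without a layer: perl hands over the raw bytes.
open my $raw, '<:raw', $path or die "cannot read $path: $!";
my $as_bytes = do { local $/; <$raw> };
close $raw;

# Read back with the layer: perl decodes to characters again.
open my $in, '<:encoding(UTF-8)', $path or die "cannot read $path: $!";
my $as_text = do { local $/; <$in> };
close $in;

printf "without layer: %d, with layer: %d\n",
    length($as_bytes), length($as_text);
# prints "without layer: 7, with layer: 5"
```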

I've needed to read the "perlunicode" man page several times to
understand this (or I'm just too stupid to understand it while reading
the documentation the first time).


--
Paul Bijnens, Xplanation Tel +32 16 397.511
Technologielaan 21 bus 2, B-3001 Leuven, BELGIUM Fax +32 16 397.512
http://www.xplanation.com/ email: Paul.Bijnens@xplanation.com
***********************************************************************
* I think I've got the hang of it now: exit, ^D, ^C, ^\, ^Z, ^Q, ^^, *
* F6, quit, ZZ, :q, :q!, M-Z, ^X^C, logoff, logout, close, bye, /bye, *
* stop, end, F3, ~., ^]c, +++ ATH, disconnect, halt, abort, hangup, *
* PF4, F20, ^X^X, :D::D, KJOB, F14-f-e, F8-e, kill -1 $$, shutdown, *
* init 0, kill -9 1, Alt-F4, Ctrl-Alt-Del, AltGr-NumLock, Stop-A, ... *
* ... "Are you sure?" ... YES ... Phew ... I'm out *
***********************************************************************

Re: HTML::Tree, Unicode, and utf-8

on 19.11.2005 01:42:52 by Andy

>
> Also: I am not sure if this will remain the official support
> channel for HTML::Tree now that it has changed hands, so I am
> cc'ing the new maintainer as well.

Might as well stay here.

--
Andy Lester => andy@petdance.com => www.petdance.com => AIM:petdance