HTML-Parser: storing into a DB words with special chars

HTML-Parser: storing into a DB words with special chars

on 21.09.2005 11:43:01 by tarmstrong

Hi.

Using Perl v5.8.5 + HTML-Parser v3.45 + MySQL v4.1.9 on Linux FC2.

I'm trying to parse a UTF-8 document (declared as <?xml ... encoding="UTF-8"?>),
and store it into a MySQL database (Collation: utf8_bin).

The document contains special characters ('música española' for instance), and
after storing it into the DB, I get this word: "mÃºsica espaÃ±ola"

I tried with:
----
utf8::decode($document);
my $p = HTML::TokeParser->new( \$document );
----

But it doesn't work. How can I store words with special characters? Regards.
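
For reference, a minimal sketch of what utf8::decode does to a string, assuming the document arrives as raw UTF-8 bytes. The sample string and the variable names other than $document are illustrative assumptions, not part of the original program. It shows the byte count before decoding, the character count afterwards, and the bytes produced when the text is encoded again for output.

```perl
use strict;
use warnings;

# Stand-in for the raw bytes read from the UTF-8 document (hypothetical sample).
my $document = "m\xC3\xBAsica espa\xC3\xB1ola";

# Before decoding, length counts bytes.
my $byte_length = length $document;

# utf8::decode works in place; it returns true here because the
# bytes are well-formed UTF-8.
my $ok = utf8::decode($document);

# After decoding, length counts characters.
my $char_length = length $document;

# Anything that expects UTF-8 bytes needs the text encoded again.
my $out = $document;
utf8::encode($out);

# A hex dump of the outgoing bytes makes double encoding easy to spot.
my $hex = join ' ', map { sprintf '%02x', ord } split //, $out;

print "bytes=$byte_length chars=$char_length valid=", ($ok ? 1 : 0), "\n";
print "$hex\n";
```

If the hex dump shows c3 ba for the ú and c3 b1 for the ñ, the Perl side is emitting plain UTF-8, and the place to look next is probably how the bytes are handed to and read back from the database.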

Re: HTML-Parser: storing into a DB words with special chars

on 21.09.2005 12:51:13 by derhoermi

* thomas Armstrong wrote:
>The document contains special characters ('música española' for instance), and
>after storing it into the DB, I get this word: "mÃºsica espaÃ±ola"

How do you "get" it? You need to ensure that the database software
supports Unicode, that you properly store the data, and that you
properly retrieve and view the data. The string above is UTF-8 but
interpreted as if it were in some other encoding (Windows-1252 or
ISO-8859-1, for example). That's not an HTML::Parser issue.
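
To illustrate that kind of misreading, here is a minimal Perl sketch using the core Encode module. The sample string and the variable names are illustrative assumptions and are not taken from the original program. It only shows that UTF-8 bytes read as ISO-8859-1 yield two characters for each accented letter.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# The intended text, written with escapes so this file needs no "use utf8".
my $text = "m\x{FA}sica espa\x{F1}ola";

# The bytes that are stored when the text is written out as UTF-8.
my $utf8_bytes = encode('UTF-8', $text);

# A reader that assumes ISO-8859-1 turns every byte into one character.
my $misread = decode('ISO-8859-1', $utf8_bytes);

# Encode again before printing so the terminal receives well-formed UTF-8.
print encode('UTF-8', $misread), "\n";
```

The misread string is 17 characters long instead of 15, because each two-byte UTF-8 sequence has been split into two separate characters.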
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Re: HTML-Parser: storing into a DB words with special chars

on 21.09.2005 17:00:18 by tarmstrong

Hi Bjoern.

Thank you very much for your answer. I have no idea how MySQL stores
data, but if I insert it by hand (via phpMyAdmin) it works. And if I print
the Perl program's results to a TXT file, it shows the right characters.

I'm going to post it to a MySQL mailing list.

Regards.
