Encoding

Encoding

am 31.08.2006 18:08:21 von zeman

Hi,
I wonder if anyone can help me.

I am using
- Debian Linux
- Perl 5.8.8
- DBI (I do not know how to figure out its version)
- MySQL 5.0.22-Debian_3-log

I want to store and handle data in UTF-8 but so far I have not been able
to force Perl/DBI to do so.

I have created a table using

my $sql = "CREATE TABLE $tbl (".join(", ", @columns).") CHARACTER
SET utf8 COLLATE utf8_czech_ci;";
$dbh->do($sql);

I feeded the table with data using

my $list_of_columns = join(", ", @names);
my $list_of_values = join(", ", map{"_utf8'$record->{$_}'"}(@names));
my $sql = "INSERT INTO $tbl ($list_of_columns) VALUES
($list_of_values);";
$dbh->do($sql);

I have looked into the database using phpMyAdmin 2.8.2-Debian-0.1 and it
really looked like the data were stored in correct UTF-8.

However, when I retrieve the data from Perl/DBI, something in the chain
(MySQL? the driver? DBI?) decides that another encoding (probably,
Latin1) would be better for me. It "converts" the strings from UTF-8 to
that encoding, which means, at the time the data arrives in my Perl
code, all the non-Latin1 characters have already been irrecoverably
converted to question marks. I would be happy to decode the data myself
but there is nothing I can do with the question marks.

I am using the following code to retrieve the data:

my $sql = "SELECT kod, hry.nazev FROM hry INNER JOIN prodej ON
hry.kod = prodej.kod_hry GROUP BY kod, hry.nazev";
my $sqlobj = $dbh->prepare($sql);
$sqlobj->execute();
while(my ($kod, $nazev) = $sqlobj->fetchrow_array())
{
...
}

So far, the only workaround I have, is not to tell the DBI the data is
UTF-8 when I am inserting it (i.e., drop the "_utf8" part before the
single quote), and use Encode; decode("utf8", ...) on anything I fetch
from the database. This way, the database never knows the data was a
UTF-8 text, treats the bytes as Latin1 characters and returns them
undisturbed. However, I cannot access the data using phpMyAdmin (unless
I en/decode UTF in my brain), the string lengths do not reflect the
reality etc.

Is there a better way to do it? I think there must be some small stupid
locale-like setting telling the machine that I am a UTF guy. But the
settings I was able to come up with did not help and I actually have no
idea which part of the MySQL-driver-DBI-Perl chain is responsible.

Any hints are welcome.
Thanks

Dan

RE: Encoding

am 04.09.2006 13:32:00 von Martin.Evans

On 31-Aug-2006 Daniel Zeman wrote:
> Hi,
> I wonder if anyone can help me.
>
> I am using
> - Debian Linux
> - Perl 5.8.8
> - DBI (I do not know how to figure out its version)

perl -MDBI -e 'print "$DBI::VERSION\n";'

> - MySQL 5.0.22-Debian_3-log

What DBD::mysql version are you using?

perl -M'DBD::mysql' -e 'print "$DBD::mysql::VERSION\n";'

> I want to store and handle data in UTF-8 but so far I have not been able
> to force Perl/DBI to do so.
>
> I have created a table using
>
> my $sql = "CREATE TABLE $tbl (".join(", ", @columns).") CHARACTER
> SET utf8 COLLATE utf8_czech_ci;";
> $dbh->do($sql);
>
> I feeded the table with data using
>
> my $list_of_columns = join(", ", @names);
> my $list_of_values = join(", ", map{"_utf8'$record->{$_}'"}(@names));
> my $sql = "INSERT INTO $tbl ($list_of_columns) VALUES
> ($list_of_values);";
> $dbh->do($sql);

You would be better using bound parameters as then only the data can be utf-8.

> I have looked into the database using phpMyAdmin 2.8.2-Debian-0.1 and it
> really looked like the data were stored in correct UTF-8.
>
> However, when I retrieve the data from Perl/DBI, something in the chain
> (MySQL? the driver? DBI?) decides that another encoding (probably,
> Latin1) would be better for me. It "converts" the strings from UTF-8 to
> that encoding, which means, at the time the data arrives in my Perl
> code, all the non-Latin1 characters have already been irrecoverably
> converted to question marks. I would be happy to decode the data myself
> but there is nothing I can do with the question marks.
>
> I am using the following code to retrieve the data:
>
> my $sql = "SELECT kod, hry.nazev FROM hry INNER JOIN prodej ON
> hry.kod = prodej.kod_hry GROUP BY kod, hry.nazev";
> my $sqlobj = $dbh->prepare($sql);
> $sqlobj->execute();
> while(my ($kod, $nazev) = $sqlobj->fetchrow_array())
> {
> ...
> }
>
> So far, the only workaround I have, is not to tell the DBI the data is
> UTF-8 when I am inserting it (i.e., drop the "_utf8" part before the
> single quote), and use Encode; decode("utf8", ...) on anything I fetch
> from the database. This way, the database never knows the data was a
> UTF-8 text, treats the bytes as Latin1 characters and returns them
> undisturbed. However, I cannot access the data using phpMyAdmin (unless
> I en/decode UTF in my brain), the string lengths do not reflect the
> reality etc.
>
> Is there a better way to do it? I think there must be some small stupid
> locale-like setting telling the machine that I am a UTF guy. But the
> settings I was able to come up with did not help and I actually have no
> idea which part of the MySQL-driver-DBI-Perl chain is responsible.
>
> Any hints are welcome.
> Thanks
>
> Dan

I posted a small patch (in dbi-dev a couple of months ago) to a recent(ish)
DBD::mysql which allows you to retrieve utf-8 data marked as utf-8 in perl. If
is not a complete solution but works for most things.

See
http://www.nntp.perl.org/group/perl.dbi.dev/4548

It may have been included in the 3.0006_1 development release of DBD::mysql.

You also need to look up "set character set utf8" and "set names utf8" in mysql
documentation.

Martin
--
Martin J. Evans
Easysoft Ltd, UK
http://www.easysoft.com

RE: Encoding

am 04.09.2006 13:53:40 von h-taguchi

> perl -MDBI -e 'print "$DBI::VERSION\n";'

Less typed:

perl -MDBI -le 'print DBI->VERSION'

Regards,

Re: Encoding

am 04.09.2006 20:13:21 von zeman

Martin J. Evans napsal(a):
> On 31-Aug-2006 Daniel Zeman wrote:
>> Hi,
>> I wonder if anyone can help me.
>>
>> I am using
>> - Debian Linux
>> - Perl 5.8.8
>> - DBI (I do not know how to figure out its version)
>
> perl -MDBI -e 'print "$DBI::VERSION\n";'

1.51

>
>> - MySQL 5.0.22-Debian_3-log
>
> What DBD::mysql version are you using?
>
> perl -M'DBD::mysql' -e 'print "$DBD::mysql::VERSION\n";'

3.0004

>
>> I want to store and handle data in UTF-8 but so far I have not been
able
>> to force Perl/DBI to do so.
>>
>> I have created a table using
>>
>> my $sql = "CREATE TABLE $tbl (".join(", ", @columns).") CHARACTER
>> SET utf8 COLLATE utf8_czech_ci;";
>> $dbh->do($sql);
>>
>> I feeded the table with data using
>>
>> my $list_of_columns = join(", ", @names);
>> my $list_of_values = join(", ",
map{"_utf8'$record->{$_}'"}(@names));
>> my $sql = "INSERT INTO $tbl ($list_of_columns) VALUES
>> ($list_of_values);";
>> $dbh->do($sql);
>
> You would be better using bound parameters as then only the data can
be utf-8.
>
>> I have looked into the database using phpMyAdmin 2.8.2-Debian-0.1
and it
>> really looked like the data were stored in correct UTF-8.
>>
>> However, when I retrieve the data from Perl/DBI, something in the chain
>> (MySQL? the driver? DBI?) decides that another encoding (probably,
>> Latin1) would be better for me. It "converts" the strings from UTF-8 to
>> that encoding, which means, at the time the data arrives in my Perl
>> code, all the non-Latin1 characters have already been irrecoverably
>> converted to question marks. I would be happy to decode the data myself
>> but there is nothing I can do with the question marks.
>>
>> I am using the following code to retrieve the data:
>>
>> my $sql = "SELECT kod, hry.nazev FROM hry INNER JOIN prodej ON
>> hry.kod = prodej.kod_hry GROUP BY kod, hry.nazev";
>> my $sqlobj = $dbh->prepare($sql);
>> $sqlobj->execute();
>> while(my ($kod, $nazev) = $sqlobj->fetchrow_array())
>> {
>> ...
>> }
>>
>> So far, the only workaround I have, is not to tell the DBI the data is
>> UTF-8 when I am inserting it (i.e., drop the "_utf8" part before the
>> single quote), and use Encode; decode("utf8", ...) on anything I fetch
>> from the database. This way, the database never knows the data was a
>> UTF-8 text, treats the bytes as Latin1 characters and returns them
>> undisturbed. However, I cannot access the data using phpMyAdmin (unless
>> I en/decode UTF in my brain), the string lengths do not reflect the
>> reality etc.
>>
>> Is there a better way to do it? I think there must be some small stupid
>> locale-like setting telling the machine that I am a UTF guy. But the
>> settings I was able to come up with did not help and I actually have no
>> idea which part of the MySQL-driver-DBI-Perl chain is responsible.
>>
>> Any hints are welcome.
>> Thanks
>>
>> Dan
>
> I posted a small patch (in dbi-dev a couple of months ago) to a
recent(ish)
> DBD::mysql which allows you to retrieve utf-8 data marked as utf-8 in
perl. If
> is not a complete solution but works for most things.
>
> See
> http://www.nntp.perl.org/group/perl.dbi.dev/4548
>
> It may have been included in the 3.0006_1 development release of
DBD::mysql.
>
> You also need to look up "set character set utf8" and "set names
utf8" in mysql
> documentation.

This is it! Thanks a lot! Just doing

$dbh->prepare("SET NAMES 'utf8'")->execute();

after connecting solved my problem. It seems that I am getting the
strings with the utf8 flag on so hopefully I can live without your patch
(or someone has already patched the system I am on) - but I am going
to store the link for future reference. Thanks!

Dan

Re: Encoding

am 04.09.2006 20:21:13 von Martin.Evans

On Mon, 2006-09-04 at 14:13 -0400, Daniel Zeman wrote:
> Martin J. Evans napsal(a):
> > On 31-Aug-2006 Daniel Zeman wrote:
> >> Hi,
> >> I wonder if anyone can help me.
> >>
> >> I am using
> >> - Debian Linux
> >> - Perl 5.8.8
> >> - DBI (I do not know how to figure out its version)
> >
> > perl -MDBI -e 'print "$DBI::VERSION\n";'
>
> 1.51
>
> >
> >> - MySQL 5.0.22-Debian_3-log
> >
> > What DBD::mysql version are you using?
> >
> > perl -M'DBD::mysql' -e 'print "$DBD::mysql::VERSION\n";'
>
> 3.0004
>
> >
> >> I want to store and handle data in UTF-8 but so far I have not been
> able
> >> to force Perl/DBI to do so.
> >>
> >> I have created a table using
> >>
> >> my $sql = "CREATE TABLE $tbl (".join(", ", @columns).") CHARACTER
> >> SET utf8 COLLATE utf8_czech_ci;";
> >> $dbh->do($sql);
> >>
> >> I feeded the table with data using
> >>
> >> my $list_of_columns = join(", ", @names);
> >> my $list_of_values = join(", ",
> map{"_utf8'$record->{$_}'"}(@names));
> >> my $sql = "INSERT INTO $tbl ($list_of_columns) VALUES
> >> ($list_of_values);";
> >> $dbh->do($sql);
> >
> > You would be better using bound parameters as then only the data can
> be utf-8.
> >
> >> I have looked into the database using phpMyAdmin 2.8.2-Debian-0.1
> and it
> >> really looked like the data were stored in correct UTF-8.
> >>
> >> However, when I retrieve the data from Perl/DBI, something in the chain
> >> (MySQL? the driver? DBI?) decides that another encoding (probably,
> >> Latin1) would be better for me. It "converts" the strings from UTF-8 to
> >> that encoding, which means, at the time the data arrives in my Perl
> >> code, all the non-Latin1 characters have already been irrecoverably
> >> converted to question marks. I would be happy to decode the data myself
> >> but there is nothing I can do with the question marks.
> >>
> >> I am using the following code to retrieve the data:
> >>
> >> my $sql = "SELECT kod, hry.nazev FROM hry INNER JOIN prodej ON
> >> hry.kod = prodej.kod_hry GROUP BY kod, hry.nazev";
> >> my $sqlobj = $dbh->prepare($sql);
> >> $sqlobj->execute();
> >> while(my ($kod, $nazev) = $sqlobj->fetchrow_array())
> >> {
> >> ...
> >> }
> >>
> >> So far, the only workaround I have, is not to tell the DBI the data is
> >> UTF-8 when I am inserting it (i.e., drop the "_utf8" part before the
> >> single quote), and use Encode; decode("utf8", ...) on anything I fetch
> >> from the database. This way, the database never knows the data was a
> >> UTF-8 text, treats the bytes as Latin1 characters and returns them
> >> undisturbed. However, I cannot access the data using phpMyAdmin (unless
> >> I en/decode UTF in my brain), the string lengths do not reflect the
> >> reality etc.
> >>
> >> Is there a better way to do it? I think there must be some small stupid
> >> locale-like setting telling the machine that I am a UTF guy. But the
> >> settings I was able to come up with did not help and I actually have no
> >> idea which part of the MySQL-driver-DBI-Perl chain is responsible.
> >>
> >> Any hints are welcome.
> >> Thanks
> >>
> >> Dan
> >
> > I posted a small patch (in dbi-dev a couple of months ago) to a
> recent(ish)
> > DBD::mysql which allows you to retrieve utf-8 data marked as utf-8 in
> perl. If
> > is not a complete solution but works for most things.
> >
> > See
> > http://www.nntp.perl.org/group/perl.dbi.dev/4548
> >
> > It may have been included in the 3.0006_1 development release of
> DBD::mysql.
> >
> > You also need to look up "set character set utf8" and "set names
> utf8" in mysql
> > documentation.
>
> This is it! Thanks a lot! Just doing
>
> $dbh->prepare("SET NAMES 'utf8'")->execute();
>
> after connecting solved my problem. It seems that I am getting the
> strings with the utf8 flag on so hopefully I can live without your patch
> (or someone has already patched the system I am on) - but I am going
> to store the link for future reference. Thanks!
>
> Dan
>
>
Glad to hear this sorted it. On reflection the patch I supplied was
against DBD::mysql 3.0003_1 so this might explain it.

Martin
--
Martin J. Evans
Easysoft Limited
http://www.easysoft.com