Re: Adding utf8 support to DBD::mysql

Re: Adding utf8 support to DBD::mysql

am 30.04.2006 22:45:57 von Martin.Evans

Patrick Galbraith wrote:
> Martin J. Evans wrote:
>
> Martin,
>
> Thanks much! This is dbdimp.c, right?
Yes

> I will add this tomorrow (not
> working today),

and I'm stricly not working tomorrow - bank holiday - but it is likely
I'll find time to do something.

> and test it out. If everything is a go, then I'll roll
> up a dev 3.0003_1 releaseAlso, I've mentioned giving you write access to
> SVN for DBD::mysql - do you want that? Also, what is your cpan username?

Argh, tend to be a submitter/commentor rather than a creator.
I've written DBIx::Log4perl but still have not submitted it
as I have not yet got a cpan username.

If you are happy giving me write access to DBD::mysql I'm happy to
contribute although I think you've got most of my stuff now. If
all it means is getting a cpan username and that is straight forward
I'll do that also.

The utf8 patch is very much a quick hack but it you were to submit
the other patches to dbd::mysql first I'd happily investigate/test
further. The reason I say this is that I'm finding it increasingly
difficult to submit patches on top of patches that are not comitted
yet.

Martin


> Kind regards,
>
> Patrick
>
>> Shamelessy stolen from Michael Kröll and against a patch on dbd::mysql
>> on subversion I've sent to Patrick which is not committed yet:
>>
>> @@ -2915,6 +2914,9 @@
>> if (dbis->debug >= 2)
>> PerlIO_printf(DBILOGFP, "st_fetch string data %s\n",
>> fbh->data);
>> sv_setpvn(sv, fbh->data, fbh->length);
>> + if (is_high_bit_set(fbh->data)) {
>> + SvUTF8_on(sv);
>> + }
>> break;
>> @@ -2968,6 +2970,10 @@
>> { --len; }
>> }
>> sv_setpvn(sv, col, len);
>> + if (is_high_bit_set(col)) {
>> + SvUTF8_on(sv);
>> + }
>> +
>> @@ -3881,3 +3901,11 @@
>> return
>> sv_2mortal(my_ulonglong2str(mysql_insert_id(&((imp_dbh_t*)im p_dbh)->mysql)));
>>
>> }
>> #endif
>> +
>> +int is_high_bit_set(val)
>> + char *val;
>> +{
>> + while (*val)
>> + if (*val++ & 0x80) return 1;
>> + return 0;
>> +}
>>
>>
>> would appear to make the previous code (below and without the Encode)
>> I posted
>> work fine.
>>
>> I hasten to add I've not tested this much (other than the posted test
>> code).
>>
>> What I'm personally interested in is whether retrieving utf data from
>> DBD::mysql and pushing it through JSON, JSON->objToJson works and I
>> cannot
>> easily test this until Tuesday.
>>
>> If Patrick can commit my previous changes I'll test this more.
>>
>> Martin
>> --
>> Martin J. Evans
>> Easysoft Ltd, UK
>> http://www.easysoft.com
>>
>>
>> On 30-Apr-2006 Martin J. Evans wrote:
>>
>>
>>> I hope I'm not muddying the waters but dbd::mysql UTF support for
>>> returned
>>> data (not metadata) seems to be nearly there. I need UTF8 support on
>>> at least inserted and returned results-sets although I'm less bothered
>>> by UTF8 table/column names. It would seem that if you define your
>>> table as
>>> using UTF8 then insertion is not a problem but retrieval is. The
>>> following code nearly works - it is just the setting of the utf8 flag on
>>> the returned data that is wrong:
>>>
>>> #!/usr/bin/perl -w
>>> use strict;
>>> use DBI qw(:utils);
>>> use charnames ':full';
>>> use Encode;
>>>
>>> print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
>>> print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
>>> my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}"; #
>>> smiley
>>> print join(" ", unpack("H*", $str)), "\n";
>>> print "length(str) = ", length($str), "\n";
>>> print "bytes::length(str) = ", bytes::length($str), "\n";
>>>
>>> print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
>>> print "data_string_desc: ", data_string_desc($str),"\n";
>>>
>>> open OUT, ">uni.out";
>>> binmode(OUT, ":utf8");
>>> print OUT "$str\n";
>>> # data written to uni.out is UTF8
>>>
>>> my $dbh = DBI->connect("dbi:mysql:test", "xxx", "xxx");
>>> # there are posts on dbi-user as to whether both or either of
>>> # the following should be set
>>> $dbh->do("set character set utf8");
>>> $dbh->do("set names utf8");
>>> $dbh->do("drop table if exists utf");
>>> $dbh->do("create table utf (a char(100)) default charset utf8");
>>> my $sth = $dbh->prepare("insert into utf values (?)");
>>> $sth->execute($str);
>>> $sth = $dbh->prepare("select * from utf");
>>> $sth->execute;
>>>
>>> my @row = $sth->fetchrow_array;
>>> print "data_string_desc (after fetch): ",
>>> data_string_desc($row[0]),"\n";
>>> # the following shows we'e got the right data back
>>> # but perl does not know it is utf8
>>> print join(" ", unpack("H*", $row[0])), "\n";
>>> # turning on utf8 causes the rignt uf8 sequence to be output
>>> # and hence sv_utf8_upgrade(sv) will probably work
>>> Encode::_utf8_on($row[0]);
>>> print "data_string_desc (after fetch): ",
>>> data_string_desc($row[0]),"\n";
>>>
>>> open OUT, ">utf.out";
>>> binmode (OUT, ":utf8");
>>> print OUT $row[0];
>>> close OUT;
>>> # data written to utf.out is not UTF8 unless is marked utf8
>>>
>>> produces:
>>>
>>> Is utf8::is_utf8 defined: 1
>>> Is utf8::valid defined: 1
>>> e298ba787878d790d8a7
>>> length(str) = 6
>>> bytes::length(str) = 10
>>> utf8::is_utf8 = 1
>>> data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
>>> data_string_desc (after fetch): UTF8 off, non-ASCII, 10 characters 10
>>> bytes
>>> e298ba787878d790d8a7
>>>
>>> with utf.out containing:
>>>
>>> <98>xxx<97><90><98>
>>>
>>> without that Encode::_utf8_on($row[0]);
>>>
>>> Michael Kröll (apr-10-2006)
>>> posted a change for DBD::db2 which seemed to sort this out for DB2
>>> so a similar change could be added to mysql.
>>>
>>> Hope this helps.
>>>
>>> Martin
>>> --
>>> Martin J. Evans
>>> Easysoft Ltd, UK
>>> http://www.easysoft.com
>>>
>>>
>>> On 24-Apr-2006 Tim Bunce wrote:
>>>
>>>
>>>> [I'm at the mysql conference and Patrick asked me about adding utf8
>>>> support to DBD::mysql. I said I'd look at the libmysql docs and give my
>>>> thoughts. I'm posting to dbi-dev since it may be of interest to others
>>>> interested in enhancing DBD::mysql and to other driver developers.
>>>> These are just random thoughts from a quick look at the docs.]
>>>>
>>>> The keys mysql docs seem to be
>>>> http://dev.mysql.com/doc/refman/4.1/en/charset-connection.ht ml
>>>>
>>>> The mysql api and client->server protocol doesn't support passing
>>>> characterset info to the server on a per-statement / per-bind value
>>>> basis.
>>>> (http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-state ment-datatypes.ht
>>>>
>>>> m
>>>> l)
>>>> So the sane way to send utf8 to the server is by setting the
>>>> 'connection
>>>> character set' to utf8 and then only sending utf8 (or its ASCII subset)
>>>> to the server on that connection.
>>>>
>>>> *** Fetching data:
>>>>
>>>> MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD
>>>> structure.
>>>> It's the "character set number for the field".
>>>>
>>>> So set the UTF8 flag based on that value. Something like:
>>>> (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>>>> I couldn't see any docs for the values of the charsetnr field.
>>>>
>>>> Also, would be good to enable perl code to access the charsetnr values:
>>>> $sth->{mysql_charsetnr}->[$i]
>>>>
>>>> *** Fetching Metadata:
>>>>
>>>> The above is a minimum. It doesn't address metadata like field names
>>>> ($sth->{NAME}) that might also be in utf8. For that the driver needs to
>>>> know if the 'connection character set' is currently utf8.
>>>>
>>>> (The docs mention mysql->charset but it's not clear if that's part of
>>>> the public API.)
>>>>
>>>> However it's detected, the code needs to end up doing:
>>>> (...connection charset is utf8...) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>>>> on the metadata.
>>>>
>>>>
>>>> *** SET NAMES '...'
>>>>
>>>> Intercept SET NAMES and call the mysql_set_character_set() API instead.
>>>> See http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-s et.html
>>>>
>>>>
>>>> *** Detecting Inconsistencies
>>>>
>>>> If the connection character set is _not_ utf8 but the application calls
>>>> the driver with data (or SQL statement) that has the UTF8 flag set,
>>>> then
>>>> it could issue a warning. In practice that may be to be too noisy for
>>>> people that done their own workarounds for utf8 support. If so then
>>>> they could be changes to level 1 trace messages.
>>>>
>>>> If the connection character set _is_ utf8, and the application calls
>>>> the driver with data (or SQL statement) that does _not_ have the UTF8
>>>> flag set but _does_ have bytes with the high bit set, then the driver
>>>> should issue a warning. The checking for high bit set is an extra cost
>>>> so this should only be enabled if tracing and/or an attribute is set
>>>> (perhaps called $dbh->{mysql_charset_checks} = 1)
>>>>
>>>> Tim.
>>>>
>
>
>
>

Re: Adding utf8 support to DBD::mysql

am 30.04.2006 23:00:28 von patg

Martin J. Evans wrote:

Martin,

Sure, I'll be glad to put in the code. If you can come up with tests,
that's one thing we need. Whatever you like, and whatever works for you.
I'll always give you due credit.

I also would like to know how to have a test end its while loop if a
version doesn't support the server it's running against - if you know
that, that would be fabulous. I came up with a way by querying the
database, then checking if a string is /^4/ or /^3/ (stored procs), if
it matches that, set $state = 1. I think there may be a better way!
(well, that'd be Test::More I think)

Before I roll a dev release, I'd like to make sure I have looked at and
added all of your patches that you have, as well as have tests for them.

Again, Thank you!

Patrick

> Patrick Galbraith wrote:
>
>> Martin J. Evans wrote:
>>
>> Martin,
>>
>> Thanks much! This is dbdimp.c, right?
>
> Yes
>
>> I will add this tomorrow (not working today),
>
>
> and I'm stricly not working tomorrow - bank holiday - but it is likely
> I'll find time to do something.
>
>> and test it out. If everything is a go, then I'll roll up a dev
>> 3.0003_1 releaseAlso, I've mentioned giving you write access to SVN
>> for DBD::mysql - do you want that? Also, what is your cpan username?
>
>
> Argh, tend to be a submitter/commentor rather than a creator.
> I've written DBIx::Log4perl but still have not submitted it
> as I have not yet got a cpan username.
>
> If you are happy giving me write access to DBD::mysql I'm happy to
> contribute although I think you've got most of my stuff now. If
> all it means is getting a cpan username and that is straight forward
> I'll do that also.
>
> The utf8 patch is very much a quick hack but it you were to submit
> the other patches to dbd::mysql first I'd happily investigate/test
> further. The reason I say this is that I'm finding it increasingly
> difficult to submit patches on top of patches that are not comitted
> yet.
>
> Martin
>
>
>> Kind regards,
>>
>> Patrick
>>
>>> Shamelessy stolen from Michael Kröll and against a patch on dbd::mysql
>>> on subversion I've sent to Patrick which is not committed yet:
>>>
>>> @@ -2915,6 +2914,9 @@
>>> if (dbis->debug >= 2)
>>> PerlIO_printf(DBILOGFP, "st_fetch string data %s\n",
>>> fbh->data);
>>> sv_setpvn(sv, fbh->data, fbh->length);
>>> + if (is_high_bit_set(fbh->data)) {
>>> + SvUTF8_on(sv);
>>> + }
>>> break;
>>> @@ -2968,6 +2970,10 @@
>>> { --len; }
>>> }
>>> sv_setpvn(sv, col, len);
>>> + if (is_high_bit_set(col)) {
>>> + SvUTF8_on(sv);
>>> + }
>>> + @@ -3881,3 +3901,11 @@
>>> return
>>> sv_2mortal(my_ulonglong2str(mysql_insert_id(&((imp_dbh_t*)im p_dbh)->mysql)));
>>>
>>> }
>>> #endif
>>> +
>>> +int is_high_bit_set(val)
>>> + char *val;
>>> +{
>>> + while (*val)
>>> + if (*val++ & 0x80) return 1;
>>> + return 0;
>>> +}
>>>
>>>
>>> would appear to make the previous code (below and without the
>>> Encode) I posted
>>> work fine.
>>>
>>> I hasten to add I've not tested this much (other than the posted
>>> test code).
>>>
>>> What I'm personally interested in is whether retrieving utf data from
>>> DBD::mysql and pushing it through JSON, JSON->objToJson works and I
>>> cannot
>>> easily test this until Tuesday.
>>>
>>> If Patrick can commit my previous changes I'll test this more.
>>>
>>> Martin
>>> --
>>> Martin J. Evans
>>> Easysoft Ltd, UK
>>> http://www.easysoft.com
>>>
>>>
>>> On 30-Apr-2006 Martin J. Evans wrote:
>>>
>>>
>>>> I hope I'm not muddying the waters but dbd::mysql UTF support for
>>>> returned
>>>> data (not metadata) seems to be nearly there. I need UTF8 support on
>>>> at least inserted and returned results-sets although I'm less bothered
>>>> by UTF8 table/column names. It would seem that if you define your
>>>> table as
>>>> using UTF8 then insertion is not a problem but retrieval is. The
>>>> following code nearly works - it is just the setting of the utf8
>>>> flag on
>>>> the returned data that is wrong:
>>>>
>>>> #!/usr/bin/perl -w
>>>> use strict;
>>>> use DBI qw(:utils);
>>>> use charnames ':full';
>>>> use Encode;
>>>>
>>>> print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
>>>> print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
>>>> my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}"; #
>>>> smiley
>>>> print join(" ", unpack("H*", $str)), "\n";
>>>> print "length(str) = ", length($str), "\n";
>>>> print "bytes::length(str) = ", bytes::length($str), "\n";
>>>>
>>>> print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
>>>> print "data_string_desc: ", data_string_desc($str),"\n";
>>>>
>>>> open OUT, ">uni.out";
>>>> binmode(OUT, ":utf8");
>>>> print OUT "$str\n";
>>>> # data written to uni.out is UTF8
>>>>
>>>> my $dbh = DBI->connect("dbi:mysql:test", "xxx", "xxx");
>>>> # there are posts on dbi-user as to whether both or either of
>>>> # the following should be set
>>>> $dbh->do("set character set utf8");
>>>> $dbh->do("set names utf8");
>>>> $dbh->do("drop table if exists utf");
>>>> $dbh->do("create table utf (a char(100)) default charset utf8");
>>>> my $sth = $dbh->prepare("insert into utf values (?)");
>>>> $sth->execute($str);
>>>> $sth = $dbh->prepare("select * from utf");
>>>> $sth->execute;
>>>>
>>>> my @row = $sth->fetchrow_array;
>>>> print "data_string_desc (after fetch): ",
>>>> data_string_desc($row[0]),"\n";
>>>> # the following shows we'e got the right data back
>>>> # but perl does not know it is utf8
>>>> print join(" ", unpack("H*", $row[0])), "\n";
>>>> # turning on utf8 causes the rignt uf8 sequence to be output
>>>> # and hence sv_utf8_upgrade(sv) will probably work
>>>> Encode::_utf8_on($row[0]);
>>>> print "data_string_desc (after fetch): ",
>>>> data_string_desc($row[0]),"\n";
>>>>
>>>> open OUT, ">utf.out";
>>>> binmode (OUT, ":utf8");
>>>> print OUT $row[0];
>>>> close OUT;
>>>> # data written to utf.out is not UTF8 unless is marked utf8
>>>>
>>>> produces:
>>>>
>>>> Is utf8::is_utf8 defined: 1
>>>> Is utf8::valid defined: 1
>>>> e298ba787878d790d8a7
>>>> length(str) = 6
>>>> bytes::length(str) = 10
>>>> utf8::is_utf8 = 1
>>>> data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
>>>> data_string_desc (after fetch): UTF8 off, non-ASCII, 10 characters
>>>> 10 bytes
>>>> e298ba787878d790d8a7
>>>>
>>>> with utf.out containing:
>>>>
>>>> <98>xxx<97><90><98>
>>>>
>>>> without that Encode::_utf8_on($row[0]);
>>>>
>>>> Michael Kröll (apr-10-2006)
>>>> posted a change for DBD::db2 which seemed to sort this out for DB2
>>>> so a similar change could be added to mysql.
>>>>
>>>> Hope this helps.
>>>>
>>>> Martin
>>>> --
>>>> Martin J. Evans
>>>> Easysoft Ltd, UK
>>>> http://www.easysoft.com
>>>>
>>>>
>>>> On 24-Apr-2006 Tim Bunce wrote:
>>>>
>>>>
>>>>> [I'm at the mysql conference and Patrick asked me about adding utf8
>>>>> support to DBD::mysql. I said I'd look at the libmysql docs and
>>>>> give my
>>>>> thoughts. I'm posting to dbi-dev since it may be of interest to
>>>>> others
>>>>> interested in enhancing DBD::mysql and to other driver developers.
>>>>> These are just random thoughts from a quick look at the docs.]
>>>>>
>>>>> The keys mysql docs seem to be
>>>>> http://dev.mysql.com/doc/refman/4.1/en/charset-connection.ht ml
>>>>>
>>>>> The mysql api and client->server protocol doesn't support passing
>>>>> characterset info to the server on a per-statement / per-bind
>>>>> value basis.
>>>>> (http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-state ment-datatypes.ht
>>>>>
>>>>> m
>>>>> l)
>>>>> So the sane way to send utf8 to the server is by setting the
>>>>> 'connection
>>>>> character set' to utf8 and then only sending utf8 (or its ASCII
>>>>> subset)
>>>>> to the server on that connection.
>>>>>
>>>>> *** Fetching data:
>>>>>
>>>>> MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD
>>>>> structure.
>>>>> It's the "character set number for the field".
>>>>>
>>>>> So set the UTF8 flag based on that value. Something like:
>>>>> (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>>>>> I couldn't see any docs for the values of the charsetnr field.
>>>>>
>>>>> Also, would be good to enable perl code to access the charsetnr
>>>>> values:
>>>>> $sth->{mysql_charsetnr}->[$i]
>>>>>
>>>>> *** Fetching Metadata:
>>>>>
>>>>> The above is a minimum. It doesn't address metadata like field names
>>>>> ($sth->{NAME}) that might also be in utf8. For that the driver
>>>>> needs to
>>>>> know if the 'connection character set' is currently utf8.
>>>>>
>>>>> (The docs mention mysql->charset but it's not clear if that's part of
>>>>> the public API.)
>>>>>
>>>>> However it's detected, the code needs to end up doing:
>>>>> (...connection charset is utf8...) ? SvUTF8_on(sv) :
>>>>> SvUTF8_off(sv);
>>>>> on the metadata.
>>>>>
>>>>>
>>>>> *** SET NAMES '...'
>>>>>
>>>>> Intercept SET NAMES and call the mysql_set_character_set() API
>>>>> instead.
>>>>> See
>>>>> http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-s et.html
>>>>>
>>>>>
>>>>> *** Detecting Inconsistencies
>>>>>
>>>>> If the connection character set is _not_ utf8 but the application
>>>>> calls
>>>>> the driver with data (or SQL statement) that has the UTF8 flag
>>>>> set, then
>>>>> it could issue a warning. In practice that may be to be too noisy for
>>>>> people that done their own workarounds for utf8 support. If so then
>>>>> they could be changes to level 1 trace messages.
>>>>>
>>>>> If the connection character set _is_ utf8, and the application calls
>>>>> the driver with data (or SQL statement) that does _not_ have the UTF8
>>>>> flag set but _does_ have bytes with the high bit set, then the driver
>>>>> should issue a warning. The checking for high bit set is an extra
>>>>> cost
>>>>> so this should only be enabled if tracing and/or an attribute is set
>>>>> (perhaps called $dbh->{mysql_charset_checks} = 1)
>>>>>
>>>>> Tim.
>>>>>
>>>>
>>
>>
>>
>>

Re: Adding utf8 support to DBD::mysql

am 30.04.2006 23:50:08 von Martin.Evans

Patrick Galbraith wrote:
> Martin J. Evans wrote:
>
> Martin,
>
> Sure, I'll be glad to put in the code. If you can come up with tests,
> that's one thing we need. Whatever you like, and whatever works for you.
> I'll always give you due credit.

Slightly confused as I think I provided tests for the changes I've
submitted.

Was there some changes I omitted tests for? (other than the utf8 hack
I just posted).

I noticed there were to dos in the the TODO file for more tests -
I can try and knock a few of those off during a free moment.

> I also would like to know how to have a test end its while loop if a
> version doesn't support the server it's running against - if you know
> that, that would be fabulous. I came up with a way by querying the
> database, then checking if a string is /^4/ or /^3/ (stored procs), if
> it matches that, set $state = 1. I think there may be a better way!
> (well, that'd be Test::More I think)

I'll look in to it but don't know off the top of my head.
I saw something in the Testing() fn about skipping/ignoring tests.

> Before I roll a dev release, I'd like to make sure I have looked at and
> added all of your patches that you have, as well as have tests for them.

Great - let me know when it is in subversion and I'll take another look.

Martin

> Again, Thank you!
>
> Patrick
>
>> Patrick Galbraith wrote:
>>
>>> Martin J. Evans wrote:
>>>
>>> Martin,
>>>
>>> Thanks much! This is dbdimp.c, right?
>>
>>
>> Yes
>>
>>> I will add this tomorrow (not working today),
>>
>>
>>
>> and I'm stricly not working tomorrow - bank holiday - but it is likely
>> I'll find time to do something.
>>
>>> and test it out. If everything is a go, then I'll roll up a dev
>>> 3.0003_1 releaseAlso, I've mentioned giving you write access to SVN
>>> for DBD::mysql - do you want that? Also, what is your cpan username?
>>
>>
>>
>> Argh, tend to be a submitter/commentor rather than a creator.
>> I've written DBIx::Log4perl but still have not submitted it
>> as I have not yet got a cpan username.
>>
>> If you are happy giving me write access to DBD::mysql I'm happy to
>> contribute although I think you've got most of my stuff now. If
>> all it means is getting a cpan username and that is straight forward
>> I'll do that also.
>>
>> The utf8 patch is very much a quick hack but it you were to submit
>> the other patches to dbd::mysql first I'd happily investigate/test
>> further. The reason I say this is that I'm finding it increasingly
>> difficult to submit patches on top of patches that are not comitted
>> yet.
>>
>> Martin
>>
>>
>>> Kind regards,
>>>
>>> Patrick
>>>
>>>> Shamelessy stolen from Michael Kröll and against a patch on dbd::mysql
>>>> on subversion I've sent to Patrick which is not committed yet:
>>>>
>>>> @@ -2915,6 +2914,9 @@
>>>> if (dbis->debug >= 2)
>>>> PerlIO_printf(DBILOGFP, "st_fetch string data %s\n",
>>>> fbh->data);
>>>> sv_setpvn(sv, fbh->data, fbh->length);
>>>> + if (is_high_bit_set(fbh->data)) {
>>>> + SvUTF8_on(sv);
>>>> + }
>>>> break;
>>>> @@ -2968,6 +2970,10 @@
>>>> { --len; }
>>>> }
>>>> sv_setpvn(sv, col, len);
>>>> + if (is_high_bit_set(col)) {
>>>> + SvUTF8_on(sv);
>>>> + }
>>>> + @@ -3881,3 +3901,11 @@
>>>> return
>>>> sv_2mortal(my_ulonglong2str(mysql_insert_id(&((imp_dbh_t*)im p_dbh)->mysql)));
>>>>
>>>> }
>>>> #endif
>>>> +
>>>> +int is_high_bit_set(val)
>>>> + char *val;
>>>> +{
>>>> + while (*val)
>>>> + if (*val++ & 0x80) return 1;
>>>> + return 0;
>>>> +}
>>>>
>>>>
>>>> would appear to make the previous code (below and without the
>>>> Encode) I posted
>>>> work fine.
>>>>
>>>> I hasten to add I've not tested this much (other than the posted
>>>> test code).
>>>>
>>>> What I'm personally interested in is whether retrieving utf data from
>>>> DBD::mysql and pushing it through JSON, JSON->objToJson works and I
>>>> cannot
>>>> easily test this until Tuesday.
>>>>
>>>> If Patrick can commit my previous changes I'll test this more.
>>>>
>>>> Martin
>>>> --
>>>> Martin J. Evans
>>>> Easysoft Ltd, UK
>>>> http://www.easysoft.com
>>>>
>>>>
>>>> On 30-Apr-2006 Martin J. Evans wrote:
>>>>
>>>>
>>>>> I hope I'm not muddying the waters but dbd::mysql UTF support for
>>>>> returned
>>>>> data (not metadata) seems to be nearly there. I need UTF8 support on
>>>>> at least inserted and returned results-sets although I'm less bothered
>>>>> by UTF8 table/column names. It would seem that if you define your
>>>>> table as
>>>>> using UTF8 then insertion is not a problem but retrieval is. The
>>>>> following code nearly works - it is just the setting of the utf8
>>>>> flag on
>>>>> the returned data that is wrong:
>>>>>
>>>>> #!/usr/bin/perl -w
>>>>> use strict;
>>>>> use DBI qw(:utils);
>>>>> use charnames ':full';
>>>>> use Encode;
>>>>>
>>>>> print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
>>>>> print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
>>>>> my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}"; #
>>>>> smiley
>>>>> print join(" ", unpack("H*", $str)), "\n";
>>>>> print "length(str) = ", length($str), "\n";
>>>>> print "bytes::length(str) = ", bytes::length($str), "\n";
>>>>>
>>>>> print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
>>>>> print "data_string_desc: ", data_string_desc($str),"\n";
>>>>>
>>>>> open OUT, ">uni.out";
>>>>> binmode(OUT, ":utf8");
>>>>> print OUT "$str\n";
>>>>> # data written to uni.out is UTF8
>>>>>
>>>>> my $dbh = DBI->connect("dbi:mysql:test", "xxx", "xxx");
>>>>> # there are posts on dbi-user as to whether both or either of
>>>>> # the following should be set
>>>>> $dbh->do("set character set utf8");
>>>>> $dbh->do("set names utf8");
>>>>> $dbh->do("drop table if exists utf");
>>>>> $dbh->do("create table utf (a char(100)) default charset utf8");
>>>>> my $sth = $dbh->prepare("insert into utf values (?)");
>>>>> $sth->execute($str);
>>>>> $sth = $dbh->prepare("select * from utf");
>>>>> $sth->execute;
>>>>>
>>>>> my @row = $sth->fetchrow_array;
>>>>> print "data_string_desc (after fetch): ",
>>>>> data_string_desc($row[0]),"\n";
>>>>> # the following shows we'e got the right data back
>>>>> # but perl does not know it is utf8
>>>>> print join(" ", unpack("H*", $row[0])), "\n";
>>>>> # turning on utf8 causes the rignt uf8 sequence to be output
>>>>> # and hence sv_utf8_upgrade(sv) will probably work
>>>>> Encode::_utf8_on($row[0]);
>>>>> print "data_string_desc (after fetch): ",
>>>>> data_string_desc($row[0]),"\n";
>>>>>
>>>>> open OUT, ">utf.out";
>>>>> binmode (OUT, ":utf8");
>>>>> print OUT $row[0];
>>>>> close OUT;
>>>>> # data written to utf.out is not UTF8 unless is marked utf8
>>>>>
>>>>> produces:
>>>>>
>>>>> Is utf8::is_utf8 defined: 1
>>>>> Is utf8::valid defined: 1
>>>>> e298ba787878d790d8a7
>>>>> length(str) = 6
>>>>> bytes::length(str) = 10
>>>>> utf8::is_utf8 = 1
>>>>> data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
>>>>> data_string_desc (after fetch): UTF8 off, non-ASCII, 10 characters
>>>>> 10 bytes
>>>>> e298ba787878d790d8a7
>>>>>
>>>>> with utf.out containing:
>>>>>
>>>>> <98>xxx<97><90><98>
>>>>>
>>>>> without that Encode::_utf8_on($row[0]);
>>>>>
>>>>> Michael Kröll (apr-10-2006)
>>>>> posted a change for DBD::db2 which seemed to sort this out for DB2
>>>>> so a similar change could be added to mysql.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> Martin
>>>>> --
>>>>> Martin J. Evans
>>>>> Easysoft Ltd, UK
>>>>> http://www.easysoft.com
>>>>>
>>>>>
>>>>> On 24-Apr-2006 Tim Bunce wrote:
>>>>>
>>>>>
>>>>>> [I'm at the mysql conference and Patrick asked me about adding utf8
>>>>>> support to DBD::mysql. I said I'd look at the libmysql docs and
>>>>>> give my
>>>>>> thoughts. I'm posting to dbi-dev since it may be of interest to
>>>>>> others
>>>>>> interested in enhancing DBD::mysql and to other driver developers.
>>>>>> These are just random thoughts from a quick look at the docs.]
>>>>>>
>>>>>> The keys mysql docs seem to be
>>>>>> http://dev.mysql.com/doc/refman/4.1/en/charset-connection.ht ml
>>>>>>
>>>>>> The mysql api and client->server protocol doesn't support passing
>>>>>> characterset info to the server on a per-statement / per-bind
>>>>>> value basis.
>>>>>> (http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-state ment-datatypes.ht
>>>>>>
>>>>>> m
>>>>>> l)
>>>>>> So the sane way to send utf8 to the server is by setting the
>>>>>> 'connection
>>>>>> character set' to utf8 and then only sending utf8 (or its ASCII
>>>>>> subset)
>>>>>> to the server on that connection.
>>>>>>
>>>>>> *** Fetching data:
>>>>>>
>>>>>> MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD
>>>>>> structure.
>>>>>> It's the "character set number for the field".
>>>>>>
>>>>>> So set the UTF8 flag based on that value. Something like:
>>>>>> (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>>>>>> I couldn't see any docs for the values of the charsetnr field.
>>>>>>
>>>>>> Also, would be good to enable perl code to access the charsetnr
>>>>>> values:
>>>>>> $sth->{mysql_charsetnr}->[$i]
>>>>>>
>>>>>> *** Fetching Metadata:
>>>>>>
>>>>>> The above is a minimum. It doesn't address metadata like field names
>>>>>> ($sth->{NAME}) that might also be in utf8. For that the driver
>>>>>> needs to
>>>>>> know if the 'connection character set' is currently utf8.
>>>>>>
>>>>>> (The docs mention mysql->charset but it's not clear if that's part of
>>>>>> the public API.)
>>>>>>
>>>>>> However it's detected, the code needs to end up doing:
>>>>>> (...connection charset is utf8...) ? SvUTF8_on(sv) :
>>>>>> SvUTF8_off(sv);
>>>>>> on the metadata.
>>>>>>
>>>>>>
>>>>>> *** SET NAMES '...'
>>>>>>
>>>>>> Intercept SET NAMES and call the mysql_set_character_set() API
>>>>>> instead.
>>>>>> See
>>>>>> http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-s et.html
>>>>>>
>>>>>>
>>>>>> *** Detecting Inconsistencies
>>>>>>
>>>>>> If the connection character set is _not_ utf8 but the application
>>>>>> calls
>>>>>> the driver with data (or SQL statement) that has the UTF8 flag
>>>>>> set, then
>>>>>> it could issue a warning. In practice that may be to be too noisy for
>>>>>> people that done their own workarounds for utf8 support. If so then
>>>>>> they could be changes to level 1 trace messages.
>>>>>>
>>>>>> If the connection character set _is_ utf8, and the application calls
>>>>>> the driver with data (or SQL statement) that does _not_ have the UTF8
>>>>>> flag set but _does_ have bytes with the high bit set, then the driver
>>>>>> should issue a warning. The checking for high bit set is an extra
>>>>>> cost
>>>>>> so this should only be enabled if tracing and/or an attribute is set
>>>>>> (perhaps called $dbh->{mysql_charset_checks} = 1)
>>>>>>
>>>>>> Tim.
>>>>>>
>>>>>
>>>>>
>>>
>>>
>>>
>>>
>
>
>

Re: Adding utf8 support to DBD::mysql

am 01.05.2006 00:20:36 von patg

Martin J. Evans wrote:

> Patrick Galbraith wrote:
>
>> Martin J. Evans wrote:
>>
>> Martin,
>>
>> Sure, I'll be glad to put in the code. If you can come up with tests,
>> that's one thing we need. Whatever you like, and whatever works for
>> you. I'll always give you due credit.
>
>
> Slightly confused as I think I provided tests for the changes I've
> submitted.

Martin,

Yes, I know - you've given plenty of tests - I meant in general, that we
could use more tests overall. Sorry for that misunderstanding ;)

>
> Was there some changes I omitted tests for? (other than the utf8 hack
> I just posted).
>
> I noticed there were to dos in the the TODO file for more tests -
> I can try and knock a few of those off during a free moment.
>
>> I also would like to know how to have a test end its while loop if a
>> version doesn't support the server it's running against - if you know
>> that, that would be fabulous. I came up with a way by querying the
>> database, then checking if a string is /^4/ or /^3/ (stored procs),
>> if it matches that, set $state = 1. I think there may be a better
>> way! (well, that'd be Test::More I think)
>
>
> I'll look in to it but don't know off the top of my head.
> I saw something in the Testing() fn about skipping/ignoring tests.
>
>> Before I roll a dev release, I'd like to make sure I have looked at
>> and added all of your patches that you have, as well as have tests
>> for them.
>
>
> Great - let me know when it is in subversion and I'll take another look.


Thanks again! Sorry if you took my last mail to main that you didn't
give tests!

Kind regards,

Patrick

>
> Martin
>
>> Again, Thank you!
>>
>> Patrick
>>
>>> Patrick Galbraith wrote:
>>>
>>>> Martin J. Evans wrote:
>>>>
>>>> Martin,
>>>>
>>>> Thanks much! This is dbdimp.c, right?
>>>
>>>
>>>
>>> Yes
>>>
>>>> I will add this tomorrow (not working today),
>>>
>>>
>>>
>>>
>>> and I'm stricly not working tomorrow - bank holiday - but it is likely
>>> I'll find time to do something.
>>>
>>>> and test it out. If everything is a go, then I'll roll up a dev
>>>> 3.0003_1 releaseAlso, I've mentioned giving you write access to SVN
>>>> for DBD::mysql - do you want that? Also, what is your cpan username?
>>>
>>>
>>>
>>>
>>> Argh, tend to be a submitter/commentor rather than a creator.
>>> I've written DBIx::Log4perl but still have not submitted it
>>> as I have not yet got a cpan username.
>>>
>>> If you are happy giving me write access to DBD::mysql I'm happy to
>>> contribute although I think you've got most of my stuff now. If
>>> all it means is getting a cpan username and that is straight forward
>>> I'll do that also.
>>>
>>> The utf8 patch is very much a quick hack but it you were to submit
>>> the other patches to dbd::mysql first I'd happily investigate/test
>>> further. The reason I say this is that I'm finding it increasingly
>>> difficult to submit patches on top of patches that are not comitted
>>> yet.
>>>
>>> Martin
>>>
>>>
>>>> Kind regards,
>>>>
>>>> Patrick
>>>>
>>>>> Shamelessy stolen from Michael Kröll and against a patch on
>>>>> dbd::mysql
>>>>> on subversion I've sent to Patrick which is not committed yet:
>>>>>
>>>>> @@ -2915,6 +2914,9 @@
>>>>> if (dbis->debug >= 2)
>>>>> PerlIO_printf(DBILOGFP, "st_fetch string data %s\n",
>>>>> fbh->data);
>>>>> sv_setpvn(sv, fbh->data, fbh->length);
>>>>> + if (is_high_bit_set(fbh->data)) {
>>>>> + SvUTF8_on(sv);
>>>>> + }
>>>>> break;
>>>>> @@ -2968,6 +2970,10 @@
>>>>> { --len; }
>>>>> }
>>>>> sv_setpvn(sv, col, len);
>>>>> + if (is_high_bit_set(col)) {
>>>>> + SvUTF8_on(sv);
>>>>> + }
>>>>> + @@ -3881,3 +3901,11 @@
>>>>> return
>>>>> sv_2mortal(my_ulonglong2str(mysql_insert_id(&((imp_dbh_t*)im p_dbh)->mysql)));
>>>>>
>>>>> }
>>>>> #endif
>>>>> +
>>>>> +int is_high_bit_set(val)
>>>>> + char *val;
>>>>> +{
>>>>> + while (*val)
>>>>> + if (*val++ & 0x80) return 1;
>>>>> + return 0;
>>>>> +}
>>>>>
>>>>>
>>>>> would appear to make the previous code (below and without the
>>>>> Encode) I posted
>>>>> work fine.
>>>>>
>>>>> I hasten to add I've not tested this much (other than the posted
>>>>> test code).
>>>>>
>>>>> What I'm personally interested in is whether retrieving utf data from
>>>>> DBD::mysql and pushing it through JSON, JSON->objToJson works and
>>>>> I cannot
>>>>> easily test this until Tuesday.
>>>>>
>>>>> If Patrick can commit my previous changes I'll test this more.
>>>>>
>>>>> Martin
>>>>> --
>>>>> Martin J. Evans
>>>>> Easysoft Ltd, UK
>>>>> http://www.easysoft.com
>>>>>
>>>>>
>>>>> On 30-Apr-2006 Martin J. Evans wrote:
>>>>>
>>>>>
>>>>>> I hope I'm not muddying the waters but dbd::mysql UTF support for
>>>>>> returned
>>>>>> data (not metadata) seems to be nearly there. I need UTF8 support on
>>>>>> at least inserted and returned results-sets although I'm less
>>>>>> bothered
>>>>>> by UTF8 table/column names. It would seem that if you define your
>>>>>> table as
>>>>>> using UTF8 then insertion is not a problem but retrieval is. The
>>>>>> following code nearly works - it is just the setting of the utf8
>>>>>> flag on
>>>>>> the returned data that is wrong:
>>>>>>
>>>>>> #!/usr/bin/perl -w
>>>>>> use strict;
>>>>>> use DBI qw(:utils);
>>>>>> use charnames ':full';
>>>>>> use Encode;
>>>>>>
>>>>>> print "Is utf8::is_utf8 defined: ", defined &utf8::is_utf8, "\n";
>>>>>> print "Is utf8::valid defined: ", defined &utf8::valid, "\n";
>>>>>> my $str = "\x{263a}xxx" . chr(0x05d0) . "\N{ARABIC LETTER ALEF}";
>>>>>> # smiley
>>>>>> print join(" ", unpack("H*", $str)), "\n";
>>>>>> print "length(str) = ", length($str), "\n";
>>>>>> print "bytes::length(str) = ", bytes::length($str), "\n";
>>>>>>
>>>>>> print "utf8::is_utf8 = ", utf8::is_utf8($str) ? 1 : 0, "\n";
>>>>>> print "data_string_desc: ", data_string_desc($str),"\n";
>>>>>>
>>>>>> open OUT, ">uni.out";
>>>>>> binmode(OUT, ":utf8");
>>>>>> print OUT "$str\n";
>>>>>> # data written to uni.out is UTF8
>>>>>>
>>>>>> my $dbh = DBI->connect("dbi:mysql:test", "xxx", "xxx");
>>>>>> # there are posts on dbi-user as to whether both or either of
>>>>>> # the following should be set
>>>>>> $dbh->do("set character set utf8");
>>>>>> $dbh->do("set names utf8");
>>>>>> $dbh->do("drop table if exists utf");
>>>>>> $dbh->do("create table utf (a char(100)) default charset utf8");
>>>>>> my $sth = $dbh->prepare("insert into utf values (?)");
>>>>>> $sth->execute($str);
>>>>>> $sth = $dbh->prepare("select * from utf");
>>>>>> $sth->execute;
>>>>>>
>>>>>> my @row = $sth->fetchrow_array;
>>>>>> print "data_string_desc (after fetch): ",
>>>>>> data_string_desc($row[0]),"\n";
>>>>>> # the following shows we'e got the right data back
>>>>>> # but perl does not know it is utf8
>>>>>> print join(" ", unpack("H*", $row[0])), "\n";
>>>>>> # turning on utf8 causes the rignt uf8 sequence to be output
>>>>>> # and hence sv_utf8_upgrade(sv) will probably work
>>>>>> Encode::_utf8_on($row[0]);
>>>>>> print "data_string_desc (after fetch): ",
>>>>>> data_string_desc($row[0]),"\n";
>>>>>>
>>>>>> open OUT, ">utf.out";
>>>>>> binmode (OUT, ":utf8");
>>>>>> print OUT $row[0];
>>>>>> close OUT;
>>>>>> # data written to utf.out is not UTF8 unless is marked utf8
>>>>>>
>>>>>> produces:
>>>>>>
>>>>>> Is utf8::is_utf8 defined: 1
>>>>>> Is utf8::valid defined: 1
>>>>>> e298ba787878d790d8a7
>>>>>> length(str) = 6
>>>>>> bytes::length(str) = 10
>>>>>> utf8::is_utf8 = 1
>>>>>> data_string_desc: UTF8 on, non-ASCII, 6 characters 10 bytes
>>>>>> data_string_desc (after fetch): UTF8 off, non-ASCII, 10
>>>>>> characters 10 bytes
>>>>>> e298ba787878d790d8a7
>>>>>>
>>>>>> with utf.out containing:
>>>>>>
>>>>>> <98>xxx<97><90><98>
>>>>>>
>>>>>> without that Encode::_utf8_on($row[0]);
>>>>>>
>>>>>> Michael Kröll (apr-10-2006)
>>>>>> posted a change for DBD::db2 which seemed to sort this out for DB2
>>>>>> so a similar change could be added to mysql.
>>>>>>
>>>>>> Hope this helps.
>>>>>>
>>>>>> Martin
>>>>>> --
>>>>>> Martin J. Evans
>>>>>> Easysoft Ltd, UK
>>>>>> http://www.easysoft.com
>>>>>>
>>>>>>
>>>>>> On 24-Apr-2006 Tim Bunce wrote:
>>>>>>
>>>>>>
>>>>>>> [I'm at the mysql conference and Patrick asked me about adding utf8
>>>>>>> support to DBD::mysql. I said I'd look at the libmysql docs and
>>>>>>> give my
>>>>>>> thoughts. I'm posting to dbi-dev since it may be of interest to
>>>>>>> others
>>>>>>> interested in enhancing DBD::mysql and to other driver developers.
>>>>>>> These are just random thoughts from a quick look at the docs.]
>>>>>>>
>>>>>>> The keys mysql docs seem to be
>>>>>>> http://dev.mysql.com/doc/refman/4.1/en/charset-connection.ht ml
>>>>>>>
>>>>>>> The mysql api and client->server protocol doesn't support passing
>>>>>>> characterset info to the server on a per-statement / per-bind
>>>>>>> value basis.
>>>>>>> (http://dev.mysql.com/doc/refman/4.1/en/c-api-prepared-state ment-datatypes.ht
>>>>>>>
>>>>>>> m
>>>>>>> l)
>>>>>>> So the sane way to send utf8 to the server is by setting the
>>>>>>> 'connection
>>>>>>> character set' to utf8 and then only sending utf8 (or its ASCII
>>>>>>> subset)
>>>>>>> to the server on that connection.
>>>>>>>
>>>>>>> *** Fetching data:
>>>>>>>
>>>>>>> MySQL 4.1.0 added "unsigned int charsetnr" to the MYSQL_FIELD
>>>>>>> structure.
>>>>>>> It's the "character set number for the field".
>>>>>>>
>>>>>>> So set the UTF8 flag based on that value. Something like:
>>>>>>> (field->charsetnr = ???) ? SvUTF8_on(sv) : SvUTF8_off(sv);
>>>>>>> I couldn't see any docs for the values of the charsetnr field.
>>>>>>>
>>>>>>> Also, would be good to enable perl code to access the charsetnr
>>>>>>> values:
>>>>>>> $sth->{mysql_charsetnr}->[$i]
>>>>>>>
>>>>>>> *** Fetching Metadata:
>>>>>>>
>>>>>>> The above is a minimum. It doesn't address metadata like field
>>>>>>> names
>>>>>>> ($sth->{NAME}) that might also be in utf8. For that the driver
>>>>>>> needs to
>>>>>>> know if the 'connection character set' is currently utf8.
>>>>>>>
>>>>>>> (The docs mention mysql->charset but it's not clear if that's
>>>>>>> part of
>>>>>>> the public API.)
>>>>>>>
>>>>>>> However it's detected, the code needs to end up doing:
>>>>>>> (...connection charset is utf8...) ? SvUTF8_on(sv) :
>>>>>>> SvUTF8_off(sv);
>>>>>>> on the metadata.
>>>>>>>
>>>>>>>
>>>>>>> *** SET NAMES '...'
>>>>>>>
>>>>>>> Intercept SET NAMES and call the mysql_set_character_set() API
>>>>>>> instead.
>>>>>>> See
>>>>>>> http://dev.mysql.com/doc/refman/4.1/en/mysql-set-character-s et.html
>>>>>>>
>>>>>>>
>>>>>>> *** Detecting Inconsistencies
>>>>>>>
>>>>>>> If the connection character set is _not_ utf8 but the
>>>>>>> application calls
>>>>>>> the driver with data (or SQL statement) that has the UTF8 flag
>>>>>>> set, then
>>>>>>> it could issue a warning. In practice that may be to be too
>>>>>>> noisy for
>>>>>>> people that done their own workarounds for utf8 support. If so then
>>>>>>> they could be changes to level 1 trace messages.
>>>>>>>
>>>>>>> If the connection character set _is_ utf8, and the application
>>>>>>> calls
>>>>>>> the driver with data (or SQL statement) that does _not_ have the
>>>>>>> UTF8
>>>>>>> flag set but _does_ have bytes with the high bit set, then the
>>>>>>> driver
>>>>>>> should issue a warning. The checking for high bit set is an
>>>>>>> extra cost
>>>>>>> so this should only be enabled if tracing and/or an attribute is
>>>>>>> set
>>>>>>> (perhaps called $dbh->{mysql_charset_checks} = 1)
>>>>>>>
>>>>>>> Tim.
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>>
>>
>>
>>