What to do with UTF-8 data?

on 10.09.2003 17:51:14 by Steve Hay

Hi,

I have a question regarding the handling of UTF-8 data using the
DBD-mysql driver.

I'm running Perl 5.8.0 with MySQL 3.23.56 (via DBD-mysql). Since (I
think) there is no native UTF-8 support in MySQL below 4.1.x, my plan
was to simply store the bytes of each UTF-8 character in the database
(converting Perl's UTF-8 strings to sequences of octets using
Encode::encode_utf8()), and then convert such octet sequences back to
UTF-8 strings when retrieving data from the database using
Encode::decode_utf8().

As long as I store *all* my data in this way, those conversions should
always succeed. (encode_utf8() never fails anyway, and decode_utf8()
will always work here because I'm always feeding it "valid" data.)

The problem is: How do I trap all input/output to/from DBI to do these
conversions?

I can easily do it "manually":

$dbh->do('INSERT INTO foo (bar) VALUES (?)', undef,
Encode::encode_utf8($input_utf8str));
...
my @octets_row = $dbh->selectrow_array('SELECT bar FROM foo');
my $output_utf8str2 = Encode::decode_utf8($octets_row[0]);

but that's way too tedious in practice. I want to have those
conversions done for me automatically.
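
One way to cut down the tedium without driver support is a pair of small wrapper functions. This is only a sketch: to_octets() and from_octets() are hypothetical names, not part of DBI or DBD-mysql, and the round trip below runs without a database so it can be tried as-is.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helpers, not part of DBI or DBD-mysql.
# Encode every defined value to UTF-8 octets before it is bound.
sub to_octets {
    return map { defined $_ ? Encode::encode_utf8($_) : undef } @_;
}

# Decode every defined value fetched from the database back to characters.
sub from_octets {
    return map { defined $_ ? Encode::decode_utf8($_) : undef } @_;
}

# Round trip without a database: $stored is what the column would hold.
my $input      = "Copyright \x{A9} Fred Bloggs";
my ($stored)   = to_octets($input);
my ($restored) = from_octets($stored);

printf "%d characters in, %d octets stored, round trip %s\n",
    length($input), length($stored), ($restored eq $input ? 'ok' : 'FAILED');
```

With a real handle the calls would presumably look like $dbh->do($sql, undef, to_octets(@values)) and from_octets($dbh->selectrow_array($sql)). That still has to be written at every call site, which is why a driver-level option is more attractive.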

I've asked about this on the dbi-users mailing list, and the answer
(from Tim Bunce, no less) was that it is really the responsibility of
the DBD driver to perform such conversions if the data in question is UTF-8.

He noted that MySQL 4.1.x provides information about the character set
of a given column, so DBD-mysql should be able to use that.

For those of us not wanting to take the plunge just yet with what is,
after all, still alpha software, he suggested that a driver private
option could be used to indicate that all char fields are UTF-8, or else
have some way of indicating that per-column, such as

$sth->bind_col(1, undef, { mysql_charset => 'utf8' });

Is there any such functionality currently in DBD-mysql, or any chance of
it or something similar being added soonish?

I could really use some means of telling DBD-mysql to do the conversions
outlined above on all my char data.

Thanks,
- Steve

--
MySQL Perl Mailing List
For list archives: http://lists.mysql.com/perl
To unsubscribe: http://lists.mysql.com/perl?unsub=gcdmp-msql-mysql-modules@m.gmane.org

Re: What to do with UTF-8 data?

on 11.09.2003 12:49:13 by Tim Bunce

On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote:
> Hi, Steve,
>
> > The problem is: How do I trap all input/output to/from DBI to do these
> > conversions?
>
> > I've asked about this on the dbi-users mailing list, and the answer
> > (from Tim Bunce, no less) was that it is really the responsibility of
> > the DBD driver to perform such conversions if the data in question is UTF-8.

That's not quite right. I wasn't talking about any _conversions_ at all.

> after letting my thoughts settle I come to the conclusion that I do not
> agree completely. I think that DBI should do 80% of the job and leave
> about 20% to the driver authors.

For a "full solution" yes, I agree - and I've written about this in the past.

For now I'm just talking about the specific but fairly common
situation of fetching data that is utf8 encoded but it doesn't
get flagged as such by the driver.

For that case the driver just needs to know when to do a SvUTF8_on(sv).
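
SvUTF8_on() is a C-level macro, so it can't be called from plain Perl. As a rough Perl-level illustration of the same idea, utf8::decode() is used below as a stand-in. Note that it is not identical: it validates the octets first and only then marks the scalar as UTF-8, whereas SvUTF8_on() just sets the flag.

```perl
use strict;
use warnings;

# The UTF-8 octets for "élève", as a driver might hand them back unflagged.
my $octets = "\xC3\xA9l\xC3\xA8ve";

my $chars = $octets;                 # work on a copy
my $ok    = utf8::decode($chars);    # in place; true if well-formed UTF-8

printf "%d octets, %d characters, flagged: %s\n",
    length($octets), length($chars), (utf8::is_utf8($chars) ? 'yes' : 'no');
```

The bytes in memory are the same before and after; only Perl's interpretation of them changes, which is why no real conversion is involved.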

Tim.

Re: What to do with UTF-8 data?

on 11.09.2003 13:31:52 by Steve Hay

Tim Bunce wrote:

>On Thu, Sep 11, 2003 at 08:29:50AM +0200, Jochen Wiedmann wrote:
>
>>Hi, Steve,
>>
>>>The problem is: How do I trap all input/output to/from DBI to do these
>>>conversions?
>>>
>>>I've asked about this on the dbi-users mailing list, and the answer
>>>(from Tim Bunce, no less) was that it is really the responsibility of
>>>the DBD driver to perform such conversions if the data in question is UTF-8.
>
>That's not quite right. I wasn't talking about any _conversions_ at all.
>
I'm sorry if I mis-quoted you. I meant setting the UTF-8 flag on an
octet sequence that can be interpreted as UTF-8, rather than leaving it
unflagged and treated as Latin-1. Thus, the data is in some sense
"converted" from Latin-1 to UTF-8.

>>after letting my thoughts settle I come to the conclusion that I do not
>>agree completely. I think that DBI should do 80% of the job and leave
>>about 20% to the driver authors.
>
>For a "full solution" yes, I agree - and I've written about this in the past.
>
>For now I'm just talking about the specific but fairly common
>situation of fetching data that is utf8 encoded but it doesn't
>get flagged as such by the driver.
>
>For that case the driver just needs to know when to do a SvUTF8_on(sv).
>
Exactly.

What about data going _into_ the database? In my examples of doing the
conversion manually with Encode::{en|de}code_utf8(), I was converting
the Perl strings to octet sequences that could later be interpreted as
UTF-8 before insertion into the database. That way I could guarantee
that all data retrieved from the database can be converted to UTF-8, in
fact (as you pointed out) by simply turning the UTF-8 flag on.

If all the data that I insert really is UTF-8 then I guess it will just
get "serialised" as a sequence of octets, and everything will be OK.

But what if the data I'm inserting isn't all UTF-8? The problem is:

1. Perl's internal format isn't just UTF-8 -- it defaults to Latin-1 (or
whatever) for strings in which every character can be represented in
Latin-1;
2. The "8-bit" characters of Latin-1 are represented as two-byte
characters in UTF-8.

So, if I have the string "Copyright © Fred Bloggs" in Perl then it will
not be UTF-8: the © is stored as one byte, not two, and the UTF-8
flag is off. If I insert that straight into the database without
running it through Encode::encode_utf8() first, then © itself, rather
than its two-byte UTF-8 representation, gets stored in the database, so
when it gets retrieved from the database later you can't just turn the
UTF-8 flag on -- you would need to run it through Encode::decode_utf8().

In other words, just having the driver switch the UTF-8 flag on and off
will only work if I guarantee that all the strings I feed it to start
with really are UTF-8, even when Perl would not normally have
represented them as such.
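
The difference is easy to see at the byte level. The following sketch only inspects the bytes involved and does not touch a database; it assumes nothing beyond the core utf8:: functions and the Encode module.

```perl
use strict;
use warnings;
use Encode ();

my $text = "Copyright \x{A9} Fred Bloggs";

# Perl keeps this string in its one-byte-per-character form, flag off.
my $flagged = utf8::is_utf8($text) ? 1 : 0;

# What would reach the database without and with encode_utf8().
my $raw_hex     = unpack 'H*', "\xA9";
my $encoded_hex = unpack 'H*', Encode::encode_utf8("\xA9");

print "flag=$flagged raw=$raw_hex encoded=$encoded_hex\n";
# prints "flag=0 raw=a9 encoded=c2a9"
```

The lone byte a9 is not a well-formed UTF-8 sequence, so simply flagging it on retrieval would leave Perl holding a malformed string.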

It would be cool if something akin to "binmode STDOUT, ':utf8';" could
be applied when sending data to the driver -- i.e. my data is in "Perl's
internal format", whether that be Latin-1 or UTF-8 in the case of the
string at hand, and it all gets automagically upgraded to UTF-8 if
necessary before insertion into the database. Then you only need to
turn the flag on when retrieving it again.

At least, I think that's what I want :-s

- Steve

Re: What to do with UTF-8 data?

on 11.09.2003 16:25:56 by Bart Lateur

On Thu, 11 Sep 2003 12:31:52 +0100, Steve Hay wrote:

>It would be cool if something akin to "binmode STDOUT, ':utf8';" could
>be applied when sending data to the driver -- i.e. my data is in "Perl's
>internal format", whether that be Latin-1 or UTF-8 in the case of the
>string at hand, and it all gets automagically upgraded to UTF-8 if
>necessary before insertion into the database.

Oh that's easy to achieve. Just concatenate the string with an UTF-8
string, and you'll get an UTF-8 string. Perl will do the upgrading for
you.

Just try it:

my $zero_length_utf8 = pack "U0";   # UTF8, length == 0
my $string = "\xE9l\xE8ve";         # "élève" as Latin-1
$string .= $zero_length_utf8;       # upgrade to UTF8
print $string;

Now the reverse is much harder... :)
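
For comparison, the core utf8::upgrade() and utf8::downgrade() functions do the same job explicitly. This is a small sketch of both directions, not a statement about what DBD-mysql does with the result: upgrading changes only Perl's internal representation, and whether a driver then sends the UTF-8 bytes is a separate question.

```perl
use strict;
use warnings;

my $string = "\xE9l\xE8ve";          # "élève", stored one byte per character
my $before = utf8::is_utf8($string) ? 1 : 0;

utf8::upgrade($string);              # switch the internal form to UTF-8
my $after  = utf8::is_utf8($string) ? 1 : 0;

# The reverse direction: with a true second argument, downgrade returns
# false instead of dying when a character above 0xFF is present.
my $wide = "\x{263A}";
my $back = utf8::downgrade($wide, 1) ? 1 : 0;

print "before=$before after=$after length=", length($string),
      " downgrade_wide=$back\n";
# prints "before=0 after=1 length=5 downgrade_wide=0"
```

The string still compares equal to its original value after the upgrade; only the storage format differs.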

--
Bart.

Re: What to do with UTF-8 data?

on 11.09.2003 17:56:52 by Rudy Lippan

On Wed, 10 Sep 2003, Steve Hay wrote:

>
> Is there any such functionality currently in DBD-mysql, or any chance of
> it or something similar being added soonish?
>

Not right now, but I expect this to go into the next non-bugfix version
which will have more support for the 4.1.x features; and I don't expect it
to be that much longer before a release.

> I could really use some means of telling DBD-mysql to do the conversions
> outlined above on all my char data.
>

You might be able to use some of the Class::DBI features to get this right
now; or, if you would like (and don't mind using pre-alpha code), I could
hack up a patch this weekend that would give you a mysql_enable_utf8 flag
that if enabled will try and guess the types (as DBD::Pg does right now).

Rudy

Re: What to do with UTF-8 data?

on 11.09.2003 19:10:19 by Steve Hay

Rudy Lippan wrote:

>On Wed, 10 Sep 2003, Steve Hay wrote:
>
>
>
>>Is there any such functionality currently in DBD-mysql, or any chance of
>>it or something similar being added soonish?
>>
>>
>>
>
>Not right now, but I expect this to go into the next non-bugfix version
>which will have more support for the 4.1.x features; and I don't expect it
>to be that much longer before a release.
>
Sadly, I'm only using MySQL 3.23.56. I could/should upgrade to 4.0, but
4.1 is only alpha, so I don't want to go there just yet.

>
>
>
>>I could really use some means of telling DBD-mysql to do the conversions
>>outlined above on all my char data.
>>
>>
>>
>
>You might be able to use some of the Class::DBI features to get this right
>now;
>
I am using Class::DBI, actually, but I couldn't figure out a way to do
this with it. There is a select trigger, but the before_set_* triggers
are per-column, which I'm too lazy to set up.

>or, if you would like (and don't mind using pre-alpha code), I could
>hack up a patch this weekend that would give you a mysql_enable_utf8 flag
>that if enabled will try and guess the types (as DBD::Pg does right now).
>
I like the idea that Tim Bunce suggested the other day, namely:

"a driver private option could be used to indicate that all char fields
are UTF-8, or else have some way of indicating that per-column, such as
$sth->bind_col(1, undef, { mysql_charset => 'utf8' });"

(In fact, he later improved the latter part of that idea to: "something
like ..., { mysql_is_utf8 => 1 });").

I would be particularly interested in the "all char fields are UTF-8" bit.

If you could do something along those lines (for MySQL 3.23) that'd be
great.

(BTW, There has been further discussion about all this on the dbi-users
list since I brought my question over to this list. Take a look at
that, if you haven't seen it already, before you do anything!)

Cheers,
- Steve

Re: What to do with UTF-8 data?

on 11.09.2003 19:37:09 by Rudy Lippan

On Thu, 11 Sep 2003, Steve Hay wrote:

> Rudy Lippan wrote:
> >On Wed, 10 Sep 2003, Steve Hay wrote:
> >
> >>
> >
> >You might be able to use some of the Class::DBI features to get this right
> >now;
> >
> I am using Class::DBI, actually, but I couldn't figure out a way to do
> this with it. There is a select trigger, but the before_set_* triggers
> are per-column, which I'm too lazy to set up.
>

:-)

> >or, if you would like (and don't mind using pre-alpha code), I could
> >hack up a patch this weekend that would give you a mysql_enable_utf8 flag
> >that if enabled will try and guess the types (as DBD::Pg does right now).
> >
> I like the idea that Tim Bunce suggested the other day, namely:
>
> "a driver private option could be used to indicate that all char fields
> are UTF-8, or else have some way of indicating that per-column, such as
> $sth->bind_col(1, undef, { mysql_charset => 'utf8' });"
>
> (In fact, he later improved the latter part of that idea to: "something
> like ..., { mysql_is_utf8 => 1 });").
>
> I would be particularly interested in the "all char fields are UTF-8" bit.
>
> If you could do something along those lines (for MySQL 3.23) that'd be
> great.
>

I was thinking something along the way we went with this for DBD::Pg.
Take a look at the docs for the latest dev version and let me know what
you think -- It is on CPAN.

> (BTW, There has been further discussion about all this on the dbi-users
> list since I brought my question over to this list. Take a look at
> that, if you haven't seen it already, before you do anything!)

I know... I am just a few days behind on my dbi-users.

Rudy

Re: What to do with UTF-8 data?

on 12.09.2003 09:58:22 by Steve Hay

Rudy Lippan wrote:

>On Thu, 11 Sep 2003, Steve Hay wrote:
>
>>>or, if you would like (and don't mind using pre-alpha code), I could
>>>hack up a patch this weekend that would give you a mysql_enable_utf8 flag
>>>that if enabled will try and guess the types (as DBD::Pg does right now).
>>>
>>I like the idea that Tim Bunce suggested the other day, namely:
>>
>>"a driver private option could be used to indicate that all char fields
>>are UTF-8, or else have some way of indicating that per-column, such as
>>$sth->bind_col(1, undef, { mysql_charset => 'utf8' });"
>>
>>(In fact, he later improved the latter part of that idea to: "something
>>like ..., { mysql_is_utf8 => 1 });").
>>
>>I would be particularly interested in the "all char fields are UTF-8" bit.
>>
>>If you could do something along those lines (for MySQL 3.23) that'd be
>>great.
>>
>
>I was thinking something along the way we went with this for DBD::Pg.
>Take a look at the docs for the latest dev version and let me know what
>you think -- It is on CPAN.
>
I've had a look at DBD-Pg-1.31_5 and I see the database handle attribute
pg_enable_utf8 (Boolean) - "If true, then the utf8 flag will be turned
on for returned character data (if the data is valid utf8)."

I think a corresponding mysql_enable_utf8 flag for DBD-mysql would be
ideal for me.

Unless I'm misunderstanding something, this is exactly what Tim's first
proposal above was -- the "all char fields are UTF-8" bit -- except
that it's clearer what it will do when it encounters "invalid" (i.e.
not-UTF-8) data, namely, it just leaves it unflagged.

That sounds excellent because it removes the problem that I was rambling
on about on dbi-users yesterday [see
http://www.xray.mpe.mpg.de/mailing-lists/dbi/2003-09/msg00144.html]: If
you insert "©" into the database and then set the UTF-8 flag when
retrieving it you're in trouble, because that's not UTF-8. You need to
arrange for all data going into the database to be converted to UTF-8 so
that you can be sure that all you need to do when retrieving it later is
turn on the UTF-8 flag. Now, with your subtly more flexible flag, you
don't need to worry: If you insert "©" into the database then that's
exactly what you get back (i.e. not flagged as UTF-8), and if you
convert "©" to UTF-8 using Encode::encode_utf8() before inserting it,
then what you get back is also flagged UTF-8. Splendid.
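
The behaviour described in the pg_enable_utf8 documentation can be imitated in plain Perl. This is a sketch only: flag_if_valid_utf8() is a hypothetical helper, not anything DBD::Pg or DBD-mysql provides, and it uses utf8::decode() to stand in for whatever check the driver performs.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper mimicking the quoted pg_enable_utf8 description:
# return characters if the octets are valid UTF-8, else the octets unchanged.
sub flag_if_valid_utf8 {
    my ($octets) = @_;
    return $octets unless defined $octets;
    my $copy = $octets;
    return utf8::decode($copy) ? $copy : $octets;
}

my $raw     = flag_if_valid_utf8("\xA9");                       # stored as Latin-1
my $encoded = flag_if_valid_utf8(Encode::encode_utf8("\xA9"));  # stored as UTF-8

printf "raw flagged: %s, encoded flagged: %s, same character: %s\n",
    (utf8::is_utf8($raw)     ? 'yes' : 'no'),
    (utf8::is_utf8($encoded) ? 'yes' : 'no'),
    ($raw eq $encoded        ? 'yes' : 'no');
# prints "raw flagged: no, encoded flagged: yes, same character: yes"
```

One caveat with any guessing scheme: some Latin-1 byte sequences happen to be well-formed UTF-8 as well, so a validity check alone cannot always tell which encoding was intended.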

Future enhancements might include the option to have the flag force the
more "strict" behaviour (i.e. it requires that all the data retrieved
_is_ UTF-8 and complains if it isn't), and to make these flags available
on a per-column basis as well, as Tim suggested.
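
Both enhancements can be sketched in user code in the meantime. decode_columns() and its strict option below are hypothetical, invented for illustration; the function works on an already-fetched row (an array reference) rather than hooking into the driver.

```perl
use strict;
use warnings;
use Carp ();

# Hypothetical helper: decode only the listed column indexes of a fetched row.
# With strict => 1 it croaks on octets that are not valid UTF-8.
sub decode_columns {
    my ($row, $utf8_cols, %opt) = @_;
    for my $i (@$utf8_cols) {
        next unless defined $row->[$i];
        my $copy = $row->[$i];
        if (utf8::decode($copy)) {
            $row->[$i] = $copy;          # valid: keep the decoded characters
        }
        elsif ($opt{strict}) {
            Carp::croak("column $i is not valid UTF-8");
        }
    }
    return $row;
}

# Column 0 is declared UTF-8 and gets decoded; column 1 stays as octets.
my $row = decode_columns(["\xC2\xA9", "\xC2\xA9"], [0]);

# Strict mode complains about a lone Latin-1 byte.
my $err = eval { decode_columns(["\xA9"], [0], strict => 1); 1 } ? '' : $@;

printf "col0=%d char, col1=%d octets, strict error: %s\n",
    length($row->[0]), length($row->[1]), ($err ? 'yes' : 'no');
```

A row from $sth->fetchrow_arrayref could be passed in the same way, though DBI may reuse that array between fetches, so copying it first would be safer.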

But I don't need either of those enhancements just now. Your
mysql_enable_utf8 flag will do nicely.

I look forward to a release with that in it!

Thanks,
- Steve

Re: What to do with UTF-8 data?

on 06.10.2003 12:16:18 by Steve Hay

Hi Rudy,

Did you get round to looking at adding a mysql_enable_utf8 flag to
DBD::mysql yet?

- Steve

Steve Hay wrote:

> Rudy Lippan wrote:
>
>> I was thinking something along the way we went with this for
>> DBD::Pg. Take a look at the docs for the latest dev version and let
>> me know what you think -- It is on CPAN.
>>
> I've had a look at DBD-Pg-1.31_5 and I see the database handle
> attribute pg_enable_utf8 (Boolean) - "If true, then the utf8 flag will
> be turned on for returned character data (if the data is valid utf8)."
>
> I think a corresponding mysql_enable_utf8 flag for DBD-mysql would be
> ideal for me.
>
> Unless I'm misunderstanding something, this is exactly what Tim's
> first proposal above was -- the "all char fields are UTF-8" bit --
> except that it's clearer what it will do when it encounters "invalid"
> (i.e. not-UTF-8) data, namely, it just leaves it unflagged.
>
> That sounds excellent because it removes the problem that I was
> rambling on about on dbi-users yesterday [see
> http://www.xray.mpe.mpg.de/mailing-lists/dbi/2003-09/msg00144.html]:
> If you insert "©" into the database and then set the UTF-8 flag when
> retrieving it you're in trouble, because that's not UTF-8. You need
> to arrange for all data going into the database to be converted to
> UTF-8 so that you can be sure that all you need to do when retrieving
> it later is turn on the UTF-8 flag. Now, with your subtly more
> flexible flag, you don't need to worry: If you insert "©" into the
> database then that's exactly what you get back (i.e. not flagged as
> UTF-8), and if you convert "©" to UTF-8 using Encode::encode_utf8()
> before inserting it, then what you get back is also flagged UTF-8.
> Splendid.
>
> Future enhancements might include the option to have the flag force
> the more "strict" behaviour (i.e. it requires that all the data
> retrieved _is_ UTF-8 and complains if it isn't), and to make these
> flags available on a per-column basis as well, as Tim suggested.
>
> But I don't need either of those enhancements just now. Your
> mysql_enable_utf8 flag will do nicely.
>
> I look forward to a release with that in it!
>
> Thanks,
> - Steve
