blessing db data as utf8

on 09.06.2004 15:01:09 by Gaal Yahas

Hello,

My data is stored on a mysql 4.0.20 server, in utf8. The database doesn't
know it is utf8; as far as I could tell this version doesn't have full
support for setting unicode charsets yet (please correct me if I'm wrong).

The problem is that when I fetch the data with DBD::mysql, perl doesn't
mark it as utf8, resulting in garbage data and forcing me to use workarounds
such as calling Encode::_utf8_on on my strings.
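
To make the symptom concrete, here is a minimal C sketch of why an unflagged string looks too long. It is not DBD::mysql code, and the helper name utf8_char_count is made up for illustration. It assumes the buffer already holds valid UTF-8 and counts characters by skipping continuation bytes (those of the form 10xxxxxx).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Count code points in a UTF-8 buffer by counting every byte that is
 * NOT a continuation byte (10xxxxxx). Assumes the input is valid UTF-8;
 * this is an illustration, not a validator. */
size_t utf8_char_count(const char *s, size_t len)
{
    size_t i, chars = 0;

    assert(s != NULL);
    for (i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            chars++;
    }
    return chars;
}
```

For the three Hebrew letters alef, lamed, yod (bytes D7 90 D7 9C D7 99), strlen() reports 6 while utf8_char_count() reports 3. That is the same gap perl shows between byte semantics and character semantics when the utf8 flag is missing.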

I'm unsure about the best place to do this, but it turns out that DBD::Pg
has addressed the same problem for Postgres as a global perl-dbh switch.

What do the maintainers of DBD::mysql say? Should the same style of fix
be added to DBD::mysql? I'm willing to work on a patch if nobody else steps
forward.

Gaal

PS: The forwarded message below is from the Class::DBI-talk mailing list.

----- Forwarded message from Dominic Mitchell -----

Message-ID: <40C6FBC2.8070901@semantico.com>
Date: Wed, 09 Jun 2004 13:00:02 +0100
From: Dominic Mitchell

Gaal Yahas wrote:
>On Wed, Jun 09, 2004 at 01:23:49PM +0200, Andreas Fromm wrote:
>
>>>The problem is that the database doesn't know better: as far as it is
>>>concerned, the data is (say) latin1. This is true at least for mysql
>>>4.0.20
>>>which I have been using. This means that the metadata about encoding type
>>>of a table or a column can't come from the database, even though ideally
>>>it should. I don't know DBIx::ContextualFetch to say, but DBI seems at the
>>>moment to be too low-level for this kind of knowledge.
>>>
>>>That said, *my* data is all utf8, so I don't mind a global switch :)
>>>
>>
>>What about PostgreSQL, where you tell the server at database-creation how
>>to encode the data? When I create a db with unicode encoding, it
>>_should_ know about the encoding, shouldn't it?
>
>I don't have pg to test this with, but it seems you are luckier,
>and that in that database's case, perhaps the right place to fix this
>would be DBD::Pg--or maybe Class::DBI::Pg?

DBD::Pg can already support utf8. I patched it last year, and versions
1.22 onwards support a $dbh->{pg_enable_utf8} attribute. It's a bit
kludgy (ignores database encoding) and I don't know which direction Unicode
support in DBI is going in generally, so it may change in the future.
But this works now.

-Dom

--
| Semantico: creators of major online resources |
| URL: http://www.semantico.com/ |
| Tel: +44 (1273) 722222 / Fax: +44 (1273) 723232 |
| Address: 33 Bond St., Brighton, Sussex, BN1 1RD, UK. |

----- End forwarded message -----

--
Gaal Yahas
http://gaal.livejournal.com/

--
MySQL Perl Mailing List
For list archives: http://lists.mysql.com/perl
To unsubscribe: http://lists.mysql.com/perl?unsub=gcdmp-msql-mysql-modules@m .gmane.org

Re: blessing db data as utf8

on 09.06.2004 16:05:51 by Jochen Wiedmann

On Wed, 2004-06-09 at 15:01, Gaal Yahas wrote:

> I'm unsure about the best place to do this, but it turns out that DBD::Pg
> has addressed the same problem for Postgres as a global perl-dbh switch.

Just one comment: A dbh property seems more sensible to me. Perhaps even
inherited by the sth. As far as I know it is possible to choose the
encoding by table. In other words: It might become important to read
different encodings within one transaction.

Jochen

Re: blessing db data as utf8

on 09.06.2004 16:38:33 by Gaal Yahas

On Wed, Jun 09, 2004 at 04:05:51PM +0200, Jochen Wiedmann wrote:
> > I'm unsure about the best place to do this, but it turns out that DBD::Pg
> > has addressed the same problem for Postgres as a global perl-dbh switch.
>
> Just one comment: A dbh property seems more sensible to me. Perhaps even
> inherited by the sth. As far as I know it is possible to choose the
> encoding by table. In other words: It might become important to read
> different encodings within one transaction.

Oops, I meant "global per-dbh" switch. (Anyone know how to turn off
auto-complete in the fingers?) For a first patch, would you find it
agreeable to have dbh level control? That's what I need pretty urgently,
anyway, and I'm surprised people haven't been asking for it already...

--
Gaal Yahas
http://gaal.livejournal.com/

Re: blessing db data as utf8

on 09.06.2004 16:45:30 by Jochen Wiedmann

On Wed, 2004-06-09 at 16:38, Gaal Yahas wrote:

> Oops, I meant "global per-dbh" switch. (Anyone know how to turn off
> auto-complete in the fingers?) For a first patch, would you find it
> agreeable to have dbh level control? That's what I need pretty urgently,
> anyway, and I'm surprised people haven't been asking for it already...

What the heck is "global per-dbh"? This seems a contradiction to me. :-)

Re: blessing db data as utf8

on 09.06.2004 16:56:37 by Gaal Yahas

On Wed, Jun 09, 2004 at 04:45:30PM +0200, Jochen Wiedmann wrote:
> > Oops, I meant "global per-dbh" switch. (Anyone know how to turn off
> > auto-complete in the fingers?) For a first patch, would you find it
> > agreeable to have dbh level control? That's what I need pretty urgently,
> > anyway, and I'm surprised people haven't been asking for it already...
>
> What the heck is "global per-dbh"? This seems a contradiction to me. :-)

How about "of dbh scope"? A dbh property, in short.

--
Gaal Yahas
http://gaal.livejournal.com/

[PATCH] Re: blessing db data as utf8

on 09.06.2004 22:01:03 by Gaal Yahas

On Wed, Jun 09, 2004 at 04:01:09PM +0300, Gaal Yahas wrote:
> What do the maintainers of DBD::mysql say? Should the same style of fix
> be added to DBD::mysql? I'm willing to work on a patch if nobody else steps
> forward.

Patch follows. This works for me; thanks to Dominic Mitchell
for the Pg version this is based on.

--
Gaal Yahas
http://gaal.livejournal.com/

diff -uraN -X /home/roo/diff-exclude /home/roo/.cpan/build/DBD-mysql-2.9003/dbdimp.c ../DBD-mysql-2.9003/dbdimp.c
--- /home/roo/.cpan/build/DBD-mysql-2.9003/dbdimp.c 2003-10-17 19:20:50.000000000 +0200
+++ ../DBD-mysql-2.9003/dbdimp.c 2004-06-09 22:15:03.000000000 +0300
@@ -848,6 +848,9 @@
imp_dbh->has_transactions = TRUE;
imp_dbh->auto_reconnect = FALSE; /* Safer we flip this to TRUE perl side
if we detect a mod_perl env. */
+#ifdef is_utf8_string
+ imp_dbh->enable_utf8 = FALSE; /* initialize mysql_enable_utf8 */
+#endif

DBIc_set(imp_dbh, DBIcf_AutoCommit, &sv_yes);
if (sv && SvROK(sv)) {
@@ -1333,6 +1336,10 @@
/*XXX: Does DBI handle the magic ? */
imp_dbh->auto_reconnect = bool_value;
/* imp_dbh->mysql.reconnect=0; */
+#ifdef is_utf8_string
+ } else if (strEQ(key, "mysql_enable_utf8")) {
+ imp_dbh->enable_utf8 = bool_value;
+#endif
} else {
return FALSE;
}
@@ -1413,6 +1420,8 @@
/* Obsolete, as of 2.09! */
const char* msg = mysql_error(&imp_dbh->mysql);
result = sv_2mortal(newSVpv(msg, strlen(msg)));
+ } else if (strEQ(key, "enable_utf8")) {
+ result = sv_2mortal(newSViv(imp_dbh->enable_utf8));
}
break;
case 'd':
@@ -1748,7 +1757,14 @@
*
************************************************************ **************/

+int is_high_bit_set(char *val) {
+ while (*val++)
+ if (*val & 0x80) return 1;
+ return 0;
+}
+
AV* dbd_st_fetch(SV* sth, imp_sth_t* imp_sth) {
+ D_imp_dbh_from_sth;
int num_fields;
int ChopBlanks;
int i;
@@ -1797,6 +1813,12 @@
}

sv_setpvn(sv, col, len);
+
+#ifdef is_utf8_string
+ if (imp_dbh->enable_utf8 &&
+ is_high_bit_set(col) && is_utf8_string(col, len))
+ SvUTF8_on(sv);
+#endif
} else {
(void) SvOK_off(sv); /* Field is NULL, return undef */
}
diff -uraN -X /home/roo/diff-exclude /home/roo/.cpan/build/DBD-mysql-2.9003/dbdimp.h ../DBD-mysql-2.9003/dbdimp.h
--- /home/roo/.cpan/build/DBD-mysql-2.9003/dbdimp.h 2003-10-17 19:20:50.000000000 +0200
+++ ../DBD-mysql-2.9003/dbdimp.h 2004-06-09 22:06:06.000000000 +0300
@@ -114,6 +114,9 @@
unsigned int auto_reconnects_ok;
unsigned int auto_reconnects_failed;
} stats;
+#ifdef is_utf8_string
+ bool enable_utf8; /* should we attempt to make utf8 strings? */
+#endif
};

diff -uraN -X /home/roo/diff-exclude /home/roo/.cpan/build/DBD-mysql-2.9003/lib/DBD/mysql.pm ../DBD-mysql-2.9003/lib/DBD/mysql.pm
--- /home/roo/.cpan/build/DBD-mysql-2.9003/lib/DBD/mysql.pm 2003-10-27 05:26:08.000000000 +0200
+++ ../DBD-mysql-2.9003/lib/DBD/mysql.pm 2004-06-09 22:54:21.000000000 +0300
@@ -867,6 +867,18 @@
AutoCommit is turned off, and when AutoCommit is turned off, DBD::mysql will
not automatically reconnect to the server.

+=item mysql_enable_utf8
+
+This attribute determines whether DBD::mysql should assume strings stored
+in the database are utf8. This feature defaults to off. When set, and if
+a retrieved string validates as utf8, then the magic flag on the string
+is turned on, making perl use character semantics on it. You need to
+turn this on if you store your data as utf8; otherwise you may notice
+that although data is displayed correctly when retrieved, length()
+returns results that are too large.
+
+This option is experimental and may change in future versions.
+
=head1 STATEMENT HANDLES

The statement handles of DBD::mysql support a number

Re: [PATCH] Re: blessing db data as utf8

on 10.06.2004 12:07:54 by Jochen Wiedmann

Good patch, in particular because it includes the docs!

Jochen

Re: [PATCH] Re: blessing db data as utf8

on 10.06.2004 12:26:04 by Gaal Yahas

On Thu, Jun 10, 2004 at 12:07:54PM +0200, Jochen Wiedmann wrote:
>
> Good patch, in particular because it includes the docs!

Thanks. Here's a test script I've been using that may be seriousified
and ported into the DBD test suite. I didn't do that last step myself
because I was unfamiliar with the idioms.

--
Gaal Yahas
http://gaal.livejournal.com/

#!/usr/bin/perl -w
# make sure utf8 patch is applied to current DBD::mysql

use strict;
use charnames ':full';
use DBI;

my $dbh = DBI->connect("DBI:mysql:database=test", "", "",
{mysql_enable_utf8=>1}) or die "dbi: $DBI::errstr";

$dbh->{mysql_enable_utf8} or die "couldn't init mysql_enable_utf8";

# uncomment this for proof this whole feature is necessary
#$dbh->{mysql_enable_utf8} = 0; print "Your test WILL fail, you know.\n";

$dbh->do("DROP TABLE IF EXISTS test_u8");
$dbh->do(<<DML);
CREATE TABLE test_u8 (
id integer,
name varchar(40)
);
DML

# "Eli".
my $name =
"\N{HEBREW LETTER ALEF}\N{HEBREW LETTER LAMED}\N{HEBREW LETTER YOD}";
length $name == 3 or die "your perl sucks. $name isn't in utf8?";

my $sth = $dbh->prepare("INSERT INTO test_u8 (id, name) values (?, ?)");
$sth->execute(1, $name);

my @row = $dbh->selectrow_array("SELECT id, name FROM test_u8;");
die "number didn't stay the same!?!?!!" unless $row[0] == 1;
die "sorry, utf8 discipline failed: len != 3" unless length $row[1] == 3;
die "sorry, utf8 discipline failed: strcmp" unless $row[1] eq $name;

print "yay, you are teh utf king!!!!11\n";
$dbh->do("DROP TABLE test_u8");

Re: [PATCH] Re: blessing db data as utf8

on 10.06.2004 12:45:07 by Steve Hay

Jochen Wiedmann wrote:

>Good patch, in particular because it includes the docs!
>
Rudy Lippan already did something similar to this, but never released it
to CPAN. Here are links to a couple of threads discussing it:

http://marc.theaimsgroup.com/?t=106321975700026&r=1&w=2
http://www.mail-archive.com/dbi-users@perl.org/msg18360.html

He sent me a tarball containing his patched version, which he called
2.9003_2. I still have it if anybody wants it.

I haven't actually started using it yet. I want to see how things turn
out with all the new character set support stuff in 4.1.x. Getting
UTF-8 data stored in the database back into properly flagged Perl
strings without mangling anything is only part of the problem. How do
you perform SQL SELECTs on such data in the database without the
database understanding that the bytes it is storing are UTF-8 characters?

Are there any plans to have DBD::mysql hook into the new charset stuff
in 4.1.x and (for example) automatically handle UTF-8 data properly for
columns/tables/databases marked within MySQL itself as UTF-8?

- Steve

------------------------------------------------
Radan Computational Ltd.

The information contained in this message and any files transmitted with it are confidential and intended for the addressee(s) only. If you have received this message in error or there are any problems, please notify the sender immediately. The unauthorized use, disclosure, copying or alteration of this message is strictly forbidden. Note that any views or opinions presented in this email are solely those of the author and do not necessarily represent those of Radan Computational Ltd. The recipient(s) of this message should check it and any attached files for viruses: Radan Computational will accept no liability for any damage caused by any virus transmitted by this email.

Re: [PATCH] Re: blessing db data as utf8

on 10.06.2004 20:17:28 by Gaal Yahas

On Thu, Jun 10, 2004 at 11:45:07AM +0100, Steve Hay wrote:

> Getting UTF-8 data stored in the database back into properly flagged Perl
> strings without mangling anything is only part of the problem. How do
> you perform SQL SELECTs on such data in the database without the
> database understanding that the bytes it is storing are UTF-8 characters?

Having the database knowing about utf8 is (very) nice to have, but it
isn't essential. '=' and LIKE should continue to work thanks to the
cleverness of utf8; of course, collating and therefore ORDER BY won't
work correctly either, and the sizes the database knows about will all
be in bytes instead of characters. Bothersome but not insurmountable. :)
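
As a rough illustration of the ORDER BY point, here is a C sketch that sorts UTF-8 strings with a plain byte comparison. It assumes the server falls back to something like strcmp(); the collation actually in use on a given MySQL 4.0 setup may behave differently. The names cmp_bytes and sort_bytewise are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator: compare two C strings byte by byte.
 * strcmp() compares bytes as unsigned char, so for valid UTF-8 the
 * result follows code point order, not any language's collation. */
static int cmp_bytes(const void *pa, const void *pb)
{
    const char *a = *(const char *const *)pa;
    const char *b = *(const char *const *)pb;
    return strcmp(a, b);
}

/* Sort an array of NUL-terminated UTF-8 strings in byte order. */
void sort_bytewise(const char **items, size_t n)
{
    assert(items != NULL);
    qsort(items, n, sizeof items[0], cmp_bytes);
}
```

Sorting "éclair" (UTF-8 lead byte C3), "zebra" and "apple" this way gives apple, zebra, éclair: every non-ASCII character lands after all of ASCII, which is rarely what a reader of the sorted list expects.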

--
Gaal Yahas
http://gaal.livejournal.com/

Re: blessing db data as utf8

on 10.06.2004 20:49:57 by Gaal Yahas

[I hope nobody minds that I'm moving this thread to the DBD::mysql list,
because it seems like the best place for it. Please drop cdbi-talk
from replies.]

On Thu, Jun 10, 2004 at 07:01:30PM +0100, Tim Bunce wrote:
> On Thu, Jun 10, 2004 at 12:18:42PM +0300, Gaal Yahas wrote:
> > On Thu, Jun 10, 2004 at 09:51:06AM +0100, Tim Bunce wrote:
> > > This isn't a good way to check for utf8:
> > >
> > > +int is_high_bit_set(char *val) {
> > > + while (*val++)
> > > + if (*val & 0x80) return 1;
> > > + return 0;
> > > +}
> > >
> > > because it makes it hard for any latin-1 data to coexist.
> > > The perl guts probably has a function to check for well-formed utf8
> > > and that should be used instead.
> >
> > This function is only used as an optimization. The actual decision is here:
> >
> > + if (imp_dbh->enable_utf8 &&
> > + is_high_bit_set(col) && is_utf8_string(col, len))
> > + SvUTF8_on(sv);
>
> Ah, okay.
>
> > That said, bad things are going to happen sooner or later if a table has
> > both latin-1 and utf8 data.
>
> I'm thinking more about different fields having either latin-1 or utf8 data.
>
> > But now that I think of it, I'm not sure the call to is_high_bit_set is
> > a good idea there, since SvUTF8_on() on a pure (7 bit) ASCII string
> > shouldn't do any harm
>
> It does add overhead (and is actually harmful on 5.6.x where many
> utf8 bugs lurk) so the check is worthwhile.
>
> > and may even be more correct if the string is later concatenated
> > with utf8 data.
>
> No, perl will do-the-right-thing.

So all in all it sounds like this patch is simple, but correct? Steve
Hay mentioned another similar patch had been written but didn't reach CPAN;
I'd like to encourage the maintainers to release either version :-)
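
For readers following the exchange, the two-step decision (cheap high-bit scan first, full validation second) can be sketched in standalone C as below. This is not the patch itself: has_high_bit, looks_like_utf8 and should_flag_utf8 are hypothetical names, and looks_like_utf8 is a simplified stand-in for perl's is_utf8_string(), stricter than perl in some respects. Unlike the posted is_high_bit_set(), which never tests the first byte and stops at the first NUL, this version takes an explicit length and checks every byte.

```c
#include <assert.h>
#include <stddef.h>

/* Return 1 if any of the first len bytes has its high bit set. */
int has_high_bit(const char *s, size_t len)
{
    size_t i;

    assert(s != NULL);
    for (i = 0; i < len; i++)
        if ((unsigned char)s[i] & 0x80)
            return 1;
    return 0;
}

/* Simplified UTF-8 well-formedness check (stand-in for perl's
 * is_utf8_string). Rejects stray continuation bytes, truncated
 * sequences, overlong forms, surrogates and values above U+10FFFF. */
int looks_like_utf8(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;

    assert(s != NULL);
    while (i < len) {
        unsigned char c = p[i];
        unsigned long cp;
        size_t need, k;

        if (c < 0x80) { i++; continue; }             /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; }
        else return 0;                               /* invalid lead byte */

        if (len - i <= need)
            return 0;                                /* truncated sequence */
        for (k = 1; k <= need; k++) {
            if ((p[i + k] & 0xC0) != 0x80)
                return 0;                            /* not a continuation */
            cp = (cp << 6) | (p[i + k] & 0x3F);
        }
        if ((need == 1 && cp < 0x80) ||
            (need == 2 && cp < 0x800) ||
            (need == 3 && cp < 0x10000))
            return 0;                                /* overlong encoding */
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;                                /* out of range */
        i += need + 1;
    }
    return 1;
}

/* Flag only strings that contain non-ASCII bytes AND validate. */
int should_flag_utf8(const char *s, size_t len)
{
    return has_high_bit(s, len) && looks_like_utf8(s, len);
}
```

With this shape, pure ASCII such as "abc" is left unflagged (the overhead Tim mentions is avoided), the two bytes D7 90 are flagged, and latin-1 "été" (E9 74 E9) fails validation and is left alone.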

> > I'm not sure what the cleanest way would be to go about this in the
> > long run (whose responsibility it is to say what is and what isn't
> > utf8) but the patch addresses an immediate need for people with
> > utf8-only data. Maybe this problem would go away in mysql 4.1; I'd
> > prefer not to wait.
>
> Something along these lines is needed. But it does require careful thought.

Perhaps the application, or Class::DBI::mysql (which already has some
provisions for similar things) should be responsible for keeping track
of what fields are which charset, with no policy (except a default one)
being enforced on the DBD level. In this scheme the current approach
becomes part of the default handling, so it still makes sense to put it
in now.

--
Gaal Yahas
http://gaal.livejournal.com/

Re: [PATCH] Re: blessing db data as utf8

on 11.06.2004 10:14:21 by Steve Hay

Gaal Yahas wrote:

>On Thu, Jun 10, 2004 at 11:45:07AM +0100, Steve Hay wrote:
>
>
>
>>Getting UTF-8 data stored in the database back into properly flagged Perl
>>strings without mangling anything is only part of the problem. How do
>>you perform SQL SELECTs on such data in the database without the
>>database understanding that the bytes it is storing are UTF-8 characters?
>>
>>
>
>Having the database knowing about utf8 is (very) nice to have, but it
>isn't essential. '=' and LIKE should continue to work thanks to the
>cleverness of utf8;
>
I was concerned that searching for some sequence of bytes that make up a
UTF-8 character might accidentally match in the wrong place, like the
last byte of one character and the first byte of another. Are you
implying that this can't ever happen because of how UTF-8 works? I've
never really looked into the detail of the UTF-8 coding; I've just used
interfaces that manipulate it and took the view that the internals don't
really interest me (and shouldn't do, if I'm doing things properly).

If it's true, then it certainly does alleviate some of the pain.

What about things like UPPER() and LOWER(), though? Presumably they're
not going to work because they'll operate on bytes and completely screw
everything up?

>of course, collating and therefore ORDER BY won't
>work correctly either, and the sizes the database knows about will all
>be in bytes instead of characters. Bothersome but not insurmountable. :)
>
I assume you mean pull the data into Perl, have the data correctly
flagged as UTF-8 there, and doing things like sorting in the Perl code?

I could live with that, but the UPPER()/LOWER() issue is more of a
problem. I make a lot of use of them and it's not so easy to work around
in the Perl.

- Steve

Re: [PATCH] Re: blessing db data as utf8

on 11.06.2004 11:41:30 by Gaal Yahas

On Fri, Jun 11, 2004 at 09:14:21AM +0100, Steve Hay wrote:
> I was concerned that searching for some sequence of bytes that make up a
> UTF-8 character might accidentally match in the wrong place, like the
> last byte of one character and the first byte of another. Are you
> implying that this can't ever happen because of how UTF-8 works? I've
> never really looked into the detail of the UTF-8 coding; I've just used
> interfaces that manipulate it and took the view that the internals don't
> really interest me (and shouldn't do, if I'm doing things properly).
>
> If it's true, then it certainly does alleviate some of the pain.

Yes, utf8 is self-synchronizing. If both the needle and the haystack are
utf8, you don't get false positives even with non-utf8-aware strcmp-like
code.
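
A small C sketch of that property, using nothing but strstr(). The helper names byte_find and is_char_boundary are made up for illustration, and the sketch says nothing about how a particular server collation compares bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Byte offset of the first match of needle in haystack, or -1.
 * This is plain strstr(): no UTF-8 awareness at all. */
long byte_find(const char *haystack, const char *needle)
{
    const char *hit;

    assert(haystack != NULL && needle != NULL);
    hit = strstr(haystack, needle);
    return hit ? (long)(hit - haystack) : -1L;
}

/* Return 1 if the byte at offset starts a character, i.e. it is not
 * a UTF-8 continuation byte (10xxxxxx). */
int is_char_boundary(const char *s, size_t offset)
{
    assert(s != NULL);
    return ((unsigned char)s[offset] & 0xC0) != 0x80;
}
```

Searching alef-lamed-yod (D7 90 D7 9C D7 99) for lamed (D7 9C) finds it at byte offset 2, which is a character boundary. The only way to get a hit at offset 1 is to search for the straddling bytes 90 D7, and that needle is not valid UTF-8 to begin with, since it starts with a continuation byte.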

> What about things like UPPER() and LOWER(), though? Presumably they're
> not going to work because they'll operate on bytes and completely screw
> everything up?

True, that will not work.

> >of course, collating and therefore ORDER BY won't
> >work correctly either, and the sizes the database knows about will all
> >be in bytes instead of characters. Bothersome but not insurmountable. :)
> >
> I assume you mean pull the data into Perl, have the data correctly
> flagged as UTF-8 there, and doing things like sorting in the Perl code?
>
> I could live with that, but the UPPER()/LOWER() issue is more of a
> problem. I make a lot of use of them and it's not so easy to work around
> in the Perl.

You're right, but there's not much we can do about it until the database
supports utf8 natively.

Out of curiosity, where do you make use of this? Case-insensitive lookups
that preserve the original case?

Note that ORDER BY and UPPER()/LOWER() will continue to work on the subset
of your strings that happen to be ASCII, even if some of your data is not.
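
To show what goes wrong outside ASCII, here is a C sketch of a byte-wise lower-casing routine that follows ISO-8859-1 rules. It is an assumption that a server treating the column as latin1 behaves roughly like this; latin1_lower is a hypothetical helper, not MySQL code.

```c
#include <assert.h>
#include <stddef.h>

/* Lower-case a buffer in place using ISO-8859-1 rules, one byte at a
 * time: A-Z and 0xC0-0xDE (except 0xD7, the multiplication sign) are
 * shifted up by 0x20. The routine knows nothing about UTF-8. */
void latin1_lower(unsigned char *buf, size_t len)
{
    size_t i;

    assert(buf != NULL);
    for (i = 0; i < len; i++) {
        unsigned char c = buf[i];
        if ((c >= 'A' && c <= 'Z') ||
            (c >= 0xC0 && c <= 0xDE && c != 0xD7))
            buf[i] = (unsigned char)(c + 0x20);
    }
}
```

ASCII input such as "AbC" comes out as "abc", as expected. The UTF-8 encoding of É (C3 89) comes out as E3 89, because the lead byte C3 is read as the latin-1 letter Ã and shifted. The result is neither é (C3 A9) nor valid UTF-8 at all.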

--
Gaal Yahas
http://gaal.livejournal.com/

Re: [PATCH] Re: blessing db data as utf8

on 11.06.2004 12:03:30 by Steve Hay

Gaal Yahas wrote:

>On Fri, Jun 11, 2004 at 09:14:21AM +0100, Steve Hay wrote:
>
>
>>I was concerned that searching for some sequence of bytes that make up a
>>UTF-8 character might accidentally match in the wrong place, like the
>>last byte of one character and the first byte of another. Are you
>>implying that this can't ever happen because of how UTF-8 works?
>>
>>
>
>Yes, utf8 is self-synchronizing. If both the needle and the haystack are
>utf8, you don't get false positives even with non-utf8-aware strcmp-like
>code.
>
Cool. That's a very useful thing to know.

>>>of course, collating and therefore ORDER BY won't
>>>work correctly either, and the sizes the database knows about will all
>>>be in bytes instead of characters. Bothersome but not insurmountable. :)
>>>
>>>
>>>
>>I assume you mean pull the data into Perl, have the data correctly
>>flagged as UTF-8 there, and doing things like sorting in the Perl code?
>>
>>I could live with that, but the UPPER()/LOWER() issue is more of a
>>problem. I make a lot of use of them and it's not so easy to work around
>>in the Perl.
>>
>>
>
>You're right, but there's not much we can do about it until the database
>supports utf8 natively.
>
>
>Out of curiosity, where do you make use of this? Case-insensitive lookups
>that preserve the original case?
>
Yes, exactly that. I'm dealing with software that indexes various
things read out of XML files into the database. Users can search by
either the data that was extracted or by the filenames. Either way,
they want to do case-insensitive searches, but see the original case in
the results.

This is particularly relevant to the filenames themselves because this
is all on Windows which has a case-insensitive but case-preserving
filesystem.

- Steve
