troubles with complex UTF-8 characters

on 06.10.2009 11:03:12 by Daniel Drake

Hi,

I'm having trouble working with specific UTF-8 characters. For
example, the U+10330 character (UTF8: 0xF0 0x90 0x8C 0xB0).

Background: I am trying to clone Wiktionary onto local intranets in a
series of (disconnected) schools in Nepal. I'm encountering these
problems when trying to import their big db dump, but have narrowed it
down to the simple test case below.

I am using MySQL-5.0.77 client and server on Linux. I know these kinds
of problems are commonly user errors, but I think I've covered all the
bases.

First, my command line environment:
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"

all UTF-8.

snippets of my my.cnf:
[mysqld]
character_set_server=utf8

[mysql]
default-character-set=utf8


inside mysql:
mysql> SHOW VARIABLES LIKE "character\_set\_%";
+--------------------------+--------+
| Variable_name            | Value  |
+--------------------------+--------+
| character_set_client     | utf8   |
| character_set_connection | utf8   |
| character_set_database   | utf8   |
| character_set_filesystem | binary |
| character_set_results    | utf8   |
| character_set_server     | utf8   |
| character_set_system     | utf8   |
+--------------------------+--------+


Hopefully I have convinced you that I am running in a true UTF-8
environment. Now onto the issue: in the above environment I run:

CREATE TABLE dsd (
  `page_id` int(10) unsigned NOT NULL auto_increment,
  `page_title` varchar(255) character set utf8 collate utf8_bin NOT NULL,
  PRIMARY KEY (`page_id`),
  UNIQUE KEY `name_title` (`page_title`)
);

Now I insert one record with a known-working UTF-8 character:

INSERT INTO dsd (page_title) VALUES (0xc2a3);

This is the UK pound sign: £
http://www.fileformat.info/info/unicode/char/00a3/index.htm

Running a SELECT statement shows that this was inserted just fine.


Now the problematic character:

INSERT INTO dsd (page_title) VALUES (0xf0908cb0);

This character is http://www.fileformat.info/info/unicode/char/10330/index.htm

This gives me the warning:
Warning (Code 1366): Incorrect string value: '\xF0\x90\x8C\xB0' for
column 'page_title' at row 1

and results in a zero-length string being inserted instead.

Can anyone else reproduce this? This is definitely a valid UTF-8
character. Why is MySQL rejecting it?
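
The validity claim can be checked independently of MySQL. The following is a minimal Python sketch, not part of the original test case; the variable names are illustrative. It decodes the two byte sequences used in the INSERT statements with a strict UTF-8 decoder, which would raise an error on malformed input, and reports how many bytes each character occupies.

```python
# Byte sequences taken from the two INSERT statements in this post.
pound = bytes.fromhex("c2a3")
gothic = bytes.fromhex("f0908cb0")

# Strict decoding raises UnicodeDecodeError if the bytes are not valid UTF-8.
pound_char = pound.decode("utf-8", errors="strict")
gothic_char = gothic.decode("utf-8", errors="strict")

# Report the code point and the encoded length of each character.
for raw, ch in ((pound, pound_char), (gothic, gothic_char)):
    print(f"U+{ord(ch):04X}: {len(raw)} bytes in UTF-8")
```

Both sequences decode cleanly; the only difference visible here is the encoded length (2 bytes versus 4 bytes).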

The same happens if I input the character directly (rather than using
the hex representation) and also if I input that character directly
from a UTF-8 text file. Any ideas?

Thanks,
Daniel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: troubles with complex UTF-8 characters

on 06.10.2009 12:00:24 by Jaime Crespo

On Tuesday, 6 October 2009 11:03:12 Daniel Drake wrote:
> Hi,
>
> I'm having trouble working with specific UTF-8 characters. For
> example, the U+10330 character (UTF8: 0xF0 0x90 0x8C 0xB0).

MySQL currently supports only Basic Multilingual Plane (BMP) characters,
i.e. UTF-8 sequences of up to 3 bytes, in its stable releases.
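
To see which characters fall outside that limit, here is a rough Python sketch. The helper names `is_bmp` and `non_bmp_chars` are made up for illustration and are not part of MySQL or any library. It relies only on the fact that the BMP ends at U+FFFF and that code points above it need 4 bytes in UTF-8.

```python
def is_bmp(ch):
    """Return True if the single character ch lies in the BMP (U+0000..U+FFFF)."""
    return ord(ch) <= 0xFFFF


def non_bmp_chars(text):
    """Return the characters of text that lie outside the BMP, in order."""
    return [ch for ch in text if not is_bmp(ch)]


# The two characters from the test case: U+00A3 and U+10330.
sample = "\u00a3\U00010330"
for ch in sample:
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X} bmp={is_bmp(ch)} utf8_bytes={len(encoded)}")
```

Running `non_bmp_chars` over each page title before an import would show which rows contain characters a 3-byte utf8 column cannot hold.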


AFAIK, the 4-byte encoding feature is planned but not yet released. You could
store non-BMP text in a BLOB (or another binary type), if you do not mind
losing the benefits of character-aware fields (collation, etc.).
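
If you go the binary-column route, one possible way to prepare values is to write them as hex literals, the same form used in the INSERT statements of the original post. This is only a sketch under assumptions: `to_hex_literal` is a hypothetical helper name, and it assumes the target column has already been changed to a binary type.

```python
def to_hex_literal(text):
    """Return text as a hex literal (0x...) of its UTF-8 bytes.

    Assumption: the value is destined for a binary column (e.g. BLOB),
    so the server stores the bytes without character-set validation.
    """
    raw = text.encode("utf-8")
    if not raw:
        # An empty value has no hex digits; fall back to an empty string literal.
        return "''"
    return "0x" + raw.hex()


# U+10330 produces the same literal as in the failing INSERT above.
print(to_hex_literal("\U00010330"))
print(to_hex_literal("\u00a3"))
```

The bytes are stored unchanged, but comparisons and sorting on such a column are byte-wise rather than collation-aware, as noted above.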

--
Jaime Crespo
MySQL & Java Instructor
Warp Networks
http://warp.es
