troubles with complex UTF-8 characters
on 06.10.2009 11:03:12 by Daniel Drake
Hi,
I'm having trouble working with specific UTF-8 characters. For
example, the U+10330 character (UTF8: 0xF0 0x90 0x8C 0xB0).
Background: I am trying to clone wiktionary onto local intranets in a
series of (disconnected) schools in Nepal. I'm encountering these
problems when trying to import their big db dump, but have narrowed it
down to a simple test-case below.
I am using MySQL-5.0.77 client and server on Linux. I know these kinds
of problems are commonly user errors, but I think I've covered all the
bases.
First, my command line environment:
# locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
All UTF-8.
Snippets of my my.cnf:
[mysqld]
character_set_server=utf8
[mysql]
default-character-set=utf8
Inside mysql:
mysql> SHOW VARIABLES LIKE "character\_set\_%";
+--------------------------+--------+
| Variable_name | Value |
+--------------------------+--------+
| character_set_client | utf8 |
| character_set_connection | utf8 |
| character_set_database | utf8 |
| character_set_filesystem | binary |
| character_set_results | utf8 |
| character_set_server | utf8 |
| character_set_system | utf8 |
+--------------------------+--------+
Hopefully I have convinced you that I am running in a true UTF-8
environment. Now onto the issue: in the above environment I run:
CREATE TABLE dsd (
`page_id` int(10) unsigned NOT NULL auto_increment,
`page_title` varchar(255) character set utf8 collate utf8_bin NOT NULL,
PRIMARY KEY (`page_id`),
UNIQUE KEY `name_title` (`page_title`)
);
Now I insert one record with a known-working UTF-8 character:
INSERT INTO dsd (page_title) VALUES (0xc2a3);
This is the UK pound sign: £
http://www.fileformat.info/info/unicode/char/00a3/index.htm
Running a SELECT statement shows that this was inserted just fine.
Now the problematic character:
INSERT INTO dsd (page_title) VALUES (0xf0908cb0);
This character is http://www.fileformat.info/info/unicode/char/10330/index.htm
This gives me the warning:
Warning (Code 1366): Incorrect string value: '\xF0\x90\x8C\xB0' for
column 'page_title' at row 1
and results in a zero-length string being inserted instead.
Can anyone else reproduce this? This is definitely a valid UTF-8
character. Why is MySQL rejecting it?
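As a sanity check on the bytes themselves, independent of MySQL, here is a minimal Python sketch that strictly decodes the two hex literals from the INSERT statements. The names `POUND`, `GOTHIC` and `describe` are made up for this illustration. It only shows that both values are well-formed UTF-8 and how many bytes each one takes; it does not say anything about what the server does with them.

```python
from typing import Tuple

# The two hex literals used in the INSERT statements.
POUND = bytes.fromhex("c2a3")
GOTHIC = bytes.fromhex("f0908cb0")


def describe(raw: bytes) -> Tuple[str, int]:
    """Strictly decode raw as UTF-8 and return (code point label, byte length)."""
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if raw is not valid UTF-8
    if len(text) != 1:
        raise ValueError("expected exactly one character")
    return "U+{:04X}".format(ord(text)), len(raw)


for raw in (POUND, GOTHIC):
    print(describe(raw))
```

One visible difference between the two test values is the encoded length: two bytes for the pound sign, four for U+10330. It may be worth comparing that against the Maxlen column reported by `SHOW CHARACTER SET LIKE 'utf8';` on the server.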
The same happens if I input the character directly (rather than using
the hex representation) and also if I input that character directly
from a UTF-8 text file. Any ideas?
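To see how many rows of a larger import might be affected, something like the following Python sketch could list every character above U+FFFF in a piece of text, on the assumption that it is specifically these characters that trigger the warning. `find_supplementary` and `sample` are hypothetical names for this example, not part of any MySQL tool.

```python
def find_supplementary(lines):
    """Yield (line_number, character, utf8_hex) for each code point above U+FFFF."""
    for number, line in enumerate(lines, start=1):
        for ch in line:
            if ord(ch) > 0xFFFF:
                yield number, ch, ch.encode("utf-8").hex()


# Small in-memory sample: line 1 has the pound sign, line 2 has U+10330.
sample = ["\u00a3 pound sign", "gothic ahsa \U00010330"]
for hit in find_supplementary(sample):
    print(hit)
```

For a real file, the same function can be given an open file object, for example `open(path, encoding="utf-8")` with `path` pointing at the text file in question.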
Thanks,
Daniel
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql