Recognizing utf8 encoded data in latin1 fields/tables

am 08.05.2011 20:29:56 von Jigal van Hemert

Hi,

The TYPO3 CMS I'm working on uses UTF-8 database fields for some time
now by default. There are sometimes old installation, which have been
updated without properly converting the database. The result: UTF-8
encoded data in (most often) latin1 tables/fields.

I have a conversion script which analyses the table definitions and uses
the "trick" of two alter table operations (first to the binary
equivalent of the column type and then to the normal type with the utf8
charset) to convert the data to the correct character set.

It would be nice to be able to detect this situation using queries only
(faster than transferring the data into the PHP script and analysing it
there).

I have been fiddling a bit with a few columns:
test: latin1 (latin1-swedish-ci) contains UTF-8 encoded data
test1: latin1 (latin1-swedish-ci) contains latin1 encoded data

test: LandrÃ«Ã©Ã¼Ã¶Ã¯Ã
CONVERT(BINARY `test` USING utf8): Landrëéüöïß
CONVERT(`test` USING utf8) : LandrÃ«Ã©Ã¼Ã¶Ã¯Ã

test1: Landrëéüöïß
CONVERT(BINARY `test1` USING utf8) : Landr
CONVERT(`test1` USING utf8) : Landrëéüöïß

I'm now looking for an expression which can differentiate between the
two situations if possible without having to look for all possible
combinations of the encoded data.

--
Kind regards / met vriendelijke groet,

Jigal van Hemert.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org