REGEXP and unicode weirdness

REGEXP and unicode weirdness

am 21.01.2010 16:27:34 von John Campbell

I want to find rows that contain a word that matches a term, accent
insensitive: I am using utf8-general collation everywhere.

attempt 1:
SELECT * FROM t WHERE txt LIKE '%que%'
Matches que qué, but also matches 'queue'

attempt 1.5:
SELECT * FROM t WHERE txt LIKE '% que %' OR LIKE 'que %' OR LIKE '% que';
Almost, but misses "que!" or 'que...'

attempt2:
SELECT * FROM t WHERE txt REGEXP '[[:<:]]que[[:>:]]'
Matches que, not queue, but doesn't match qué.

attempt3
SELECT * FROM t WHERE txt REGEXP '[[:<:]]q[uùúûüũ=
ūŭůűųǔǖǘǚǜ ][eèé=
êëēĕėęě][[:>:]]'
Matches que, queue, qué. (I have no idea why this matches queue, but
the Regex behavior is bizarre with unicode.)

Does anyone know why the final regex acts weird? It there a good solution?

Thanks in advance,
John Campbell

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=3Dgcdmg-mysql-2@m.gmane.o rg

Re: REGEXP and unicode weirdness

am 21.01.2010 21:52:23 von Paul DuBois

On Jan 21, 2010, at 9:27 AM, John Campbell wrote:

> I want to find rows that contain a word that matches a term, accent
> insensitive: I am using utf8-general collation everywhere.
>=20
> attempt 1:
> SELECT * FROM t WHERE txt LIKE '%que%'
> Matches que qué, but also matches 'queue'
>=20
> attempt 1.5:
> SELECT * FROM t WHERE txt LIKE '% que %' OR LIKE 'que %' OR LIKE '%=
que';
> Almost, but misses "que!" or 'que...'
>=20
> attempt2:
> SELECT * FROM t WHERE txt REGEXP '[[:<:]]que[[:>:]]'
> Matches que, not queue, but doesn't match qué.
>=20
> attempt3
> SELECT * FROM t WHERE txt REGEXP '[[:<:]]q[uùúû=C3=
¼Å©Å«Å­Å¯Å±Å³Ç”Ç–Ç˜Ç šÇ=
=9C][eèéêëēĕėęě ][[:>:]]'
> Matches que, queue, qué. (I have no idea why this matches que=
ue, but
> the Regex behavior is bizarre with unicode.)
>=20
> Does anyone know why the final regex acts weird? It there a good s=
olution?


http://dev.mysql.com/doc/refman/5.1/en/regexp.html:

Warning
The REGEXP and RLIKE operators work in byte-wise fashion, so they are=
not multi-byte safe and may produce unexpected results with multi-by=
te character sets. In addition, these operators compare characters by=
their byte values and accented characters may not compare as equal e=
ven if a given collation treats them as equal.

--=20
Paul DuBois
Sun Microsystems / MySQL Documentation Team
Madison, Wisconsin, USA
www.mysql.com


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=3Dgcdmg-mysql-2@m.gmane.o rg

Re: REGEXP and unicode weirdness

am 21.01.2010 22:13:03 von Tom Worster

On 1/21/10 10:27 AM, "John Campbell" wrote:

> I want to find rows that contain a word that matches a term, accent
> insensitive: I am using utf8-general collation everywhere.
>=20
> attempt 1:
> SELECT * FROM t WHERE txt LIKE '%que%'
> Matches que qué, but also matches 'queue'
>=20
> attempt 1.5:
> SELECT * FROM t WHERE txt LIKE '% que %' OR LIKE 'que %' OR LIKE '% que';
> Almost, but misses "que!" or 'que...'
>=20
> attempt2:
> SELECT * FROM t WHERE txt REGEXP '[[:<:]]que[[:>:]]'
> Matches que, not queue, but doesn't match qué.
>=20
> attempt3
> SELECT * FROM t WHERE txt REGEXP
> '[[:<:]]q[uùúûüũūŭůűųǔǖǘǚǜ][eèéêëēĕėęě][[:>:]]'
> Matches que, queue, qué. (I have no idea why this matches queue, but
> the Regex behavior is bizarre with unicode.)
>=20
> Does anyone know why the final regex acts weird?

"Warning

"The REGEXP and RLIKE operators work in byte-wise fashion, so they are not
multi-byte safe and may produce unexpected results with multi-byte characte=
r
sets. In addition, these operators compare characters by their byte values
and accented characters may not compare as equal even if a given collation
treats them as equal." -- Mysql 11.4.2


> It there a good solution?

doesn't look like it.

Sphinxsearch might work nicely for you (it does for me) but that may not be
an option for you. i generated a Sphinxsearch charset_table config that
mimics utf8_general_ci collation.



--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=3Dgcdmg-mysql-2@m.gmane.o rg