utf8_decode() and mixed character sets

on 11.10.2009 05:40:59 by James Colannino

Hey everyone. I'd been troubled for a while by the fact that inserting
cut-pasted special characters such as ä caused truncation when passed to
MySQL, then discovered that it was because I was cutting and pasting unicode
values into non-unicode Latin-1 strings.

Since Latin-1 also has equivalent values, I was hoping that filtering my mixed
unicode/non-unicode string through utf8_decode() would solve the problem, but
instead, where the unicode character used to be, I now get a '?', followed by a
few characters being taken out of the middle. I'm guessing that this is because
utf8_decode() assumes the whole string is unicode and therefore removes a bunch
of extra bytes from the string and corrupts it. At least, that's my guess. I
could be very wrong (I have pretty much no experience with different character
sets...)

My question is, what's a good way to translate unicode characters in a
non-unicode string to their Latin-1 equivalents? I need to be able to do this
in order to sanitize a fairly common form of input.

Thanks!

James

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: utf8_decode() and mixed character sets

on 11.10.2009 05:48:31 by Andrew Ballard

On Sat, Oct 10, 2009 at 11:40 PM, James Colannino wrote:
>
> Hey everyone. I'd been troubled for a while by the fact that inserting
> cut-pasted special characters such as ä caused truncation when passed to
> MySQL, then discovered that it was because I was cutting and pasting unicode
> values into non-unicode Latin-1 strings.
>
> Since Latin-1 also has equivalent values, I was hoping that filtering my mixed
> unicode/non-unicode string through utf8_decode() would solve the problem, but
> instead, where the unicode character used to be, I now get a '?', followed by a
> few characters being taken out of the middle. I'm guessing that this is because
> utf8_decode() assumes the whole string is unicode and therefore removes a bunch
> of extra bytes from the string and corrupts it. At least, that's my guess. I
> could be very wrong (I have pretty much no experience with different character
> sets...)
>
> My question is, what's a good way to translate unicode characters in a
> non-unicode string to their Latin-1 equivalents? I need to be able to do this
> in order to sanitize a fairly common form of input.
>
> Thanks!
>
> James

Have you tried iconv or mbstring? Is it an option to update the
database to use UTF-8?
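
A minimal sketch of the mbstring route, assuming the input is mostly Latin-1
with a few pasted UTF-8 sequences in it. The function name mixed_to_latin1()
is hypothetical and not part of PHP. It only rewrites byte runs that look like
well-formed UTF-8 multibyte sequences, so bytes that are already Latin-1 stay
as they are. This is a heuristic: a genuine Latin-1 pair such as "Ã¤" is
byte-identical to a UTF-8 "ä" and would be converted as well.

```php
<?php
// Hypothetical helper (not from the thread): convert embedded UTF-8
// sequences to Latin-1 and leave every other byte untouched.
function mixed_to_latin1(string $input): string
{
    // Byte-wise pattern (no /u flag) for 2-, 3- and 4-byte UTF-8 sequences.
    // It is deliberately loose; mbstring deals with anything malformed.
    $pattern = '/[\xC2-\xDF][\x80-\xBF]'
             . '|[\xE0-\xEF][\x80-\xBF]{2}'
             . '|[\xF0-\xF4][\x80-\xBF]{3}/';

    $result = preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Characters with no Latin-1 equivalent become mbstring's
            // substitute character ('?' unless configured otherwise).
            return (string) mb_convert_encoding($m[0], 'ISO-8859-1', 'UTF-8');
        },
        $input
    );

    // preg_replace_callback() returns null on error; fall back to the input.
    return $result ?? $input;
}

// Usage: a Latin-1 "ä" (0xE4) followed by a pasted UTF-8 "ä" (0xC3 0xA4).
$mixed = "M\xE4dchen / M\xC3\xA4dchen";
$clean = mixed_to_latin1($mixed);
```

If the whole string is known to be UTF-8, a single call to
mb_convert_encoding($input, 'ISO-8859-1', 'UTF-8') or iconv() is enough, and
storing UTF-8 in the database avoids the conversion altogether.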

Andrew

Re: utf8_decode() and mixed character sets

on 11.10.2009 06:16:17 by James Colannino

Andrew Ballard wrote:

> Have you tried iconv or mbstring? Is it an option to update the
> database to use UTF-8?

I'll look into those functions. And, I suppose I could in fact convert my
database to use UTF-8 if necessary.

James
