UTF-8 html entity decoding
on 02.11.2007 21:02:01 by offsky
I have a string that has UTF-8 characters encoded using html
entities. For example the string "é 字" is being encoded as
"&#233; &#23383;". I have no control over how this string is given to me, so
I need to figure out a way to decode "&#233; &#23383;" back into "é 字".
I have already tried urldecode, html_entity_decode, utf8_decode and
convert_uudecode without success. My server environment is limited to
the latest version of PHP 4, so I can't use any PHP 5 stuff.
Anyone have suggestions?
Re: UTF-8 html entity decoding
on 02.11.2007 21:42:20 by darko
On Nov 2, 9:02 pm, Jake wrote:
> I have a string that has UTF-8 characters encoded using html
> entities. For example the string "é 字" is being encoded as
> "&#233; &#23383;". I have no control over how this string is given to me, so
> I need to figure out a way to decode "&#233; &#23383;" back into "é 字".
>
> I have already tried urldecode, html_entity_decode, utf8_decode and
> convert_uudecode without success. My server environment is limited to
> the latest version of PHP 4, so I can't use any PHP 5 stuff.
>
> Anyone have suggestions?
Here's the sample from php.net's page about utf8_encode
(http://www.php.net/manual/en/function.utf8-encode.php), thanks to a certain
luka8088:
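// Replace every decimal numeric entity (3 to 10 digits) with UTF-8 bytes.
// The /e modifier evaluates the replacement string as PHP code.
function html_to_utf8 ($data)
{
    return preg_replace("/\\&\\#([0-9]{3,10})\\;/e", '_html_to_utf8("\\1")', $data);
}

// Encode a single code point, passed as a decimal string, as UTF-8.
function _html_to_utf8 ($data)
{
    if ($data > 127)
    {
        $i = 5;
        while (($i--) > 0)
        {
            // Find the highest non-zero 6-bit group of the code point.
            // Note: this test picks too few bytes for 2048-4095 and 65536-262143.
            if ($data != ($a = $data % ($p = pow(64, $i))))
            {
                // Lead byte: $i + 1 one-bits padded to 8 bits, plus the top bits.
                $ret = chr(base_convert(str_pad(str_repeat(1, $i + 1), 8, "0"), 2, 10) + (($data - $a) / $p));
                // Continuation bytes: 10xxxxxx for each remaining 6-bit group.
                for ($i; $i > 0; $i--)
                    $ret .= chr(128 + ((($data % pow(64, $i)) - ($data % ($p = pow(64, $i - 1)))) / $p));
                break;
            }
        }
    } else
        // Code points up to 127 are left as the original entity.
        $ret = "&#$data;";
    return $ret;
}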
Example:
echo html_to_utf8("a b &#269; &#263; &#382; &#12371; &#12395; &#12385; &#12431; ()[]{}!#$?* < >");
Output:
a b č ć ž こ に ち わ ()[]{}!#$?* < >
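The arithmetic in _html_to_utf8 is fairly dense, so it may help to check one code point by hand. The sketch below is only an illustration, not part of luka8088's sample: it takes 23383 (the 字 from the original question), assumes the standard three-byte UTF-8 layout for code points between 0x800 and 0xFFFF, and splits the value into a 4-bit group and two 6-bit groups. The variable names are made up for the example.

```php
<?php
// Worked example: encode code point 23383 (U+5B57) by hand.
$cp = 23383;

// Three-byte layout: 1110xxxx 10xxxxxx 10xxxxxx
$lead  = 0xE0 | ($cp >> 12);          // top 4 bits
$mid   = 0x80 | (($cp >> 6) & 0x3F);  // middle 6 bits
$trail = 0x80 | ($cp & 0x3F);         // low 6 bits

$utf8 = chr($lead) . chr($mid) . chr($trail);
echo bin2hex($utf8), "\n"; // prints e5ad97
```

The bytes E5 AD 97 are the UTF-8 encoding of 字, which is what the decoder should emit for &#23383;.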
Cheers
Re: UTF-8 html entity decoding
on 03.11.2007 19:14:16 by offsky
Thanks for that. I independently found another solution:
// Return the UTF-8 encoding of code point $c, or false if it is out of range.
function unichr($c) {
    if ($c <= 0x7F) {
        return chr($c);
    } else if ($c <= 0x7FF) {
        return chr(0xC0 | $c >> 6) . chr(0x80 | $c & 0x3F);
    } else if ($c <= 0xFFFF) {
        return chr(0xE0 | $c >> 12) . chr(0x80 | $c >> 6 & 0x3F)
             . chr(0x80 | $c & 0x3F);
    } else if ($c <= 0x10FFFF) {
        return chr(0xF0 | $c >> 18) . chr(0x80 | $c >> 12 & 0x3F)
             . chr(0x80 | $c >> 6 & 0x3F)
             . chr(0x80 | $c & 0x3F);
    } else {
        return false;
    }
}
// html_entity_decode() runs first; each remaining decimal numeric entity
// is then converted to UTF-8 by unichr().
$text = preg_replace('~&#([0-9]+);~e', 'unichr("\\1")',
    html_entity_decode($text));
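
For anyone who would rather not rely on the /e modifier, here is a rough sketch of the same idea using preg_replace_callback, which the manual lists as available from PHP 4.0.5. It is an untested-on-PHP-4 variant, not the code from this thread: the function names codepoint_to_utf8, decode_numeric_entity_cb and decode_numeric_entities are made up for the example, and it additionally assumes you want hexadecimal entities (&#x5B57;) handled and surrogate code points rejected.

```php
<?php
// Hypothetical helper: UTF-8 bytes for a code point, or false if not encodable.
function codepoint_to_utf8($c)
{
    $c = (int) $c;
    if ($c < 0 || $c > 0x10FFFF || ($c >= 0xD800 && $c <= 0xDFFF)) {
        return false; // out of range, or a surrogate
    }
    if ($c <= 0x7F) {
        return chr($c);
    }
    if ($c <= 0x7FF) {
        return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    }
    if ($c <= 0xFFFF) {
        return chr(0xE0 | ($c >> 12)) . chr(0x80 | (($c >> 6) & 0x3F))
             . chr(0x80 | ($c & 0x3F));
    }
    return chr(0xF0 | ($c >> 18)) . chr(0x80 | (($c >> 12) & 0x3F))
         . chr(0x80 | (($c >> 6) & 0x3F)) . chr(0x80 | ($c & 0x3F));
}

// Callback: $m[1] holds hex digits (empty for decimal), $m[2] holds decimal digits.
function decode_numeric_entity_cb($m)
{
    $cp = ($m[1] !== '') ? hexdec($m[1]) : (int) $m[2];
    $bytes = codepoint_to_utf8($cp);
    return ($bytes === false) ? $m[0] : $bytes; // leave bad entities untouched
}

// Hypothetical entry point: decode decimal and hex numeric entities to UTF-8.
function decode_numeric_entities($text)
{
    return preg_replace_callback(
        '~&#(?:[xX]([0-9a-fA-F]{1,6})|([0-9]{1,7}));~',
        'decode_numeric_entity_cb',
        $text
    );
}

echo bin2hex(decode_numeric_entities('&#233; &#23383;')), "\n"; // prints c3a920e5ad97
```

Named entities such as &lt; are left alone here; if they need decoding too, html_entity_decode can still be applied first, as in the one-liner above.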