UTF-8 html entity decoding

on 02.11.2007 21:02:01 by offsky

I have a string that has UTF-8 characters encoded using HTML
entities. For example, the string "é 字" is being encoded as
"&#233; &#23383;". I have no control over how this string is given to me, so
I need to figure out a way to decode "&#233; &#23383;" back into "é 字".

I have already tried urldecode, html_entity_decode, utf8_decode and
convert_uudecode without success. My server environment is limited to
the latest version of PHP 4, so I can't use any PHP 5 features.

Anyone have suggestions?

Re: UTF-8 html entity decoding

on 02.11.2007 21:42:20 by darko

On Nov 2, 9:02 pm, Jake wrote:
> I have a string that has UTF-8 characters encoded using HTML
> entities. For example, the string "é 字" is being encoded as
> "&#233; &#23383;". I have no control over how this string is given to me, so
> I need to figure out a way to decode "&#233; &#23383;" back into "é 字".
>
> I have already tried urldecode, html_entity_decode, utf8_decode and
> convert_uudecode without success. My server environment is limited to
> the latest version of PHP 4, so I can't use any PHP 5 features.
>
> Anyone have suggestions?

Here's a sample from php.net's page on utf8_encode
(http://www.php.net/manual/en/function.utf8-encode.php), courtesy of a
certain luka8088:

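// Replace decimal numeric entities (3 to 10 digits) with UTF-8 bytes.
function html_to_utf8 ($data)
{
    return preg_replace("/\\&\\#([0-9]{3,10})\\;/e", '_html_to_utf8("\\1")', $data);
}

// Encode one code point (decimal digits) as a UTF-8 byte sequence.
function _html_to_utf8 ($data)
{
    if ($data > 127)
    {
        $i = 5;
        while (($i--) > 0)
        {
            // Find the highest power of 64 that fits: $i = number of continuation bytes.
            if ($data != ($a = $data % ($p = pow(64, $i))))
            {
                // Lead byte: ($i + 1) one-bits followed by the top bits of the code point.
                $ret = chr(base_convert(str_pad(str_repeat(1, $i + 1), 8, "0"), 2, 10) + (($data - $a) / $p));
                // Continuation bytes: 10xxxxxx, six bits each.
                for ($i; $i > 0; $i--)
                    $ret .= chr(128 + ((($data % pow(64, $i)) - ($data % ($p = pow(64, $i - 1)))) / $p));
                break;
            }
        }
    } else
        $ret = "&#$data;"; // ASCII range: leave the entity untouched
    return $ret;
}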

Example:
echo html_to_utf8("a b &#269; &#263; &#382; &#12371; &#12395; &#12385; &#12431; ()[]{}!#$?* < >");

Output:
a b č ć ž こ に ち わ ()[]{}!#$?* < >
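
The lead-byte expression in _html_to_utf8() is dense, so here is a small sketch that isolates it. It only reproduces the str_repeat / str_pad / base_convert step from the function above. The $prefixes array is a name introduced here for illustration and is not part of the original sample.

```php
<?php
// Reproduces the lead-byte expression from _html_to_utf8() above.
// $i is the number of continuation bytes that follow the lead byte.
$prefixes = array(); // hypothetical name, for illustration only
for ($i = 1; $i <= 3; $i++) {
    // ($i + 1) one-bits, right-padded with zeros to 8 bits
    $bits = str_pad(str_repeat("1", $i + 1), 8, "0");
    // base_convert() returns a string, so cast it before using it as a number
    $prefixes[$i] = (int) base_convert($bits, 2, 10);
    printf("%d continuation byte(s): %s = 0x%X\n", $i, $bits, $prefixes[$i]);
}
```

This yields the prefixes 0xC0, 0xE0 and 0xF0 for two-, three- and four-byte sequences. For 字 (23383), the loop stops at $i = 2 because 23383 >= 64^2, so the lead byte is 224 + floor(23383 / 4096) = 229 = 0xE5.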

Cheers

Re: UTF-8 html entity decoding

on 03.11.2007 19:14:16 by offsky

Thanks for that. I independently found another solution:

function unichr($c) {
    if ($c <= 0x7F) {
        return chr($c);
    } else if ($c <= 0x7FF) {
        return chr(0xC0 | $c >> 6) . chr(0x80 | $c & 0x3F);
    } else if ($c <= 0xFFFF) {
        return chr(0xE0 | $c >> 12) . chr(0x80 | $c >> 6 & 0x3F)
             . chr(0x80 | $c & 0x3F);
    } else if ($c <= 0x10FFFF) {
        return chr(0xF0 | $c >> 18) . chr(0x80 | $c >> 12 & 0x3F)
             . chr(0x80 | $c >> 6 & 0x3F)
             . chr(0x80 | $c & 0x3F);
    } else {
        return false;
    }
}

$text = preg_replace('~&#([0-9]+);~e', 'unichr("\\1")', html_entity_decode($text));
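
For anyone reusing this later: the /e modifier evaluates the replacement as PHP code, and it was deprecated in PHP 5.5 and removed in PHP 7. Below is a minimal sketch of the same idea using preg_replace_callback, which is also available in PHP 4 (4.0.5 and later). The names codepoint_to_utf8, decode_entity_match and decode_numeric_entities are made up for this example. It handles only decimal numeric entities and skips the html_entity_decode step, so named entities would still need separate handling.

```php
<?php
// Hypothetical helper: same bit layout as unichr() above, with explicit parentheses.
function codepoint_to_utf8($c) {
    $c = (int) $c;
    if ($c <= 0x7F) {
        return chr($c);
    } else if ($c <= 0x7FF) {
        return chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    } else if ($c <= 0xFFFF) {
        return chr(0xE0 | ($c >> 12))
             . chr(0x80 | (($c >> 6) & 0x3F))
             . chr(0x80 | ($c & 0x3F));
    } else if ($c <= 0x10FFFF) {
        return chr(0xF0 | ($c >> 18))
             . chr(0x80 | (($c >> 12) & 0x3F))
             . chr(0x80 | (($c >> 6) & 0x3F))
             . chr(0x80 | ($c & 0x3F));
    }
    return false; // beyond the Unicode range
}

// Callback for preg_replace_callback(): $m[0] is the whole entity, $m[1] the digits.
function decode_entity_match($m) {
    $utf8 = codepoint_to_utf8($m[1]);
    return ($utf8 === false) ? $m[0] : $utf8; // leave out-of-range entities untouched
}

function decode_numeric_entities($text) {
    return preg_replace_callback('~&#([0-9]+);~', 'decode_entity_match', $text);
}

// The example from the first post, checked at byte level.
echo bin2hex(decode_numeric_entities("&#233; &#23383;")), "\n"; // c3a920e5ad97
```

The hex dump is c3 a9 (é), 20 (space) and e5 ad 97 (字), which is the UTF-8 string the original question asked for.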