Detecting The Encoding Of A Text File

on 26.11.2009 05:55:31 by Nitsan Bin-Nun

Hi,

I have been trying for the last couple of hours to determine the
encoding of a text file (.txt on Windows).

I have this code:

$contents = file_get_contents($config['txt_dir'] . $file);
$encoding = mb_detect_encoding($contents, "UTF-8,ISO-8859-1,WINDOWS-1252"); //,Windows-1255

echo "||encoding:".$encoding."||";

if ($encoding == 'UTF-8')
{
    $utfcontents = $contents;
}
else if ($encoding == 'ISO-8859-1')
{
    $utfcontents = utf8_encode($contents);
}

var_dump($utfcontents);

The detected $encoding is ISO-8859-1. The text file contains Hebrew characters,
and I then convert it to UTF-8.

The code above outputs gibberish. It seems to have converted the text in
some way, but not the way it should have.

My page is UTF-8 encoded, without a BOM. I send UTF-8 headers to the browser
and an HTML content-encoding meta tag as well.

I have no idea what I am doing wrong.

I would highly appreciate it if someone could point me in the right
direction.

Thanks in Advance,

Nitsan


Re: Detecting The Encoding Of A Text File

on 26.11.2009 09:17:43 by news.NOSPAM.0ixbtqKe

On Thu, 26 Nov 2009 06:55:31 +0200, Nitsan Bin-Nun wrote:

> The detected $encoding is ISO-8859-1. The text file contains Hebrew characters,
> and I then convert it to UTF-8.
>
> The code above outputs gibberish. It seems to have converted the text in
> some way, but not the way it should have.

If you know that the file contains Hebrew, maybe you should
try converting from ISO-8859-8?
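
A minimal sketch of that suggestion, assuming the file really is
ISO-8859-8 and that the iconv and mbstring extensions are available. The
sample bytes are made up for illustration and do not come from the
original file. It may also explain the gibberish: utf8_encode() always
treats its input as ISO-8859-1, so Hebrew bytes come out as accented
Latin letters.

```php
<?php
// "שלום" as ISO-8859-8 bytes (illustrative sample, not from the thread).
$bytes = "\xF9\xEC\xE5\xED";

// Reading the bytes as ISO-8859-1, which is what utf8_encode() assumes,
// maps every byte to a Latin-1 character.
$asLatin1 = mb_convert_encoding($bytes, 'UTF-8', 'ISO-8859-1');

// Reading the same bytes as ISO-8859-8 yields the Hebrew text.
$asHebrew = iconv('ISO-8859-8', 'UTF-8', $bytes);

echo $asLatin1, "\n"; // ùìåí
echo $asHebrew, "\n"; // שלום
```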


/Nisse

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: Re: Detecting The Encoding Of A Text File

on 26.11.2009 13:32:44 by Nitsan Bin-Nun

Someone has already suggested that, but I haven't tried it yet.

The thing is that right now the file contains Hebrew, but tomorrow it will
be in German or any other accented language.
I'm trying to create a function that detects the encoding and converts
it to UTF-8.

(I don't have much experience in encoding.. :( )
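
One possible shape for such a function, as a sketch only. The name
to_utf8() and its $fallback parameter are hypothetical. The assumption is
that UTF-8 can be recognised reliably (by a BOM or by validation), while
the legacy single-byte encoding has to be supplied by the caller, because
bytes that are valid in ISO-8859-1 are mostly valid in ISO-8859-8 too.

```php
<?php
/**
 * Hypothetical helper: return $bytes as UTF-8.
 * $fallback names the legacy encoding to assume when the data is not
 * UTF-8; it is an assumption the caller has to provide.
 */
function to_utf8(string $bytes, string $fallback = 'ISO-8859-1'): string
{
    // A UTF-8 byte order mark settles the question; drop it and return.
    if (strncmp($bytes, "\xEF\xBB\xBF", 3) === 0) {
        return (string) substr($bytes, 3);
    }

    // Data that validates as UTF-8 is kept unchanged.
    if (mb_check_encoding($bytes, 'UTF-8')) {
        return $bytes;
    }

    // Otherwise convert from the encoding the caller believes is in use.
    $converted = iconv($fallback, 'UTF-8', $bytes);
    if ($converted === false) {
        throw new RuntimeException("Cannot convert from $fallback to UTF-8");
    }
    return $converted;
}
```

With this sketch, a Hebrew file would be read with
to_utf8($contents, 'ISO-8859-8') and a German one with the default
fallback.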


Re: Re: Detecting The Encoding Of A Text File

on 26.11.2009 14:38:35 by Ashley Sheridan

On Thu, 2009-11-26 at 15:39 +0200, דניאל דנון wrote:

> If Windows Notepad can detect the encoding, there must be a way to do it
> yourself.
>
> Maybe try to get the file's headers; I think they should also contain the
> encoding of the file...

A plain text file wouldn't have headers like that, would it? At least,
not in the sense that an image file or an office document has a header.
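
The closest thing a text file can carry is an optional byte order mark
at the very start, which Notepad writes for some Unicode saves. A rough
sketch of checking for one follows; the function name bom_encoding() is
hypothetical, and UTF-32 marks are left out for brevity. Legacy code
pages such as ISO-8859-8 have no such marker, so null is the expected
result for them.

```php
<?php
// Hypothetical helper: name the Unicode encoding signalled by a byte
// order mark, or return null when the data does not start with one.
function bom_encoding(string $bytes): ?string
{
    $boms = [
        "\xEF\xBB\xBF" => 'UTF-8',
        "\xFF\xFE"     => 'UTF-16LE',
        "\xFE\xFF"     => 'UTF-16BE',
    ];

    foreach ($boms as $bom => $name) {
        // Compare only the leading bytes against each known mark.
        if (strncmp($bytes, $bom, strlen($bom)) === 0) {
            return $name;
        }
    }
    return null;
}
```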

Thanks,
Ash
http://www.ashleysheridan.co.uk




Re: Re: Detecting The Encoding Of A Text File

on 26.11.2009 14:39:04 by daniel danon

If Windows Notepad can detect the encoding, there must be a way to do it
yourself.

Maybe try to get the file's headers; I think they should also contain the
encoding of the file...

-- 
Use ROT26 for best security


Re: Re: Detecting The Encoding Of A Text File

on 26.11.2009 15:49:26 by news.NOSPAM.0ixbtqKe

On Thu, 26 Nov 2009 15:39:04 +0200, דניאל דנון wrote:

> If Windows Notepad can detect the encoding, there must be a way to do it
> yourself.
>
> Maybe try to get the file's headers; I think they should also contain the
> encoding of the file...

Plain text files don't have any headers. Perhaps Notepad uses
heuristics, e.g. examining the distribution of characters to
determine a probable encoding.
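
As a very crude illustration of the idea, and not a description of what
Notepad actually does, the sketch below guesses between two single-byte
encodings by comparing how many bytes fall in the ISO-8859-8 Hebrew
letter range with how many are ASCII letters. The function name, the two
candidates and the decision rule are all assumptions made for the
example.

```php
<?php
// Crude illustration only: the candidates and the decision rule are
// assumptions, and short or mixed-language input can fool it.
function guess_legacy_encoding(string $bytes): string
{
    // Bytes 0xE0-0xFA hold the Hebrew alphabet in ISO-8859-8.
    $hebrewLike = preg_match_all('/[\xE0-\xFA]/', $bytes);

    // Western European text consists mostly of ASCII letters, with only
    // an occasional accented character among them.
    $asciiLetters = preg_match_all('/[A-Za-z]/', $bytes);

    return $hebrewLike > $asciiLetters ? 'ISO-8859-8' : 'ISO-8859-1';
}
```

A real detector would need larger samples and per-language frequency
tables; this one only separates "mostly high bytes" from "mostly ASCII".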

[Quick Google...]




/Nisse


Re: Re: Detecting The Encoding Of A Text File

on 26.11.2009 16:45:46 by daniel danon

I was thinking that if Notepad can open it correctly, it must have headers,
but the link you gave clarifies that. My bad.

-- 
Use ROT26 for best security
