How to detect text charset (UTF-8 or Latin-1)

on 15.01.2008 17:59:31 by tarmstrong

Hi.

I'm writing a Perl script that extracts text from a web page using LWP,
and I want to check whether the text is UTF-8 or Latin-1 encoded.

Is there a known function for this? I don't know if "use utf8;" is enough.

Thank you very much in advance.

Re: How to detect text charset (UTF-8 or Latin-1)

on 15.01.2008 18:22:09 by Lawrence Statton

Thomas Armstrong writes:
> Hi.
>
> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?

Considering that each and every series of UTF-8 octets is also a valid
(if nonsensical) series of Latin-1 octets, you are asking a question
that cannot be solved rigorously.

However, there are series of valid Latin-1 octets that are NOT valid
UTF-8 (for example, a lone 0xE9 octet, "é" in Latin-1, followed by an
ASCII letter), so that can be used as a heuristic guide.

Finally, you can devise a heuristic to give a confidence level that a
given series of octets is LIKELY to be UTF-8 or Latin-1 even in the
cases where they are valid in both codings.
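
The validity check described above could be sketched roughly as follows.
This is only a minimal illustration under stated assumptions:
`guess_charset` is a hypothetical helper name, it relies on the strict
'UTF-8' decoder of the core Encode module, and it simply treats anything
that fails to decode as Latin-1. Pure-ASCII input is reported separately
because it is identical in both encodings.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: classify a byte string as 'ascii', 'utf-8' or 'latin-1'.
sub guess_charset {
    my ($octets) = @_;

    # No octet above 0x7F: the text reads the same in both encodings.
    return 'ascii' unless $octets =~ /[\x80-\xFF]/;

    # Strict decode; FB_CROAK dies on malformed input,
    # LEAVE_SRC keeps decode() from modifying $octets.
    my $ok = eval {
        Encode::decode('UTF-8', $octets,
            Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };

    # Assumption: whatever is not valid UTF-8 is taken to be Latin-1.
    return $ok ? 'utf-8' : 'latin-1';
}

print guess_charset("caf\xE9"), "\n";
```

Input that happens to be valid in both encodings (for instance the
Latin-1 text "Ã©", octets 0xC3 0xA9) is reported as utf-8, so this
covers only the validity test, not a confidence score.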

The sad news is, I wrote one that works wonders, but the code is
encumbered.

--
Lawrence Statton - lawrenabae@abaluon.abaom s/aba/c/g
Computer software consists of only two components: ones and
zeros, in roughly equal proportions. All that is required is to
place them into the correct order.

Re: How to detect text charset (UTF-8 or Latin-1)

on 15.01.2008 18:45:03 by smallpond

On Jan 15, 11:59 am, Thomas Armstrong wrote:
> Hi.
>
> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?
>
> Is there any known function? I don't know if "use utf8;" is enough
>
> Thank you very much in advance.


Parse the Content-Type header, for example:
content="text/html; charset=UTF-8"

Unfortunately, web pages that lie about the Content-Type or omit it
are not scarce.
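
A rough sketch of pulling the charset out of such a header value, using
only core Perl. `charset_from_content_type` is a hypothetical helper
name; with LWP the header value would typically come from
`$response->header('Content-Type')`. The regex is an assumption that
covers the common forms, not a full header parser.

```perl
use strict;
use warnings;

# Hypothetical helper: extract the charset parameter from a Content-Type
# value such as 'text/html; charset=UTF-8'. Returns the lowercased
# charset name, or nothing if no charset parameter is present.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;

    # Accept optional whitespace and optional double quotes around the name.
    if ($value =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        return lc $1;
    }
    return;
}

my $charset = charset_from_content_type('text/html; charset=UTF-8');
print defined $charset ? $charset : 'unknown', "\n";
```

Because the header can be wrong or missing, the result is best treated
as a hint to be checked against the octets themselves.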

Re: How to detect text charset (UTF-8 or Latin-1)

on 15.01.2008 20:16:02 by Joe Smith

Thomas Armstrong wrote:

> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?
>
> I don't know if "use utf8;" is enough

It is not appropriate.

The pragma "use utf8" tells the Perl interpreter that the source code of
your program file (including its string literals) is encoded in UTF-8.
It does *not* affect how Perl handles data from external sources.
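
To illustrate the distinction: data from an external source arrives as
octets and has to be decoded explicitly, with or without "use utf8".
A minimal sketch using the core Encode module, with a hard-coded byte
string standing in for a fetched page (an assumption for illustration):

```perl
use strict;
use warnings;
use Encode qw(decode);

# Stand-in for bytes received from the network: "café" encoded as UTF-8.
my $octets = "caf\xC3\xA9";

# Explicit decoding turns the octets into a Perl character string.
my $text = decode('UTF-8', $octets);

printf "%d octets, %d characters\n", length($octets), length($text);
```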

-Joe

Re: How to detect text charset (UTF-8 or Latin-1)

on 15.01.2008 21:53:30 by jurgenex

>Thomas Armstrong wrote:
>> I'm creating a Perl script extracting text from a webpage using LWP,
>> and want to check if text is UTF-8 or Latin-1 encoded?

Check the <meta> tag.

jue

Re: How to detect text charset (UTF-8 or Latin-1)

on 15.01.2008 23:11:23 by Joost Diepenmaat

Jürgen Exner writes:

>>Thomas Armstrong wrote:
>>> I'm creating a Perl script extracting text from a webpage using LWP,
>>> and want to check if text is UTF-8 or Latin-1 encoded?
>
> Check the <meta> tag.
>
> jue

or the xml prolog (if there is one)
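
A rough sketch of looking for a charset declared inside the document
itself, either in an XML prolog or in a <meta> element. `declared_charset`
is a hypothetical helper name, and the regexes are simplifying
assumptions rather than a real HTML or XML parser. It works on the raw
octets because these declarations are plain ASCII in both UTF-8 and
Latin-1.

```perl
use strict;
use warnings;

# Hypothetical helper: return the lowercased charset declared in the
# document, or nothing if no declaration is found.
sub declared_charset {
    my ($octets) = @_;

    # Assumption: the declaration sits within the first 1024 octets.
    my $head = substr($octets, 0, 1024);

    # XML prolog, e.g. <?xml version="1.0" encoding="ISO-8859-1"?>
    return lc $1
        if $head =~ /^\s*<\?xml[^>]*?\bencoding\s*=\s*["']([\w.:-]+)["']/i;

    # <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
    # or the shorter <meta charset="UTF-8">
    return lc $1
        if $head =~ /<meta\b[^>]*?\bcharset\s*=\s*["']?([\w.:-]+)/i;

    return;
}

my $found = declared_charset('<?xml version="1.0" encoding="ISO-8859-1"?><doc/>');
print defined $found ? $found : 'none', "\n";
```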

Joost.