How to detect text charset (UTF-8 or Latin-1)
on 15.01.2008 17:59:31 by tarmstrong
Hi.
I'm writing a Perl script that extracts text from a web page using LWP,
and I want to check whether the text is UTF-8 or Latin-1 encoded.
Is there a known function for this? I don't know if "use utf8;" is enough.
Thank you very much in advance.
Re: How to detect text charset (UTF-8 or Latin-1)
on 15.01.2008 18:22:09 by Lawrence Statton
Thomas Armstrong writes:
> Hi.
>
> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?
Considering that each and every series of UTF-8 octets is also a valid
(if nonsensical) series of Latin-1 octets, you are asking a question
that cannot be answered rigorously.
However, there are series of valid Latin-1 octets that are NOT valid
UTF-8, so a failed UTF-8 validity check can be used as a heuristic guide.
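The validity check described above can be sketched with the core Encode module. This is only an illustration, not the encumbered code mentioned in this post; the function name looks_like_utf8 is made up for the example. It asks Encode for a strict decode and treats any failure as "not UTF-8".

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: returns 1 if the octets form well-formed UTF-8,
# 0 otherwise. Pure ASCII input also returns 1, because ASCII is valid
# in both encodings.
sub looks_like_utf8 {
    my ($octets) = @_;
    # FB_CROAK makes decode() die on the first malformed sequence;
    # LEAVE_SRC keeps decode() from modifying the input string.
    my $ok = eval { decode('UTF-8', $octets, FB_CROAK | LEAVE_SRC); 1 };
    return $ok ? 1 : 0;
}
```

A result of 0 rules UTF-8 out; a result of 1 only says the data could be UTF-8, since the same octets are also legal Latin-1.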
Finally, you can devise a heuristic to give a confidence level that a
given series of octets is LIKELY to be UTF-8 or Latin-1 even in the
cases where they are valid in both codings.
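One way such a confidence heuristic might look, as a rough sketch: count the well-formed multi-byte UTF-8 sequences and the high-bit octets that do not belong to any such sequence. The name guess_charset, the three labels, and the idea of returning raw counts as evidence are all assumptions made for this example, not the author's method.

```perl
use strict;
use warnings;

# Well-formed multi-byte UTF-8 sequences (no overlongs, no surrogates).
my $UTF8_MULTIBYTE = qr/
      [\xC2-\xDF][\x80-\xBF]
    | \xE0[\xA0-\xBF][\x80-\xBF]
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}
    | \xED[\x80-\x9F][\x80-\xBF]
    | \xF0[\x90-\xBF][\x80-\xBF]{2}
    | [\xF1-\xF3][\x80-\xBF]{3}
    | \xF4[\x80-\x8F][\x80-\xBF]{2}
/x;

# Hypothetical helper: takes a string of octets and returns
# ($guess, $multibyte_count, $stray_count).
sub guess_charset {
    my ($octets) = @_;
    my $rest = $octets;
    # s///g returns the number of substitutions (false if there were none).
    my $multibyte = ($rest =~ s/$UTF8_MULTIBYTE//g) || 0;
    # High-bit octets left over are not part of any valid sequence.
    my $stray = () = $rest =~ /[\x80-\xFF]/g;

    return ('ascii',   0,          0)      if !$multibyte && !$stray;
    return ('utf-8',   $multibyte, 0)      if !$stray;
    return ('latin-1', $multibyte, $stray);
}
```

The counts are the interesting part: many multi-byte sequences with no stray octets make UTF-8 very likely, while a single valid sequence in a short string is weaker evidence, because a Latin-1 pair such as "Ã©" produces the same octets.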
The sad news is, I wrote one that works wonders, but the code is
encumbered.
--
Lawrence Statton - lawrenabae@abaluon.abaom s/aba/c/g
Computer software consists of only two components: ones and
zeros, in roughly equal proportions. All that is required is to
place them into the correct order.
Re: How to detect text charset (UTF-8 or Latin-1)
on 15.01.2008 18:45:03 by smallpond
On Jan 15, 11:59 am, Thomas Armstrong wrote:
> Hi.
>
> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?
>
> Is there any known function? I don't know if "use utf8;" is enough
>
> Thank you very much in advance.
Parse the Content-type header, for example:
content="text/html; charset=UTF-8"
Unfortunately, web pages that lie about the Content-type, or omit it
entirely, are not scarce.
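A minimal sketch of pulling the charset parameter out of such a value, using only a regular expression. The function name charset_from_content_type is hypothetical, and the pattern is a simplification that does not cover every legal form of the header.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lower-cased charset parameter of a
# Content-Type value, or nothing (undef in scalar context) if absent.
sub charset_from_content_type {
    my ($value) = @_;
    return unless defined $value;
    # Accept optional whitespace and optional double quotes around the name.
    return lc $1 if $value =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i;
    return;
}
```

With LWP, the value to pass in would come from the response object, for example scalar($response->header('Content-Type')). An undefined result means the server gave no charset, and the content itself has to be inspected.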
Re: How to detect text charset (UTF-8 or Latin-1)
on 15.01.2008 20:16:02 by Joe Smith
Thomas Armstrong wrote:
> I'm creating a Perl script extracting text from a webpage using LWP,
> and want to check if text is UTF-8 or Latin-1 encoded?
>
> I don't know if "use utf8;" is enough
It is not appropriate.
The pragma "use utf8" tells the Perl interpreter that your program file
contains strings encoded in UTF-8 format. It does *not* affect how
Perl handles data from external sources.
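To illustrate the difference, here is a small sketch. It assumes the script file itself is saved as UTF-8, and the two-byte string stands in for data fetched from a server.

```perl
use strict;
use warnings;
use utf8;                 # only says: the literals in THIS file are UTF-8
use Encode qw(decode);

my $literal = "é";              # one character, thanks to "use utf8"
my $octets  = "\xC3\xA9";       # the same letter as two raw bytes, the way
                                # external data arrives
my $text    = decode('UTF-8', $octets);   # external data needs an explicit decode

# Character counts for the literal, the raw octets, and the decoded text.
my @lengths = (length $literal, length $octets, length $text);
```

The raw octets stay two separate "characters" until decode() is called, no matter whether the pragma is in effect.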
-Joe
Re: How to detect text charset (UTF-8 or Latin-1)
on 15.01.2008 21:53:30 by jurgenex
>Thomas Armstrong wrote:
>> I'm creating a Perl script extracting text from a webpage using LWP,
>> and want to check if text is UTF-8 or Latin-1 encoded?
Check the <meta> tag.
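A rough sketch of that check, scanning the raw HTML with regular expressions. The function name charset_from_meta is made up for this example; it assumes the page uses an ASCII-compatible encoding so the tags can be read before decoding, and it is not a substitute for a real HTML parser.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lower-cased charset named in the first
# <meta> tag that carries one, or nothing if no such tag is found.
sub charset_from_meta {
    my ($html) = @_;
    while ($html =~ /<meta\b([^>]*)>/gi) {
        my $attrs = $1;
        # Matches both <meta charset="..."> and
        # <meta http-equiv="Content-Type" content="text/html; charset=...">
        return lc $1
            if $attrs =~ /\bcharset\s*=\s*["']?\s*([A-Za-z0-9._:-]+)/i;
    }
    return;
}
```

As noted earlier in the thread, the declared value can be wrong, so it is best treated as a hint to be checked against the actual octets.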
jue
Re: How to detect text charset (UTF-8 or Latin-1)
on 15.01.2008 23:11:23 by Joost Diepenmaat
Jürgen Exner writes:
>>Thomas Armstrong wrote:
>>> I'm creating a Perl script extracting text from a webpage using LWP,
>>> and want to check if text is UTF-8 or Latin-1 encoded?
>
> Check the <meta> tag.
>
> jue
or the xml prolog (if there is one)
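For XML or XHTML documents, the encoding declaration in the prolog can be read in a similar way. This is a sketch only; the function name charset_from_xml_prolog is hypothetical, and it assumes an ASCII-compatible encoding (optionally preceded by a UTF-8 byte order mark).

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lower-cased encoding named in the XML
# declaration at the very start of the document, or nothing if absent.
sub charset_from_xml_prolog {
    my ($doc) = @_;
    return lc $2
        if $doc =~ /\A(?:\xEF\xBB\xBF)?<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/;
    return;
}
```

An XML document that has a prolog but no encoding declaration (and no byte order mark) is UTF-8 by default according to the XML specification, so an undefined result here is not necessarily a dead end.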
Joost.