How can I become universal utf/unicode

on 14.09.2004 01:21:36 by j_and_t

I don't know where else to post this question.

I'm already using LWP::UserAgent and HTML::Parser and successfully fetch and
parse documents without problem. However, I would like to be universal. I'm
using Perl 5.8.3 with the latest HTML::Parser as of today.

Sometimes when fetching a document you know the encoding and sometimes you
have no idea what it is. What I want to know is: how do I convert the incoming
Web page to UTF-8 regardless of its encoding, and also encode entities as
something like Aacute (for keyword matching)?

Maybe I'm stupid because I've tried everything I can think of as well as
following some examples I've found and no matter what I do, it just doesn't
work.

Any help would be appreciated.

Thanks,
John


Re: How can I become universal utf/unicode

on 14.09.2004 02:24:50 by derhoermi

* J and T wrote:
>Sometimes when fetching a document you know the encoding and sometimes you
>have no idea what it is. What I want to know is: how do I convert the incoming
>Web page to UTF-8 regardless of its encoding, and also encode entities as
>something like Aacute (for keyword matching)?

You need to determine the character encoding of the document and then
transcode the byte stream from the determined encoding to UTF-8. There
are a number of rules for determining the character encoding of
text/html resources; unfortunately they are underspecified and
contradict each other. Worse, most documents do not carry any encoding
information, which means you would have to "guess" an encoding, and
others are encoded using a different encoding than the one they
declare. In these cases you would need to either reject the document
or attempt to recover from such problems.
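
For the transcoding step itself, once you have an encoding name, the
core Encode module is enough. The following is only a minimal sketch:
to_utf8 is a name I made up for illustration, and it assumes the
charset name is one that Encode recognises.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: transcode raw bytes from a known source
# encoding to UTF-8. Dies if the bytes are not valid in that encoding
# or if Encode does not know the encoding name.
sub to_utf8 {
    my ($bytes, $charset) = @_;

    # Decode into a Perl character string. FB_CROAK makes malformed
    # input fatal instead of silently substituting characters.
    # $bytes is a private copy, so decode() may safely modify it.
    my $chars = decode($charset, $bytes, Encode::FB_CROAK);

    # Encode the character string back into UTF-8 bytes.
    return encode('UTF-8', $chars);
}

# "caf\xE9" is "cafe" with e-acute in ISO-8859-1.
my $utf8 = to_utf8("caf\xE9", 'ISO-8859-1');
printf "%d bytes in UTF-8\n", length $utf8;
```

Wrapping the call in eval lets you decide per document whether to
reject it or to try another encoding.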

There is an HTML::Encoding module on CPAN that can help you determine
the encoding, but there are probably some bugs, and the interface will
most certainly change once I get around to looking at it again (I
haven't done so for years). It should, however, give you a good
starting point. If that module (or similar code) does not yield any
encoding information, there is Encode::Guess, which helps a bit in
determining the encoding. More sophisticated solutions than
Encode::Guess are, AFAICT, not available on CPAN. You could try to
interface with or reuse code from some web browsers; MSHTML, for
example, performs byte pattern analysis to determine an encoding. A
simpler approach would be to fall back to e.g. Windows-1252. Which of
these you choose depends on how good you would like the results to be.
Over at the W3C Markup Validator we currently attempt to use the
information as HTML::Encoding would report it; if that fails, we fall
back to UTF-8, and if the document is not decodable as UTF-8, it is
rejected. This means that lots of documents are rejected.
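
The fallback chain described above (declared encoding, then UTF-8,
then either a last resort or rejection) could be sketched roughly as
follows. This is not the Validator's code; decode_document and its
parameters are hypothetical names, and how you obtain the declared
charset (HTML::Encoding, HTTP headers, ...) is left out.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper. Returns a character string, or undef if the
# document should be rejected.
#   $declared    - charset found for the document, may be undef
#   $last_resort - optional final fallback, e.g. 'cp1252'
sub decode_document {
    my ($bytes, $declared, $last_resort) = @_;

    # Keep only the candidates that are defined and known to Encode.
    my @candidates = grep { defined($_) && find_encoding($_) }
                     ($declared, 'UTF-8', $last_resort);

    for my $name (@candidates) {
        my $copy  = $bytes;   # decode() with a CHECK value may modify its input
        my $chars = eval { decode($name, $copy, Encode::FB_CROAK) };
        return $chars if defined $chars;
    }
    return;                   # nothing worked: reject the document
}

# No declared charset and not valid UTF-8, so the last resort is used.
my $text = decode_document("caf\xE9", undef, 'cp1252');
print defined $text ? "decoded\n" : "rejected\n";
```

Leaving out $last_resort gives the strict behaviour (reject anything
that is not decodable); a guess from Encode::Guess could be added to
the candidate list before the last resort if you want to try harder.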

Once the input is UTF-8 encoded, you can use HTML::Parser as usual. I
am not sure whether it sets the UTF-8 flag, but either way, it should
report the data in the same encoding so you could set the flag later.
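
As for the flag, rather than switching it on by hand, you can decode
the text you get back, which gives you a proper character string for
keyword matching. A small sketch, assuming the parser handed you UTF-8
bytes without the flag; as_characters and $chunk are made-up names, and
$chunk merely stands in for text delivered to a handler.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: make sure a chunk of text is a character string.
sub as_characters {
    my ($text) = @_;

    # If the flag is already set, assume the text was decoded earlier.
    return $text if utf8::is_utf8($text);

    # Otherwise treat it as UTF-8 bytes; decoding also sets the flag.
    return decode('UTF-8', $text);
}

# Stand-in for parser output: "cafe" with e-acute as UTF-8 bytes.
my $chunk = "caf\xC3\xA9";
my $chars = as_characters($chunk);

printf "%d bytes, %d characters\n", length $chunk, length $chars;
```

With character strings, length, lc and regular expressions work on
characters rather than on individual bytes.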