Website encoding

on 18.11.2004 00:33:23 by Rick Measham

Please forgive this going to both lists but I'm not sure where things
are going wrong...

I have many websites around the world that I need to index. They're
straight HTML pages rather than Perl-served, and thus the headers say the
content-type is 'text/html' without mentioning the encoding.

The source of said pages has a meta-tag that sets the charset:

The page then contains text in the language of its author.
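
For illustration, a minimal sketch of pulling the charset name out of such a meta tag. The helper name meta_charset and the sample tag are hypothetical, and the regex assumes the common http-equiv form; a real HTML parser would be more robust against unusual markup.

```perl
use strict;
use warnings;

# Hypothetical helper: return the charset named in a meta tag,
# or undef when no charset is declared.
sub meta_charset {
    my ($html) = @_;
    # Look inside a <meta ...> tag for charset=NAME, quoted or not.
    if ($html =~ /<meta\b[^>]*\bcharset\s*=\s*["']?([\w.:-]+)/i) {
        return $1;
    }
    return;
}

# Sample tag made up for this sketch, not taken from any real page.
my $sample  = '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">';
my $charset = meta_charset($sample);
print "charset: $charset\n" if defined $charset;
```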

I have several problems (or really one problem with multiple questions).

The task is to retrieve the title, meta description and meta keywords,
store them in a MySQL (4.1) database and then later retrieve them and
put them all on one page.

My thought process is to convert them into utf8 and store that in the
database. Then it's just a case of retrieving them later and outputting
them all on one page marked as utf8.

That being the case, I grab the charset and use Encode's decode function
to turn the text into 'Perl's internal format', which in 5.8.5 is utf8,
right? I then store that in the db.
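
As a point of comparison, here is a minimal sketch of the decode step with an explicit encode step added before storage. The helper name to_utf8_bytes and the sample string are hypothetical. Whether the missing encode is actually what goes wrong here is an assumption, not something established in this post.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: convert raw page bytes in a known charset
# into UTF-8 bytes that could be handed to the database layer.
sub to_utf8_bytes {
    my ($raw_bytes, $charset) = @_;
    # decode() interprets the bytes in the named charset and
    # returns a Perl character string.
    my $chars = decode($charset, $raw_bytes);
    # encode() serialises those characters as UTF-8 bytes.
    return encode('UTF-8', $chars);
}

# Sample input: ISO-8859-1 bytes where 0xFC is u-umlaut and 0xDF is sharp s.
my $latin1 = "Gr\xFC\xDFe";
my $utf8   = to_utf8_bytes($latin1, 'ISO-8859-1');
printf "%d bytes in, %d bytes out\n", length $latin1, length $utf8;
```

Each of the two non-ASCII characters takes one byte in ISO-8859-1 and two bytes in UTF-8, so the output is longer than the input.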

However it's not working.

Does that mean that the encoding of the actual characters on the page is
not in the charset in the meta tag? Or am I missing some piece of the

A random example page would be

This page is in German and *says* the charset is ISO-8859-1. However, the
characters with the umlauts are displaying as unknown chars in a page
tagged as utf8.
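
The symptom can be reproduced in isolation. The sketch below uses a made-up sample string and shows one possible explanation, assuming the ISO-8859-1 bytes reach the utf8 page unconverted: read as UTF-8, the umlaut byte is invalid and gets replaced.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Sample ISO-8859-1 bytes: 0xFC is u-umlaut in that charset.
my $latin1_bytes = "f\xFCr";

# Read as UTF-8, the byte 0xFC is malformed, so Encode's default
# behaviour substitutes U+FFFD, the replacement character.
my $misread = decode('UTF-8', $latin1_bytes);

print "contains replacement character\n" if $misread =~ /\x{FFFD}/;
```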

(You can see the result at )

What am I doing wrong? Please help me someone!

Rick Measham

