Strange encoding in a hotmail message.
on 23.08.2005 18:00:22 by nonmais
Hi,
I received a mail from a friend in an unknown encoding. Most probably it
is not a valid encoding at all; the whole thing must have been wrongly
converted at some stage.
The friend usually sends Russian text, encoded in KOI-ru, from her browser
with a Hotmail (webmail) account. It looks like Hotmail doesn't support UTF-8
at all :-/ but normally I can read it.
Now this time there was ONE character in the message that didn't fit in the
KOI character set ('ü', a German u with umlaut), which caused the whole
system to misbehave.
Here is what I got:
- US-ASCII characters got their usual 1-byte ASCII encoding
- Russian characters got this 2-byte encoding:
Russian capital A (Unicode code point 0410): encoded as A7 A1
Russian capital BE (Unicode code point 0411): encoded as A7 A2
etc...
Russian small a (Unicode code point 0430): encoded as A7 D1
etc...
- the 'ü' (u with umlaut) (Unicode code point 00FC): encoded as A8 B9.
(all values given in hexadecimal).
Any idea what happened? It's definitely not UTF-7, nor UTF-8.
Thanks.
Re: Strange encoding in a hotmail message.
on 23.08.2005 18:47:02 by Kari Hurtta
nonmais writes:
> Hi,
> I received a mail from a friend, with an unknown encoding; most probably it
> is not a valid encoding: instead the whole thing must have been wrongly
> converted at some stage.
>
> The friend usually sends russian text, encoded in KOI-ru, from her browser
> with a hotmail (webmail) account. Looks like hotmail doesn't support UTF-8
> at all :-/ but ok normally I can read it.
>
> Now this time there was ONE character in the message that didn't fit in the
> KOI character set ('ü', a german u with umlaut), which caused the whole
> system to misbehave.
>
> Here is what I got:
> - us-ascii got their usual 1-byte ascii encoding
> - russian characters got this 2-byte encoding:
> russian capital a (unicode code point 0410): encoded as A7 A1
> russian capital be (unicode code point 0411): encoded as A7 A2
> etc...
> russian small a (unicode code point 0430): encoded as A7 D1
> etc...
> - the 'ü' (u with umlaut) (unicode code point 00FC): encoded as A8 B9.
> (all values given in hexadecimal).
>
> Any idea what happened? It's definitely not UTF-7, nor UTF-8.
> Thanks.
What was in the mail headers? What did they claim about the encoding?
Re: Strange encoding in a hotmail message.
on 23.08.2005 19:54:53 by nonmais
Kari Hurtta wrote:
> nonmais writes:
>
>> Hi,
>> I received a mail from a friend, with an unknown encoding; most probably
>> it is not a valid encoding: instead the whole thing must have been
>> wrongly converted at some stage.
>>
>> The friend usually sends russian text, encoded in KOI-ru, from her
>> browser with a hotmail (webmail) account. Looks like hotmail doesn't
>> support UTF-8 at all :-/ but ok normally I can read it.
>>
>> Now this time there was ONE character in the message that didn't fit in
>> the KOI character set ('ü', a german u with umlaut), which caused the
>> whole system to misbehave.
>>
>> Here is what I got:
>> - us-ascii got their usual 1-byte ascii encoding
>> - russian characters got this 2-byte encoding:
>> russian capital a (unicode code point 0410): encoded as A7 A1
>> russian capital be (unicode code point 0411): encoded as A7 A2
>> etc...
>> russian small a (unicode code point 0430): encoded as A7 D1
>> etc...
>> - the 'ü' (u with umlaut) (unicode code point 00FC): encoded as A8 B9.
>> (all values given in hexadecimal).
>>
>> Any idea what happened? It's definitely not UTF-7, nor UTF-8.
>> Thanks.
>
> What was on mail headers? What they claimed about encoding?
There wasn't anything at all about the encodings in the headers!
The content-type header was:
Content-Type: text/html; format=flowed
The whole message itself was in html (and there wasn't anything about the
encoding in the html tags or attributes).
It was sent by the webmail service of hotmail.msn.com. (Well anything could
happen then.)
I've already come to the conclusion that it was never encoded in Unicode,
because the ordering of the letters is not the same as in Unicode (in
Russian there is a letter that looks like an 'e' with umlaut; sometimes it
is ranked after 'e', sometimes not). But all the other Cyrillic encodings I
know use one byte per character, not two.
Re: Strange encoding in a hotmail message.
on 24.08.2005 01:13:38 by Sam
nonmais writes:
> Kari Hurtta wrote:
>
>>
>> What was on mail headers? What they claimed about encoding?
>
> There wasn't anything at all about the encodings in the headers!
> The content-type header was:
> Content-Type: text/html; format=flowed
> The whole message itself was in html (and there wasn't anything about the
> encoding in the html tags or attributes).
> It was sent by the webmail service of hotmail.msn.com. (Well anything could
> happen then.)
Looks like you'll just have to call Microsoft and complain about this bug in
their mail software.
Good luck.
Re: Strange encoding in a hotmail message.
on 24.08.2005 05:56:45 by Dmitry Davletbaev
On 2005-08-23, nonmais wrote:
> [...]
> Any idea what happened? It's definitely not UTF-7, nor UTF-8.
> [...]
UTF-16? Almost everything Microsoft does with Unicode is UTF-16.
--
Dmitry Davletbaev
Re: Strange encoding in a hotmail message.
on 24.08.2005 17:16:14 by Mark Crispin
On Wed, 24 Aug 2005, Dmitry Davletbaev wrote:
> On 2005-08-23, nonmais wrote:
>> Any idea what happened? It's definitely not UTF-7, nor UTF-8.
> UTF-16? Almost all that Microsoft do with Unicode is UTF-16.
No, it's not UTF-16. UTF-16 looks like UCS-2 for characters inside the
BMP; basically, it is UCS-2 with the addition of the surrogate mechanism
to support 16 additional planes. ["16 additional planes" is not a typo;
there are *17* planes in Unicode.]
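A quick way to check this is to compare the reported bytes with what UTF-16 would produce for the same characters. The following is only a minimal Python sketch, assuming big-endian byte order and no BOM; the variable names are illustrative.

```python
# Byte pairs reported in the original post, keyed by the character they stood for.
observed = {
    "\u0410": bytes([0xA7, 0xA1]),  # Cyrillic capital A
    "\u0430": bytes([0xA7, 0xD1]),  # Cyrillic small a
    "\u00fc": bytes([0xA8, 0xB9]),  # u with umlaut
}

# Inside the BMP, UTF-16 is the code point written as one 16-bit unit (as in UCS-2).
utf16 = {ch: ch.encode("utf-16-be") for ch in observed}
for ch, seen in observed.items():
    print(f"U+{ord(ch):04X}  UTF-16BE={utf16[ch].hex()}  observed={seen.hex()}")

# Outside the BMP, UTF-16 falls back to a surrogate pair (two 16-bit units).
surrogate_pair = "\U00010000".encode("utf-16-be")
print(surrogate_pair.hex())
```

None of the observed pairs match. The 1-byte ASCII characters in the message point the same way, since UTF-16 would spend two bytes on those as well.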
-- Mark --
http://staff.washington.edu/mrc
Science does not emerge from voting, party politics, or public debate.
Si vis pacem, para bellum.
Re: Strange encoding in a hotmail message.
on 24.08.2005 17:33:11 by Mark Crispin
On Tue, 23 Aug 2005, nonmais wrote:
> - us-ascii got their usual 1-byte ascii encoding
> - russian characters got this 2-byte encoding:
> russian capital a (unicode code point 0410): encoded as A7 A1
> russian capital be (unicode code point 0411): encoded as A7 A2
> etc...
> russian small a (unicode code point 0430): encoded as A7 D1
> etc...
> - the 'ü' (u with umlaut) (unicode code point 00FC): encoded as A8 B9.
I figured it out. It's an EUC encoding of one of the East Asian character
sets: either JIS X 0208, GB 2312, or GB 12345.
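One way to test a candidate is to decode the reported byte pairs with it and see whether the expected characters come out. A minimal Python sketch for the GB 2312 candidate follows; it assumes Python's built-in "gb2312" codec, which decodes the EUC form of GB 2312. As far as I know the standard library ships no GB 12345 codec, so that candidate is not checked here.

```python
# Byte pairs reported in the original post.
samples = [b"\xa7\xa1", b"\xa7\xa2", b"\xa7\xd1", b"\xa8\xb9"]

# Decode each pair as EUC-encoded GB 2312 and show the resulting code point.
decoded = [pair.decode("gb2312") for pair in samples]
for pair, ch in zip(samples, decoded):
    print(pair.hex(), f"U+{ord(ch):04X}", ch)
```

The four pairs come back as U+0410, U+0411, U+0430 and U+00FC, which is exactly what the original post described.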
-- Mark --
Re: Strange encoding in a hotmail message.
on 25.08.2005 19:21:25 by nonmais
Mark Crispin wrote:
> On Tue, 23 Aug 2005, nonmais wrote:
>> - us-ascii got their usual 1-byte ascii encoding
>> - russian characters got this 2-byte encoding:
>> russian capital a (unicode code point 0410): encoded as A7 A1
>> russian capital be (unicode code point 0411): encoded as A7 A2
>> etc...
>> russian small a (unicode code point 0430): encoded as A7 D1
>> etc...
>> - the 'ü' (u with umlaut) (unicode code point 00FC): encoded as A8 B9.
>
> I figured it out. It's an EUC encoding of one of the East Asian character
> sets: either JIS X 0208, GB 2312, or GB 12345.
Thanks, that was it: GB2312. I'm trying to find an explanation of why it
contains Cyrillic characters. I've only found information on East Asian
characters for these encodings.
Re: Strange encoding in a hotmail message.
on 25.08.2005 20:20:57 by Mark Crispin
On Thu, 25 Aug 2005, nonmais wrote:
>> I figured it out. It's an EUC encoding of one of the East Asian character
>> sets: either JIS X 0208, GB 2312, or GB 12345.
> Thanks, that was it: GB2312. I'm trying to find an explanation of why it
> contains Cyrillic characters. I've only found information on East Asian
> characters for these encodings.
GB 2312 and GB 12345 include codepoints for Cyrillic in their 7th row and
Latin with Western European diacriticals (including umlaut-u) in their 8th
row.
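The row numbers can be read straight off the bytes: in the EUC form, each byte is 0xA0 plus the row or cell number of the 94x94 grid. A small Python sketch, assuming the built-in "gb2312" codec (the helper name euc_to_row_cell is made up for illustration):

```python
def euc_to_row_cell(pair):
    """Map a 2-byte EUC sequence to its (row, cell) position in the 94x94 grid."""
    hi, lo = pair
    if not (0xA1 <= hi <= 0xFE and 0xA1 <= lo <= 0xFE):
        raise ValueError("not a 2-byte EUC sequence")
    return hi - 0xA0, lo - 0xA0

print(euc_to_row_cell(b"\xa7\xa1"))  # Cyrillic capital A: row 7
print(euc_to_row_cell(b"\xa8\xb9"))  # u with umlaut: row 8

# First seven cells of row 7, decoded as GB 2312.
row7_start = bytes(b for cell in range(1, 8) for b in (0xA7, 0xA0 + cell)).decode("gb2312")
print(row7_start)
```

In row 7 the letter IO (the 'e' with two dots) sits directly after IE, whereas Unicode places it at U+0401, ahead of the main alphabet. That would account for the ordering difference noticed earlier in the thread.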
Apparently, the MUA saw that the message could not be represented in
KOI8-R, and promoted it to GB 2312, which can represent the message. I
don't know why it didn't promote it to UTF-8 (Unicode). My guess is that
the MUA is configured with Russian as a primary environment and Chinese as
a secondary environment, and all it cared about in making that selection was
that the Chinese character set could represent the text.
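To make the guess concrete, here is a minimal Python sketch of that kind of "first charset that fits" selection. It is purely hypothetical: the function name, the preference order, and the sample strings are assumptions for illustration, not anything known about Hotmail's actual code.

```python
def pick_charset(text, preferences=("ascii", "koi8_r", "gb2312", "utf_8")):
    """Return the first charset in `preferences` that can encode all of `text`."""
    for name in preferences:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue  # this charset lacks some character; try the next one
        return name
    raise ValueError("no configured charset can represent the text")

plain = pick_charset("Привет")          # Cyrillic only
mixed = pick_charset("Привет, Müller")  # Cyrillic plus one u with umlaut
print(plain, mixed)
```

With this assumed preference order, a single 'ü' is enough to push an otherwise KOI8-R message over to GB 2312, and UTF-8 is never reached.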
Even so, it should have tagged the message with the proper character set.
However, it's not easy to pinpoint blame; it's possible that the message
left the MUA in proper form, but some intermediate MTA damaged it.
I misspoke about JIS X 0208. JIS X 0208 has Cyrillic in its 7th row, but
has box-drawing characters in its 8th row. JIS X 0208 doesn't have Latin with
Western European diacriticals.
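The difference shows up when the same bytes are decoded both ways. A short Python sketch, assuming the built-in "gb2312" and "euc_jp" codecs (EUC-JP carries JIS X 0208 in its 2-byte range):

```python
import unicodedata

pair = b"\xa8\xb9"  # the bytes that stood for the u with umlaut
as_gb = pair.decode("gb2312")   # row 8 of GB 2312
as_jis = pair.decode("euc_jp")  # row 8 of JIS X 0208
print(unicodedata.name(as_gb))
print(unicodedata.name(as_jis))

# The Cyrillic row, by contrast, decodes the same way under both.
same_cyrillic = b"\xa7\xa1".decode("gb2312") == b"\xa7\xa1".decode("euc_jp")
print(same_cyrillic)
```

The Cyrillic bytes alone cannot tell the two apart; only the row 8 character does.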
-- Mark --