Odd character display / UTF issue ?
on 15.10.2007 05:08:19 by unknown
Post removed (X-No-Archive: yes)
Scripsit still me:
> I am working with a simple Paypal shopping cart and having an issue
> with an odd character.
Since you don't include a URL, you're effectively asking for a lump of
guesses.
> In the case of the call from the CGI program, I see a few
> funky  characters displayed on screen.
You have some error in your code.
> The code that is causing them
> is easy to find in the source:
>
>
On 15 Oct, 05:50, "Jukka K. Korpela" wrote:
> Sounds like character encoding confusion. Anything that _looks_ like "? " is
> probably something UTF-8 encoded (or distorted UTF-8) interpreted by some
> 8-bit encoding.
No, characters in a UTF-8 encoding interpreted by a tool using non-
UTF-8 encoding will generally generate garbage characters that are
still displayable (the tool thinks that it received two good
characters, they just don't mean anything). Typically it's a pair of
characters, the first of these is some variant of an accented
"A" (they won't all be, but if you see lots of spurious "A"s on a
page, look to UTF-8).
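A rough illustration of the effect described above, as a minimal sketch assuming Python 3. The helper name misread_as_latin1 and the sample characters are chosen for illustration only and do not come from the original posts. It encodes text as UTF-8 and then decodes the same bytes as ISO-8859-1, which is what a tool with the wrong encoding assumption effectively does.

```python
def misread_as_latin1(text):
    """Encode text as UTF-8, then decode those bytes as ISO-8859-1 (hypothetical helper)."""
    return text.encode("utf-8").decode("iso-8859-1")

# Sample characters (illustrative): no-break space, pound sign, e with acute.
for ch in ["\u00a0", "\u00a3", "\u00e9"]:
    garbled = misread_as_latin1(ch)
    # ascii() keeps the output readable whatever the terminal encoding is.
    print(ascii(ch), "->", ascii(garbled), "length", len(garbled))

# Every character from U+0080 to U+00FF becomes a two-character pair whose
# first character is A with circumflex (U+00C2) or A with tilde (U+00C3).
first_chars = {misread_as_latin1(chr(cp))[0] for cp in range(0x80, 0x100)}
print(sorted(ascii(c) for c in first_chars))
```

Nothing here fails or raises an error: the wrong decoder accepts every byte, which is why the garbage stays displayable.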
To get the unrecognizable character "?" displayed, your tool must
have been able to recognise garbage automatically, i.e. bad encodings,
not just bad characters. This usually indicates non-UTF-8 data
being served as UTF-8, with the tool then unable to process it as
UTF-8. As ASCII text is encoded identically in UTF-8 and ISO-8859-*,
this is most likely caused by non-ASCII characters in an ISO-8859-*
encoding served with a UTF-8 content-type.
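The opposite direction can be sketched the same way, again assuming Python 3. The function name decode_as_utf8 and the sample word are illustrative assumptions. Bytes produced by ISO-8859-1 are decoded as UTF-8. A lenient decoder substitutes U+FFFD REPLACEMENT CHARACTER for the malformed sequence, which many tools render as a question mark or similar. A strict decoder refuses the input outright.

```python
def decode_as_utf8(data):
    """Decode bytes as UTF-8, replacing malformed sequences with U+FFFD."""
    return data.decode("utf-8", errors="replace")

# "cafe" with e-acute, encoded the ISO-8859-1 way: a single byte E9.
latin1_bytes = "caf\u00e9".encode("iso-8859-1")
print(ascii(decode_as_utf8(latin1_bytes)))

# Pure ASCII is the same bytes in both encodings, so it survives untouched.
print(decode_as_utf8("cafe".encode("iso-8859-1")))

# A strict decoder reports the bad encoding instead of substituting.
try:
    latin1_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False
print("strict decode succeeded:", strict_ok)
```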
Scripsit Andy Dingley:
> On 15 Oct, 05:50, "Jukka K. Korpela" wrote:
>> Sounds like character encoding confusion. Anything that _looks_ like
>> "? " is probably something UTF-8 encoded (or distorted UTF-8)
>> interpreted by some 8-bit encoding.
>
> No, characters in a UTF-8 encoding interpreted by a tool using non-
> UTF-8 encoding will generally generate garbage characters that are
> still displayable
That's what I wrote about, using the (iso-8859-1 encoded) character Â
(letter A with circumflex accent) as in the original question. I wonder what
piece of software munged it, but it wasn't anything I was using.
> (the tool thinks that it received two good
> characters, they just don't mean anything).
Two, three or four.
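To make the count concrete, here is a small sketch, assuming Python 3, with a hypothetical helper misread_length and sample characters picked for illustration. It counts how many ISO-8859-1 characters result when the UTF-8 bytes of a single character are misread.

```python
def misread_length(ch):
    """Count the ISO-8859-1 characters produced by misreading ch's UTF-8 bytes."""
    return len(ch.encode("utf-8").decode("iso-8859-1"))

samples = [
    ("e with acute, U+00E9", "\u00e9"),
    ("euro sign, U+20AC", "\u20ac"),
    ("musical symbol G clef, U+1D11E", "\U0001d11e"),
]
for name, ch in samples:
    print(name, "->", misread_length(ch), "characters")
```

The count simply equals the number of octets in the UTF-8 encoding of the character.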
> Typically it's a pair of
> characters, the first of these is some variant of an accented
> "A"
Yes, at least when the 8-bit encoding is ISO-8859-1.
The combination "Â " also indicates some other error, since the octet
combination C2 20 must not appear in UTF-8 encoded data. We have little way
of knowing what happened, but I'd guess that 20 (which looks like space when
interpreted according to ISO-8859-1) was some octet in the range 80..9F,
maybe something that isn't allocated in windows-1252.
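The claim about C2 20 can be checked with a strict decoder. This is a minimal sketch assuming Python 3. The names is_valid_utf8 and unassigned_in_cp1252 are illustrative, and the second helper relies on Python's own cp1252 codec table rather than on anything stated in this thread.

```python
def is_valid_utf8(data):
    """Return True if data decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

def unassigned_in_cp1252():
    """Octets in 80..9F that Python's cp1252 codec refuses to decode."""
    missing = []
    for octet in range(0x80, 0xA0):
        try:
            bytes([octet]).decode("cp1252")
        except UnicodeDecodeError:
            missing.append(octet)
    return missing

# C2 must be followed by a continuation octet in the range 80..BF.
print("C2 20:", is_valid_utf8(b"\xc2\x20"))
print("C2 A0:", is_valid_utf8(b"\xc2\xa0"))  # no-break space, U+00A0
print("C2 81:", is_valid_utf8(b"\xc2\x81"))  # control character U+0081
print("unassigned:", [hex(o) for o in unassigned_in_cp1252()])
```

If some software turned an octet it could not map into a space, a sequence that was originally valid, such as C2 81, would end up as C2 20. That fits the guess above but does not prove it.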
> To get the unrecognizable character "?" displayed,
Which unrecognizable "?"? The question mark is recognizable, and so is the
character "Â", which is what was actually included in the original question.
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/
Post removed (X-No-Archive: yes)
Scripsit still me:
> Any thoughts as to what would cause these characters to display on
> screen in one case but not in the other?
Yeah, you're doing something wrong.
To get more specific help, post a more specific question. That means the
URL. It might not be enough - we might need the CGI script code as well -
but it would be a start. Without a URL, this is basically just babbling, and
not even particularly entertaining.
--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/