Odd character display / UTF issue ?

on 15.10.2007 05:08:19 by unknown

Post removed (X-No-Archive: yes)

Re: Odd character display / UTF issue ?

on 15.10.2007 06:50:09 by jkorpela

Scripsit still me:

> I am working with a simple Paypal shopping cart and having an issue
> with an odd character.

Since you don't include a URL, you're effectively asking for a lump of
guesses.

> In the case of the call from the CGI program, I see a few
> funky characters displayed on screen.

You have some error in your code.

> The code that is causing them
> is easy to find in the source:
>
>
> Â
> Â

Excellent! Now remove those funky characters! Problem solved.

> Here's the strange part: the exact same characters appear in the
> source that returns from the regular FORM/SUBMIT, yet the characters
> don't appear in either browser with the FORM/POST.

Sounds like character encoding confusion. Anything that _looks_ like "Â " is
probably something UTF-8 encoded (or distorted UTF-8) interpreted by some
8-bit encoding.
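
A minimal Python sketch of that effect, assuming (the thread does not confirm
this) that the page contains a no-break space, U+00A0, stored as UTF-8. The
helper name misread_utf8 is made up for illustration.

```python
def misread_utf8(text: str, wrong_encoding: str = "iso-8859-1") -> str:
    """Encode text as UTF-8, then decode the octets with an 8-bit encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

# U+00A0 (no-break space) is the two octets C2 A0 in UTF-8.
# Read as ISO-8859-1, C2 is A with circumflex and A0 is a no-break space.
garbled = misread_utf8("\u00a0")
print(repr(garbled))  # 'Â\xa0'
```

On screen the second character looks like an ordinary space, which would
explain why the stray text reads as "Â ".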

> HTTP:
> Content-Type: text/html; charset=UTF-8
> HTML:
>

Now that might be relevant, but is this really the case? Which encoding do
browsers actually use in interpreting the data?

That is, why don't you include some URLs so that the actual HTTP headers as
well as page content can be analyzed?
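
As a rough illustration of checking what a header actually declares, here is
a small Python sketch that pulls the charset parameter out of a Content-Type
value. The function name declared_charset is hypothetical; the parsing relies
on the standard library's email.message.Message.

```python
from email.message import Message

def declared_charset(content_type: str):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns the parameter lowercased, or None if absent.
    return msg.get_content_charset()

print(declared_charset("text/html; charset=UTF-8"))  # utf-8
print(declared_charset("text/html"))                 # None
```

For a live page, the headers object returned by urllib.request.urlopen(url)
offers the same get_content_charset() method. Either way this only shows what
the server claims, not how the data is really encoded.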

> Any thoughts? Is this a character set issue? I could change the header
> before I issue the page from the CGI if there is a character set that
> would work better.

If you are thinking about "trying" different charset parameters, you surely
have an issue with character sets. What is the _actual_ encoding of the
pages? That is, the encoding of the data itself, as opposed to what headers
or meta tags say about it.
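
One hedged way to probe the actual encoding is to test whether the raw octets
decode cleanly as UTF-8. This is only a sketch: looks_like_utf8 is a made-up
helper, and a positive answer is a strong hint rather than proof.

```python
def looks_like_utf8(data: bytes) -> bool:
    """True if the octets form well-formed UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True

print(looks_like_utf8("é".encode("utf-8")))       # True
print(looks_like_utf8("é".encode("iso-8859-1")))  # False
print(looks_like_utf8(b"plain ASCII"))            # True
```

Pure ASCII data passes as well, since ASCII is a subset of UTF-8, so the test
only says something when the data contains octets above 7F.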

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Re: Odd character display / UTF issue ?

on 15.10.2007 12:18:38 by Andy Dingley

On 15 Oct, 05:50, "Jukka K. Korpela" wrote:
> Sounds like character encoding confusion. Anything that _looks_ like "? " is
> probably something UTF-8 encoded (or distorted UTF-8) interpreted by some
> 8-bit encoding.

No, characters in a UTF-8 encoding interpreted by a tool using non-
UTF-8 encoding will generally generate garbage characters that are
still displayable (the tool thinks that it received two good
characters, they just don't mean anything). Typically it's a pair of
characters, the first of these is some variant of an accented
"A" (they won't all be, but if you see lots of spurious "A"s on a
page, look to UTF-8).

To get the unrecognizable character "?" displayed, your tool must
have been able to recognise garbage automatically, i.e. bad encodings,
not just bad characters. This usually indicates non-UTF-8 characters
being served as UTF-8, and the tool then being unable to process them
as UTF-8. As ASCII is simultaneously valid UTF-8 and ISO-8859-*, the
most likely cause is non-ASCII characters in an ISO-8859-* encoding
served with a UTF-8 content-type.
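
A small Python sketch of this second failure mode, under the assumption that
the consuming tool substitutes a replacement character for undecodable octets
(Python's errors="replace" yields U+FFFD; other tools may show "?" instead).
The helper name serve_as_utf8 is hypothetical.

```python
def serve_as_utf8(text: str, actual_encoding: str = "iso-8859-1") -> str:
    """Encode with an 8-bit encoding, then decode as UTF-8, replacing errors."""
    raw = text.encode(actual_encoding)
    return raw.decode("utf-8", errors="replace")

# The octet E9 ("é" in ISO-8859-1) is not a complete UTF-8 sequence,
# so the decoder substitutes U+FFFD for it.
print(serve_as_utf8("café") == "caf\ufffd")  # True
print(serve_as_utf8("cafe"))                 # cafe
```

The ASCII-only string comes through unchanged, which is why such pages often
look fine until the first accented letter appears.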

Re: Odd character display / UTF issue ?

on 15.10.2007 16:34:11 by jkorpela

Scripsit Andy Dingley:

> On 15 Oct, 05:50, "Jukka K. Korpela" wrote:
>> Sounds like character encoding confusion. Anything that _looks_ like
>> "? " is probably something UTF-8 encoded (or distorted UTF-8)
>> interpreted by some 8-bit encoding.
>
> No, characters in a UTF-8 encoding interpreted by a tool using non-
> UTF-8 encoding will generally generate garbage characters that are
> still displayable

That's what I wrote about, using the (iso-8859-1 encoded) character Â
(letter A with circumflex accent) as in the original question. I wonder what
piece of software munged it, but it wasn't anything I was using.

> (the tool thinks that it received two good
> characters, they just don't mean anything).

Two, three or four.

> Typically it's a pair of
> characters, the first of these is some variant of an accented
> "A"

Yes, at least when the 8-bit encoding is ISO-8859-1.

The combination "Â " also indicates some other error, since the octet
combination C2 20 must not appear in UTF-8 encoded data. We have little way
of knowing what happened, but I'd guess that 20 (which looks like space when
interpreted according to ISO-8859-1) was some octet in the range 80..9F,
maybe something that isn't allocated in windows-1252.
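
The claim about C2 20 can be checked with a short Python sketch. The helper
decode_pair is made up for illustration; it only tries a strict UTF-8 decode
of two octets.

```python
def decode_pair(lead: int, second: int):
    """Return the decoded text, or None if the two octets are not valid UTF-8."""
    try:
        return bytes([lead, second]).decode("utf-8")
    except UnicodeDecodeError:
        return None

print(decode_pair(0xC2, 0x20))        # None: 20 is not a continuation octet
print(repr(decode_pair(0xC2, 0xA0)))  # '\xa0'
print(repr(decode_pair(0xC2, 0x85)))  # '\x85'
```

A lead octet C2 must be followed by an octet in the range 80..BF, so only the
last two pairs decode.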

> To get the unrecognizable character "?" displayed,

Which unrecognizable "?"? The question mark is recognizable, and so is the
character "Â", which is what was actually included in the original question.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Re: Odd character display / UTF issue ?

on 15.10.2007 16:53:50 by unknown

Post removed (X-No-Archive: yes)

Re: Odd character display / UTF issue ?

on 18.10.2007 21:59:14 by jkorpela

Scripsit still me:

> Any thoughts as to what would cause these characters to display on
> screen in one case but not in the other?

Yeah, you're doing something wrong.

To get more specific help, post a more specific question. That means the
URL. It might not be enough - we might need the CGI script code as well -
but it would be a start. Without a URL, this is basically just babbling, and
not even particularly entertaining.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/