String "â€™" translated to apostrophe. Why?

on 18.07.2007 05:13:17 by Richard

This is re-posted from http://groups.google.com/group/JavaScript-Information,
under the (poor) Subject "Script translated". It should have been
posted here in the first place.

I visited http://whytheluckystiff.net/articles/seeingMetaclassesClearly.html,
which provides a neat tutorial about metaclasses in Ruby programming.

I particularly like the GUI the author created and want to emulate his
techniques. In particular, he used the (three character) string â€™
(hex E2 80 99) which translated to ' (ASCII apostrophe) in both
Firefox 2 and HTML-Kit Version 1.0 (Build 292). However, IE7
leaves it untranslated.

I presume the author coded the apostrophe this way for
internationalization. But I don't see why this works in Firefox and
HTML-Kit. Can anyone explain why the following works in those two
browsers?

"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

Apostrophy Test.html

If youâ€™re new to metaprogramming in Ruby

Re: String "â€™" translated to apostrophe. Why?

on 18.07.2007 14:23:53 by Harlan Messinger

Richard wrote:
> This is re-posted from http://groups.google.com/group/JavaScript-Information,
> under the (poor) Subject "Script translated". It should have been
> posted here in the first place.
>
> I visited http://whytheluckystiff.net/articles/seeingMetaclassesClearly.html,
> which provides a neat tutorial about metaclasses in Ruby programming.
>
> I particularly like the GUI the author created and want to emulate his
> techniques. In particular, he used the (three character) string â€™
> (hex E2 80 99) which translated to ' (ASCII apostrophe) in both
> Firefox 2 and HTML-Kit Version 1.0 (Build 292). However, IE7
> leaves it untranslated.

It isn't an ASCII apostrophe. It's Unicode U+201A, "SINGLE LOW-9
QUOTATION MARK".

http://www.fileformat.info/info/unicode/char/201a/index.htm

In my Firefox it looks like an apostrophe, but it shouldn't. If you want
an ASCII apostrophe, type an ASCII apostrophe. Guaranteed to work
regardless of the encoding.

>
> I presume the author coded the apostrophe this way for
> internationalization. But I don't see why this works in Firefox and
> HTML-Kit. Can anyone explain why the following works in those two
> browsers?
>
>
[snip]

If the file is stored locally as ISO-8859-1, then â€™ are stored as E2
80 99. If these are the bytes that are sent to the web client, but the
client is told that the content is encoded as UTF-8, then it will treat
these bytes as such. In UTF-8, U+201A is encoded as E2 80 99. In any
event, there isn't any reason to do this. If this special character is
what you want, just use the hex character code &#x201A; or its decimal
equivalent, &#8218;.
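
A quick way to check byte values like these is a short Python sketch. It is not part of the original post, and the variable names are only illustrative. It decodes the three bytes E2 80 99 once as UTF-8 and once as windows-1252 (the Windows superset of ISO-8859-1 that actually assigns printable characters to 0x80 and 0x99). Note that the UTF-8 reading comes out as U+2019 (RIGHT SINGLE QUOTATION MARK); U+201A would be the bytes E2 80 9A.

```python
import unicodedata

raw = bytes([0xE2, 0x80, 0x99])  # the three bytes reported in the question

# Read as UTF-8, the three bytes form a single character.
as_utf8 = raw.decode("utf-8")
print(f"U+{ord(as_utf8):04X}", unicodedata.name(as_utf8))

# Read as windows-1252, each byte is a separate character.
as_cp1252 = raw.decode("cp1252")
print([f"U+{ord(ch):04X}" for ch in as_cp1252])

# For comparison: the UTF-8 bytes of U+201A (SINGLE LOW-9 QUOTATION MARK).
low9_bytes = "\u201a".encode("utf-8")
print(low9_bytes.hex(" "))
```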

Re: String "â€™" translated to apostrophe. Why?

on 18.07.2007 16:47:35 by jkorpela

Scripsit Richard:

> http://whytheluckystiff.net/articles/seeingMetaclassesClearly.html,

The page declares UTF-8 encoding (in a meta tag only - not really ideal, but
it works), though it seems to use mostly just Ascii characters, representing
other characters using character references like &#8217;. Nothing wrong with
that really, but the author is not making the best possible use of UTF-8.

> I particularly like the GUI the author created and want to emulate his
> techniques.

What GUI? I see no Graphic User Interface there. Just a web page. If you
view it using a graphic browser, then you are using a GUI, but that's a
different issue.

> In particular, he used the (three character) string â€™
> (hex E2 80 99) which translated to ' (ASCII apostrophe) in both
> Firefox 2 and HTML-Kit Version 1.0 (Build 292).

What? Where? I don't see anything like that on the page.

> However, IE7 leaves it untranslated.

You're enigmatic.

> I presume the author coded the apostrophe this way for
> internationalization.

The page has apostrophes written as &#8217;, which is a correct reference,
and modern browsers render it well. They don't map it to ASCII apostrophe,
except perhaps if they need to work with an ASCII-only rendering situation.

> But I don't see why this works in Firefox and
> HTML-Kit.

I don't see what you mean by "this".

> Can anyone explain why the following works in those two
> browsers?

(HTML-Kit is an authoring tool, not a browser.)

>

> If youâ€™re new to metaprogramming in Ruby

Well it doesn't. The string â€™ is rendered literally, as a mess of
characters. Maybe the actual file you used for testing contains something
completely different, though. (As usual, posting a URL...)

You have some confusion here. You have probably played with a program that
converts character references to UTF-8 encoded characters and later you have
interpreted the octets of the UTF-8 representation according to the Windows
Latin 1 (windows-1252) encoding.

It's easy to get confused with character encodings, and difficult to help
people out from a confusion. It's probably best to stop here and start
afresh. What do you really want? To use a punctuation apostrophe (’) on a
web page? Then write &#8217;. Or &rsquo;, if that's easier to remember.
There are other ways too, but these methods work independently of character
encoding and don't make you confused and don't require any particular editor
or UTF-8 support in your authoring software.
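
The scenario described above can be reproduced with a minimal Python sketch. This is an illustration only: the helper names `misread_as_cp1252` and `repair` are hypothetical, and the sample line assumes the page source contained the reference &#8217; in that sentence. It converts the reference to a real character, encodes it as UTF-8, and then wrongly interprets the octets as windows-1252.

```python
import html


def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then wrongly decode the octets as windows-1252."""
    return text.encode("utf-8").decode("cp1252")


def repair(garbled: str) -> str:
    """Undo the misreading: recover the octets, then decode them as UTF-8."""
    return garbled.encode("cp1252").decode("utf-8")


# Assumed sample line with a numeric character reference.
source = "If you&#8217;re new to metaprogramming in Ruby"

converted = html.unescape(source)       # reference -> real U+2019 character
garbled = misread_as_cp1252(converted)  # U+2019 now appears as three characters

print(garbled)
print(repair(garbled) == converted)
```

The repair step only works while the garbled text still contains all three characters intact; once any of them is lost or re-encoded again, the original octets can no longer be recovered this simply.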

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Re: String "â€™" translated to apostrophe. Why?

on 19.07.2007 00:22:04 by dorayme

"Jukka K. Korpela" wrote:

> It's easy to get confused with character encodings, and difficult to help
> people out from a confusion.

Why do you think this is? Is it because it is a _very
complicated_ subject?

--
dorayme

Re: String "â€™" translated to apostrophe. Why?

on 19.07.2007 12:38:02 by jkorpela

Scripsit dorayme:

> "Jukka K. Korpela" wrote:
>
>> It's easy to get confused with character encodings, and difficult to
>> help people out from a confusion.
>
> Why do you think this is?

From my experience with myself and other people.

> Is it because it is a _very complicated_ subject?

Partly, but also because there is so much misinformation about it and
because people _think_ they know something about the issue.

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Re: String "â€™" translated to apostrophe. Why?

on 19.07.2007 16:48:10 by Richard

Hi Harlan,

Thank you for your thoughtful response. With the benefit of your
suggestions, I came to realize that the problem was my using a
screen-capture of the web page. Following are details of my new understanding:

>> I particularly like the GUI the author created and want to emulate his
>> techniques. In particular, he used the (three character) string â€™
>> (hex E2 80 99) which translated to ' (ASCII apostrophe) in both
>> Firefox 2 and HTML-Kit Version 1.0 (Build 292). However, IE7
>> leaves it untranslated.
>
> It isn't an ASCII apostrophe. It's Unicode U+201A, "SINGLE LOW-9 QUOTATION
> MARK".
>
> http://www.fileformat.info/info/unicode/char/201a/index.htm

Thank you very much for this link. I like its explanation of various
encoding schemes.

When I first saw this â€™ string on the subject web page, I saved the
page's image as a text file and opened that file in a hex editor. That's
how I discovered their hex representation E2 80 99.

Now that you mention Unicode encoding, I went back and looked at the web
page's source. It shows that the author coded the entity &#8217;, which
yielded that apostrophe. When I search for that entity, it asks "Are you
looking for Unicode character U+2019: RIGHT SINGLE QUOTATION MARK?". That
makes sense because 0x2019 == 8217. So, the author was trying to present a
"high-class" apostrophe rather than the commonplace 0x27 ('). Finally, now
that I look at that web page more carefully, I see that it *does* present
the high-class apostrophe.
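
For anyone who wants to repeat that arithmetic, here is a small Python sketch. It is an illustration, not something taken from the page in question, and the variable names are made up. It resolves the decimal, hex, and named forms of the reference and prints the Unicode names of the curly and the plain apostrophe.

```python
import html
import unicodedata

# Decimal, hexadecimal, and named forms of the same character reference.
from_decimal = html.unescape("&#8217;")
from_hex = html.unescape("&#x2019;")
from_name = html.unescape("&rsquo;")

print(0x2019 == 8217)
print(from_decimal == from_hex == from_name)

# Compare the "high-class" apostrophe with the plain ASCII one (0x27).
for ch in (from_decimal, "'"):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```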

The last piece of the puzzle is that when I capture the textual
representation of the webpage with a "screen grabber" (SnagIt), it somehow
comes up with that weird three character string.

The bottom line is that I violated the implicit boundaries between various
tools and technologies. I apologize for that ;-)

Again, thank you for giving me some way to understand what was going on
here, and especially for that helpful link.
--
Richard

Re: String "â€™" translated to apostrophe. Why?

on 19.07.2007 17:32:49 by Richard

Hi Yucca,

Thanks for your response. Please take a look at my response to Harlan
confessing that the fundamental problem was mine: I took a screen-capture
of a web page that contained an HTML entity, and that's what apparently
introduced that weird three-letter string.

> What GUI? I see no Graphic User Interface there. Just a web page.

That's true. What I meant is I admired the presentation of this
tutorial. I'm a retired software developer who's done a lot of
teaching of computer technology, e.g. about a dozen years as a college
adjunct lecturer/professor. So I wanted to learn how he achieved some
of his visual effects or graphical effects or, in a sort of shorthand,
GUI.

> Nothing wrong with
> that really, but the author is not making the best possible use of UTF-8.

I'm interested in your assessment. I never really studied these
various encoding schemes. I just picked up a few cryptic lines from
W3C to stick on the top of my HTML without giving it much thought.
In this case, do you think the author should have used something other
than UTF-8 because his page was pure ASCII, save for the HTML
entity? Or do you think the author should have employed other
features supported by UTF-8?

> > In particular, he used the (three character) string â€™
> > (hex E2 80 99) which translated to ' (ASCII apostrophe) in both
> > Firefox 2 and HTML-Kit Version 1.0 (Build 292).
>
> What? Where? I don't see anything like that on the page.

You're right. I was pretty clumsy here in trying to describe this
mess.

> > However, IE7 leaves it untranslated.
>
> You're enigmatic.

Yep. I was wrong here, too. Actually, I got a curly apostrophe
using all three tools.

> > I presume the author coded the apostrophe this way for
> > internationalization.
>
> The page has apostrophes written as &#8217;, which is a correct reference,
> and modern browsers render it well. They don't map it to ASCII apostrophe,
> except perhaps if they need to work with an ASCII-only rendering situation.

I now see that I misinterpreted the scenario. I wasn’t asking why
the browsers didn’t render that entity as an ASCII apostrophe (0x27).
Instead, I incorrectly thought the browsers rendered the entity as an
ASCII apostrophe, and I was asking why the author employed an entity
rather than merely using an ASCII apostrophe directly. But, as I said
to Harlan, the author wanted a closing single quote and that, in
fact, is what Firefox and IE browsers rendered, as did HTML-Kit’s
(AFAIK, built-in) interpreter.

> Maybe the actual file you used for testing contains something
> completely different, though. (As usual, posting a URL...)
>
> You have some confusion here. You have probably played with a program that
> converts character references to UTF-8 encoded characters and later you have
> interpreted the octets of the UTF-8 representation according to the Windows
> Latin 1 (windows-1252) encoding.

You’re right. I was using the file resulting from screen-capture of
the rendered web page. Sorry about that.

Again, thanks for taking all the trouble to figure out what I was
confused about.

Best wishes,
Richard

Re: String "â€™" translated to apostrophe. Why?

on 19.07.2007 18:59:15 by jkorpela

Scripsit Richard:

> In this case, do you think the author should have used something other
> than UTF-8 because his page was pure ASCII,

He could have used US-ASCII, ISO-8859-1, or even windows-1252. Now that
support to UTF-8 in web browsers is rather universal, it doesn't matter much
that you declare it even in cases where your data is really ASCII. On the
other hand, declaring UTF-8 even if you really use ASCII used to help
Netscape 4 into doing the right thing with entity references. But that's
past winter's snow now. What remains is that people who read the source code
get puzzled with the declaration of UTF-8.

> save for the HTML entity?

The entity reference doesn't make the data non-ASCII: it consists of ASCII
characters that represent, by a special convention, a particular character.

> Or do you think the author should have employed other
> features supported by UTF-8?

I'm not saying he _should_, just that it puzzles me why he didn't. (I can
imagine some reasons, like editors that cannot handle real UTF-8.) If your
authoring software supports UTF-8, wouldn't you want to see a real
apostrophe instead of a cryptic reference like &#8217; when you read the
source code?
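
To make the difference between the two approaches concrete, here is a hedged Python sketch (the sample strings and variable names are illustrative assumptions, not taken from the page). It shows that text containing only the character reference produces identical bytes under all the encodings mentioned above, while the literal character depends on the declared encoding.

```python
reference = "you&#8217;re"  # character reference: ASCII characters only
literal = "you\u2019re"     # the real U+2019 character

encodings = ("ascii", "iso-8859-1", "cp1252", "utf-8")

# The reference yields the same bytes under every encoding listed.
reference_bytes = {enc: reference.encode(enc) for enc in encodings}
print(len(set(reference_bytes.values())))

# The literal character depends on the encoding, or cannot be encoded at all.
for enc in encodings:
    try:
        print(enc, literal.encode(enc).hex(" "))
    except UnicodeEncodeError:
        print(enc, "cannot represent U+2019")
```

This is why a page that uses only references can be declared as US-ASCII, ISO-8859-1, windows-1252, or UTF-8 without changing a byte, whereas a page with the literal character must be served with the encoding it was actually saved in.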

--
Jukka K. Korpela ("Yucca")
http://www.cs.tut.fi/~jkorpela/

Re: String "â€™" translated to apostrophe. Why?

on 20.07.2007 01:24:02 by Richard

On Jul 19, 12:59 pm, "Jukka K. Korpela" wrote:
> Scripsit Richard:
>
> > In this case, do you think the author should have used something other
> > than UTF-8 because his page was pure ASCII,
>
> He could have used US-ASCII, ISO-8859-1, or even windows-1252. Now that
> support to UTF-8 in web browsers is rather universal, it doesn't matter much
> that you declare it even in cases where your data is really ASCII. On the
> other hand, declaring UTF-8 even if you really use ASCII used to help
> Netscape 4 into doing the right thing with entity references. But that's
> past winter's snow now. What remains is that people who read the source code
> get puzzled with the declaration of UTF-8.
>
> > save for the HTML entity?
>
> The entity reference doesn't make the data non-ASCII: it consists of ASCII
> characters that represent, by a special convention, a particular character.
>
> > Or do you think the author should have employed other
> > features supported by UTF-8?
>
> I'm not saying he _should_, just that it puzzles me why he didn't. (I can
> imagine some reasons, like editors that cannot handle real UTF-8.) If your
> authoring software supports UTF-8, wouldn't you want to see a real
> apostrophe instead of a cryptic reference like &#8217; when you read the
> source code?
>
> --
> Jukka K. Korpela ("Yucca") http://www.cs.tut.fi/~jkorpela/

Hey Yucca,

Thanks for your further comments. I feel a lot smarter today than I
did yesterday :-)

Best wishes,
Richard