Re: Using SSI to include a UTF-8 encoded file causes a strange character to be sent to the browser

on 07.10.2009 13:07:16 by Chris Biggs

Hi André,

Firstly, thank you very much for your email - the speed with which you
responded is much appreciated.

I am using Notepad purely to simplify and focus on the problem at hand.
The actual HTML files are created from a Web Publishing system that uses
XML and XSL. The user populates the XML via an Applet and when they save
the file it is automatically transformed using the XSL into HTML. These
final pages exhibit the same problem I have described when using Notepad.

And yes, the .shtml file does include the Meta tag you describe!

Regards
Christopher Biggs

----- Original Message -----
From: "André Warnier"
To: users@httpd.apache.org
Sent: Wednesday, 7 October, 2009 09:55:33 GMT +00:00 GMT Britain, Ireland, Portugal
Subject: Re: [users@httpd] Using SSI to include a UTF-8 encoded file causes a strange character to be sent to the browser

Hi.

Chris Biggs wrote:
....
> When these files are saved as "ANSI" (using Notepad)
(or rather in this case, as UTF-8)

Tips :
1) *don't use Notepad to edit HTML pages*. Use a real editor, properly
aware of character sets and encodings, and which will highlight
incorrect UTF-8 characters.
Notepad has a big problem when saving UTF-8 encoded files : it writes a
"BOM" at the beginning of the file, which is not only totally
unnecessary for UTF-8, but also confuses other programs.
A BOM is a sequence of 2 or 3 bytes, meant in some cases to indicate the
"byte order" of the file that follows.
For UTF-8, there is only one valid byte order, so the BOM is not
necessary and could/should be ignored.
However, when such a file with a BOM prefix is being included by some
software in the middle of another file (as you do with SSI), it usually
causes the kind of problem you are seeing : "bizarre" characters in the
middle.
2) use a proper
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
in the <head> section of your html files. That should tell the browser
what the encoding of the page is.
3) But this is really only a substitute for the real standard-conformant
way of indicating the encoding to the browser : the webserver should
send, with each html page, a HTTP header like :
Content-type: text/html; charset=UTF-8
Unfortunately, MS's IE (all versions and sub-versions) has a long
history of ignoring or misinterpreting this part of the HTTP RFC, and
deciding for itself what content the document has.
This is *wrong*, but unfortunately also, in the real world IE is much
used, so one has to learn to work around this.
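
To make tip 1 concrete, here is a minimal sketch of detecting and
removing a UTF-8 BOM from a file before it gets included via SSI. It is
written in Python purely for illustration (nothing in this thread uses
Python), and the function names strip_utf8_bom and strip_bom_in_file are
hypothetical helpers invented for this example. It assumes the UTF-8 BOM
is the three bytes EF BB BF at the very start of the file.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


def strip_bom_in_file(path: str) -> bool:
    """Rewrite the file at path without its UTF-8 BOM.

    Returns True if a BOM was found and removed, False if the file
    was left untouched.
    """
    with open(path, "rb") as fh:
        data = fh.read()
    if not data.startswith(codecs.BOM_UTF8):
        return False
    with open(path, "wb") as fh:
        fh.write(strip_utf8_bom(data))
    return True
```

Running something like strip_bom_in_file on the included fragment (not
on a copy you still want to open in Notepad) should leave its content
unchanged apart from the first three bytes. When those three bytes are
displayed as Windows-1252 or ISO-8859-1 they show up as "ï»¿", which is
one common form of the "bizarre" characters described above.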


------------------------------------------------------------ ---------
The official User-To-User support forum of the Apache HTTP Server Project.
See for more info.
To unsubscribe, e-mail: users-unsubscribe@httpd.apache.org
" from the digest: users-digest-unsubscribe@httpd.apache.org
For additional commands, e-mail: users-help@httpd.apache.org

