Re: svn commit: r567258 - in /jakarta/site: docs/ docs/site/ docs/site/downloads/ docs/site/news/ do

on 19.08.2007 12:22:59 by sebb

On 19/08/07, Roland Weber wrote:
> sebb wrote:
> > Is there a way to fix build.xml so that the user's default encoding
> > does not affect the output? Or perhaps we could add a check and warn
> > if the encoding is wrong?
> >
> > The xml source files are already flagged as ISO-8859-1, as is the
> > stylesheet, which uses output encoding ISO-8859-1 as well, which one
> > might have hoped would be enough...
>
> I don't know what the exact symptoms of the problem are.

Here is a sample diff:

http://svn.apache.org/viewvc/jakarta/site/docs/site/news/200206.html?r1=567256&r2=567257

The u-umlaut characters were replaced by ?

[But I don't know exactly how the mangled version was generated.]
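
One way this exact symptom can arise, offered only as an illustration since the thread does not establish the actual cause: text is written through a charset that cannot represent the u-umlaut, and the encoder substitutes a replacement character. A minimal Python sketch, in which the sample string and the choice of US-ASCII as the lossy charset are assumptions:

```python
# Illustration only: the sample text and the lossy target charset are assumptions.
sample = "M\u00fcller"  # contains a u-umlaut (U+00FC)

# ISO-8859-1 can represent the character as the single byte 0xFC.
as_latin1 = sample.encode("iso-8859-1")

# US-ASCII cannot, so an encoder in "replace" mode substitutes "?".
as_ascii = sample.encode("ascii", errors="replace")

print(as_latin1)
print(as_ascii)
```

The second result shows the same "?" substitution as the diff; whether a platform default encoding played that role here is not confirmed.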

> This is what the XSLT spec says about output encodings [1]:
>
> > The encoding attribute specifies the preferred encoding to use for
> > outputting the result tree. XSLT processors are required to respect
> > values of UTF-8 and UTF-16. For other values, if the XSLT processor
> > does not support the specified encoding it may signal an error; if
> > it does not signal an error it should use UTF-8 or UTF-16 instead.

Ah, thanks - that could well explain the problem.

> Is the output generated in UTF-8 or UTF-16? Then the solution
> would be to use one of those as the output encoding, since only
> those are required to be supported on all platforms.

The output is currently generated in iso-8859-1 (or iso-8859-15); the
input is specified using either an actual u-umlaut or an entity
reference for it.

Unfortunately changing to UTF-8 would mean changing all the html files...

I'll see about adding a check - should be easy enough to generate a
dummy html file from an xml containing some accented characters and
check that the result is as expected.
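
Such a check could look roughly like the following Python sketch. It only covers the comparison step; the function name `output_is_intact` and the sample byte strings are hypothetical stand-ins for whatever the build would generate from the dummy xml.

```python
def output_is_intact(generated: bytes, expected_text: str,
                     encoding: str = "iso-8859-1") -> bool:
    """Return True if the generated bytes contain expected_text in the given encoding."""
    return expected_text.encode(encoding) in generated


# Hypothetical stand-ins for the generated dummy html file.
good = "<p>M\u00fcller</p>".encode("iso-8859-1")
mangled = b"<p>M?ller</p>"

print(output_is_intact(good, "M\u00fcller"))
print(output_is_intact(mangled, "M\u00fcller"))
```

In a real build the bytes would be read from the generated file in binary mode, so that no platform default encoding is applied while checking.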

> cheers,
> Roland
>
> [1] http://www.w3.org/TR/xslt#section-XML-Output-Method

Re: svn commit: r567258

on 19.08.2007 13:37:31 by Roland Weber

Hi Sebastian,

> The u-umlaut characters were replaced by ?
>
> [But I don't know exactly how the mangled version was generated.]
>
> The output is currently generated in iso-8859-1 (or iso-8859-15); the
> input is specified using either an actual u-umlaut or an entity
> reference for it.

That's a nasty one to track down. Apart from the encoding specs in
the style sheet, there's also the encoding in the XML declaration
(the <?xml ?> line) of the source file to consider. The source file
specifies ISO-8859-1. I wonder whether svn might screw up the charset
on co/ci. Isn't there also a tool that does some postprocessing
in order to normalize the XML? If one XML processor generates
UTF-8 or UTF-16 instead of the specified ISO-8859-1, and the next
processor expects ISO-* as input, the data could get screwed up.
You'd have to chase the whole chain from input to final output.
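
The kind of mismatch between two stages can be sketched in Python as follows. This is an illustration under assumed data, not a reconstruction of the actual site build; the sample string and the two-stage setup are hypothetical.

```python
original = "M\u00fcller"  # sample text with a u-umlaut (assumption)

# Case 1: a stage writes UTF-8, the next stage reads it as ISO-8859-1.
utf8_bytes = original.encode("utf-8")
misread = utf8_bytes.decode("iso-8859-1")  # no error, but two wrong characters

# Case 2: a stage writes ISO-8859-1, the next stage reads it as UTF-8.
latin1_bytes = original.encode("iso-8859-1")
decoded = latin1_bytes.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD

# Writing the damaged text back as ISO-8859-1 turns U+FFFD into "?".
rewritten = decoded.encode("iso-8859-1", errors="replace")

print(misread)
print(rewritten)
```

The first case produces two-character garbage rather than "?", while the second ends with a literal "?" in the output, which is why knowing the exact symptom helps narrow down which stage is at fault.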

> I'll see about adding a check - should be easy enough to generate a
> dummy html file from an xml containing some accented characters and
> check that the result is as expected.

That's probably the best approach.

cheers,
Roland

Re: svn commit: r567258

on 19.08.2007 13:58:29 by Roland Weber

The JDK version used may also have something to do with it:
http://issues.apache.org/bugzilla/show_bug.cgi?id=38781

cheers,
Roland