Re: [XML::Simple-2.12] problems parsing non ASCII strings

on 12.07.2005 19:16:53 by Michel Rodriguez

Jul wrote:
> module: XML::Simple-2.12 (also tried 2.14)
> perl version: 5.00503

Wow! Do you know how old this is? 5 or 6 years old?

> I need to parse and write an XML configuration file which contains
> non-ASCII characters (like 'é', in French).
> I've chosen XML::Simple with XML::Parser for these tasks, but everything
> works fine if and only if I do not include any special character in the
> file; otherwise the HASH returned by XMLin() is totally messed up.

What is the encoding of your file? My guess is that it is in either
ISO-8859-1 (or -15) or some kind of Windows-12nn.

What happens is that the data is read, probably by expat, and converted
to UTF-8. The "totally messed up" characters are in fact perfectly valid
UTF-8 characters that your terminal (or whatever you use to display
them) is not set up to display.

If XML::Simple can read it then the encoding must be declared in the XML
declaration, at the beginning of the XML file.
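
As a rough illustration of what "declared in the XML declaration" means, the sketch below checks the first line of a file for an encoding pseudo-attribute. The sample declaration and the helper name declared_encoding are assumptions made up for this example; they are not taken from the original website.xml. When no encoding is declared, an XML parser falls back to UTF-8 (or UTF-16 if a byte order mark is present), not to ISO-8859-1.

```perl
use strict;
use warnings;

# Hypothetical first line of an XML file, with the encoding declared
my $first_line = '<?xml version="1.0" encoding="ISO-8859-1"?>';

# Return the declared encoding, or undef when the declaration omits it
sub declared_encoding {
    my ($line) = @_;
    return $line =~ /^<\?xml\b[^>]*\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/
        ? $2
        : undef;
}

my $encoding = declared_encoding($first_line);
print defined $encoding
    ? "declared encoding: $encoding\n"
    : "no encoding declared, the parser will assume UTF-8\n";
```

This only inspects the declaration; it does not verify that the bytes in the file really are in that encoding.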

Your choices are either to convert those characters back to the original
encoding, look at the Unicode::* modules on CPAN, or to bite the Unicode
bullet and learn how to work with UTF-8 data. In the long run the second
option makes more sense, but YMMV.
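
For the first option, here is a minimal sketch of converting a value back to ISO-8859-1. It assumes the value arrives as raw UTF-8 bytes, as described above, and it uses the core Encode module, which only ships with perl 5.8 and later, so it will not run as-is on 5.00503. The sample string is made up for the example.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Sample value as raw UTF-8 bytes: the word "été" (assumed test data)
my $utf8_bytes = "\xC3\xA9t\xC3\xA9";

# Decode the UTF-8 bytes into Perl characters, then re-encode as ISO-8859-1
my $chars  = decode('UTF-8', $utf8_bytes);
my $latin1 = encode('ISO-8859-1', $chars);

printf "%d bytes as UTF-8, %d bytes as ISO-8859-1\n",
    length($utf8_bytes), length($latin1);
```

Each 'é' takes two bytes in UTF-8 and one byte in ISO-8859-1, which is why the same text looks garbled when UTF-8 bytes are displayed as Latin-1.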

But really, processing XML with perl 5.00503 seems like a bad idea to me.

--
mirod

[XML::Simple-2.12] problems parsing non ASCII strings

on 12.07.2005 19:56:58 by Jul

module: XML::Simple-2.12 (also tried 2.14)
perl version: 5.00503


Hello,

I need to parse and write an XML configuration file which contains
non-ASCII characters (like 'é', in French).
I've chosen XML::Simple with XML::Parser for these tasks, but everything
works fine if and only if I do not include any special character in the
file; otherwise the HASH returned by XMLin() is totally messed up.
Below is the configuration file 'website.xml', followed by a dump of the
hash returned for it.

Thank you for any help you can provide.


Julien


# website.xml



person@foo.ext



person@foo.ext



person@foo.ext


Full name



# Data::Dump of the returned HASH ref

{
  contact => {
    email => "person\@foo.ext",
    label => "Informations g\n \n \n person\@foo.ext\n Directeur de collection\n \n \n person\@foo.ext\n Webmestre\n \n Full name\n",
  },
}

Re: [XML::Simple-2.12] problems parsing non ASCII strings

on 13.07.2005 00:44:13 by Jul

On Tue, 12 Jul 2005 19:16:53 +0200, Michel Rodriguez wrote:

> Jul wrote:
>> module: XML::Simple-2.12 (also tried 2.14)
>> perl version: 5.00503
>
> Wow! Do you know how old this is? 5 or 6 years old?

I know it's very, very old, which is why I mentioned it. I'm looking for a
way to work around it, like I did for the other perl 5.6 modules I use :o)
I guess we could sometimes rename "hosting solutions" to "hosting problems",
but that would be less attractive to the customer ;-)

>> I need to parse and write an XML configuration file which contains
>> non-ASCII characters (like 'é', in French). I've chosen XML::Simple
>> with XML::Parser for these tasks, but everything works fine if and only
>> if I do not include any special character in the file; otherwise the HASH
>> returned by XMLin() is totally messed up.
>
> What is the encoding of your file? My guess is that it is in either
> ISO-8859-1 (or -15) or some kind of Windows-12nn.
>
> What happens is that the data is read, probably by expat, and converted
> to UTF-8. The "totally messed up" characters are in fact perfectly valid
> UTF-8 characters that your terminal (or whatever you use to display
> them) is not set up to display.
>
> If XML::Simple can read it then the encoding must be declared in the XML
> declaration, at the beginning of the XML file.

I assumed the default encoding would be ISO-8859-1 or -15, which is why I
expected to get the same encoding back.
With the encoding attribute set in the declaration, it works better, you're
right, and I was surprised to see that UTF-8 is also supported, even
with perl 5.005 :-)

> Your choices are either to convert those characters back to the original
> encoding, look at the Unicode::* modules on CPAN, or to bite the Unicode
> bullet and learn how to work with UTF-8 data. In the long run the second
> option makes more sense, but YMMV.

Now the original character is displayed as ISO-8859-15, but encoded
in UTF-8. You're right again! lol
At this point, I wonder whether UTF-8 is the default charset or whether there
is an option available for XML::Simple or XML::Parser. I took a look at
the documentation for those modules but didn't find much.
Otherwise, I'll try to convert the data outside XML::Simple.

> But really, processing XML with perl 5.00503 seems like a bad idea to me.

I agree with you, but I have no choice right now. I have perl 5.005 on one
hand and a project to get off the ground on the other. That is what I have
to deal with. Maybe another way of parsing a configuration file would be
easier, but I like having a reason to play with XML, and I didn't really
find what I wanted in the modules I tested previously.


Thank you very much for your help, it's been really useful to me.


Julien

Re: [XML::Simple-2.12] problems parsing non ASCII strings

on 13.07.2005 07:39:00 by Michel Rodriguez

Jul wrote:

> Now the original character is displayed as ISO-8859-15, but encoded
> in UTF-8. You're right again! lol
> At this point, I wonder whether UTF-8 is the default charset or whether there
> is an option available for XML::Simple or XML::Parser. I took a look at
> the documentation for those modules but didn't find much.
> Otherwise, I'll try to convert the data outside XML::Simple.

There is no easy way to get back to the original encoding in
XML::Simple. To get the file written back as ISO-8859-15 you can pipe
the output through iconv.

You could also use XML::Twig:
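    use XML::Twig;

    my $options= { ...}; # XML::Simple options
    # keep_encoding => 1 leaves the data in the file's original encoding
    my $data= XML::Twig->new( keep_encoding => 1)
                       ->parsefile( $file)
                       ->root
                       ->simplify( %$options)  # same structure XMLin would build
                       ;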

This will do exactly the same thing as XMLin, except for the bit where
it keeps the original encoding.


Does it help?

--
mirod

Re: [XML::Simple-2.12] problems parsing non ASCII strings

on 17.07.2005 02:55:47 by Jul

Michel Rodriguez put forward the following idea:
> Jul wrote:
>
>> Now the original character is displayed as ISO-8859-15, but encoded
>> in UTF-8. You're right again! lol
>> At this point, I wonder whether UTF-8 is the default charset or whether there
>> is an option available for XML::Simple or XML::Parser. I took a look at
>> the documentation for those modules but didn't find much.
>> Otherwise, I'll try to convert the data outside XML::Simple.
>
> There is no easy way to get back to the original encoding in XML::Simple. To
> get the file written back as ISO-8859-15 you can pipe the output through
> iconv.
>
> You could also use XML::Twig:
>     my $options= { ...}; # XML::Simple options
>     my $twig= XML::Twig->new( keep_encoding => 1)
>                        ->parsefile( $file)
>                        ->root
>                        ->simplify
>                        ;
>
> This will do exactly the same thing as XMLin, except for the bit where it
> keeps the original encoding.
>
> Does it help?


Hello Michel,

I've dropped ISO-8859-15 in favour of UTF-8.
Since the web browser interface transfers the text field strings in the page
encoding, I've set the page encoding to utf-8.
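
A minimal sketch of that setup, assuming a plain CGI script: both the HTTP header and the page's meta tag declare UTF-8, so the browser should submit the form fields as UTF-8 too. The helper name utf8_form_page, the field name and the form target are hypothetical, chosen only for illustration.

```perl
use strict;
use warnings;

# Build a CGI response whose header and <meta> tag both declare UTF-8
sub utf8_form_page {
    my ($action) = @_;   # hypothetical form target
    my $header = "Content-Type: text/html; charset=utf-8\r\n\r\n";
    my $body = <<"HTML";
<html>
<head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head>
<body>
<form method="post" action="$action" accept-charset="utf-8">
<input type="text" name="label">
<input type="submit">
</form>
</body>
</html>
HTML
    return $header . $body;
}

print utf8_form_page('/cgi-bin/save.pl');
```

With the page served this way, the submitted strings can be written to the XML file without any conversion step, as long as the file is also declared as UTF-8.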

Thank you again,


Julien

--
Jul... reappeared as if by magic