UTF-8 problem

on 22.08.2007 00:23:17 by Todor Vachkov

Hello all,

I'm trying to convert an exported XML file into a Perl data structure with the XML::LibXML module.
I get this error message:

>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>encoding !
>Bytes: 0xE2 0x26 0x6C 0x74

I thought the solution would be:

>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
>my $parser = XML::LibXML->new();
>my $dom = $parser->parse_fh($fh);
>my $root = $dom->getDocumentElement;

but this produces a very long list of error messages (possibly one for each character parsed in the XML file):

>utf8 "\xE2" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
..
..
..
>utf8 "\xE4" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
>Segmentation fault

The segmentation fault always occurs at the same \xE4 character, but that's a secondary problem.
I just want the module to parse the XML file, which is really large (over 20 MB)
and was exported from another piece of software, so I have no influence over what goes into it.

I hope you can help me! Thanks in advance!

Greetings Todor

Re: UTF-8 problem

on 22.08.2007 04:03:12 by sln

On Wed, 22 Aug 2007 00:23:17 +0200, Todor Vachkov wrote:

XML restricts which Unicode characters may appear in tag names; attribute names share
this restriction.
If the name were the problem, the parser would have reported an invalid
start or end tag, or at least it should have. I don't think that type of escape
character should appear in content, but I could be wrong.

You should try a preliminary parser to get to the actual location of your problem,
like RxParse 1.1. It's a formal SAX parser as well, but it also has built-in debugging
that writes a log file of the parsed XML. RxParse is a pure-Perl parser and probably
the fastest on the planet.

As a guideline, these are the XML 1.1 Unicode ranges for tag/attribute names (clearly E2 is in there):

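# Non-ASCII ranges allowed as the first character of an XML 1.1 name
my @UC_Nstart = (
"\\x{C0}-\\x{D6}",
"\\x{D8}-\\x{F6}",
"\\x{F8}-\\x{2FF}",
"\\x{370}-\\x{37D}",
"\\x{37F}-\\x{1FFF}",
"\\x{200C}-\\x{200D}",
"\\x{2070}-\\x{218F}",
"\\x{2C00}-\\x{2FEF}",
"\\x{3001}-\\x{D7FF}",
"\\x{F900}-\\x{FDCF}",
"\\x{FDF0}-\\x{FFFD}",
"\\x{10000}-\\x{EFFFF}",
);
# Additional ranges allowed after the first character
my @UC_Nchar = (
"\\x{B7}",
"\\x{0300}-\\x{036F}",
"\\x{203F}-\\x{2040}",
);
# Character classes for the first and the following characters of a name
my $Nstrt = "[A-Za-z_:".join ('',@UC_Nstart)."]";
my $Nchar = "[-\\w:\\.".join ('',@UC_Nchar).join ('',@UC_Nstart)."]";
# Regex source that matches one tag or attribute name
my $Name = "(?:$Nstrt$Nchar*?)";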

Re: UTF-8 problem

on 22.08.2007 04:19:02 by 1usa

Todor Vachkov wrote in
news:5j16ulF3saghfU1@mid.dfncis.de:

> Hello all,
>
> I'm trying to convert an exported xml file into a perl data structre
> with the XML::LibXML modul. Thus I got this error message:
>
>>Entity: line 315442: parser error : Input is not proper UTF-8,
>>indicate encoding !
>>Bytes: 0xE2 0x26 0x6C 0x74
>
> I thought the solution would be:
>
>>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');

The file contents are not UTF-8. Specify the real encoding.

Sinan

PS: Avoid RxParse

--
A. Sinan Unur <1usa@llenroc.ude.invalid>
(remove .invalid and reverse each component for email address)
clpmisc guidelines:

Re: UTF-8 problem

on 22.08.2007 16:11:12 by Ted Zlatanov

On Wed, 22 Aug 2007 00:23:17 +0200 Todor Vachkov wrote:

TV> Hello all,
TV> I'm trying to convert an exported xml file into a perl data structre with the XML::LibXML modul.
TV> Thus I got this error message:

>> Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>> encoding !
>> Bytes: 0xE2 0x26 0x6C 0x74

TV> I thought the solution would be:

>> open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
>> my $parser = XML::LibXML->new();
>> my $dom = $parser->parse_fh($fh);
>> my $root = $dom->getDocumentElement;

TV> but this produce a long long list (maybe for each parsed character in the xml file) of error messages :

>> utf8 "\xE2" does not map to Unicode at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
TV> .
TV> .
TV> .
>> utf8 "\xE4" does not map to Unicode >at /perlmodules/lib/i586-linux-thread-multi/XML/LibXML.pm line 429.
>> Segmentation fault

TV> The segmentaion fail always at the same \xE4 character, but it's a secondary problem.
TV> I just want to let the modul to parse the xml file, which is really large (over 20MB)
TV> and has being exported from another software. Thus I haven't any influence what comes into it.

Can you post the first 50 lines of the file, or put a smaller
complete version of it online somewhere so we can examine it? Your post
doesn't help at all with finding the problem (we can only guess that
your input file is not valid).

Ted
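
When the file itself cannot be posted, one way to narrow things down is to list the lines that are not well-formed UTF-8. The following is only a sketch using the core Encode module; the sub name `invalid_utf8_lines` and the three-line sample are made up for illustration, and a 20 MB export would more sensibly be read line by line from a handle opened with `:raw` than held in one string.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK);

# Return the 1-based numbers of all lines in $bytes that are not
# well-formed UTF-8. Splitting on "\n" is safe because byte 0x0A never
# occurs inside a UTF-8 multi-byte sequence.
sub invalid_utf8_lines {
    my ($bytes) = @_;
    my @bad;
    my $line_no = 0;
    for my $line (split /\n/, $bytes, -1) {
        $line_no++;
        my $copy = $line;    # decode() with FB_CROAK may modify its argument
        push @bad, $line_no
            unless eval { decode('UTF-8', $copy, FB_CROAK); 1 };
    }
    return @bad;
}

# Hypothetical sample: line 2 carries a bare 0xE2 byte, line 3 holds the
# same character correctly encoded as 0xC3 0xA2.
my $sample = "<a>ok</a>\n<sup>\xE2</sup>\n<b>\xC3\xA2</b>\n";
my @bad = invalid_utf8_lines($sample);
print "invalid UTF-8 on line(s): @bad\n";
```

Posting only the reported lines (plus the first line of the file) is usually enough for others to see what went wrong.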

Re: UTF-8 problem

on 22.08.2007 17:55:50 by Todor Vachkov

Thanks for your replies!

The XML file is really huge: it has 666,025 lines and is the result of an export from a piece of software.

It contains:
- the meta description of the software itself (I am pretty sure that it conforms to UTF-8)
- form inputs made by users, who fill in information about several
databases. The goal is to have a distributed search engine. (Again, I assume that the software
also saves the inputs in UTF-8.)
- Perl scripts for each database, written by various programmers. The scripts are
the interfaces between the databases and the software (UTF-8 encoding of the scripts is not guaranteed)
All of this is contained in the huge XML file.

Parsing the file with XML::LibXML gives:

>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>encoding !
>Bytes: 0xE2 0x26 0x6C 0x74

I've figured out that these are the characters:

* U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
â (Â)

* U+0026 AMPERSAND
&

* U+006C LATIN SMALL LETTER L
l (L)

* U+0074 LATIN SMALL LETTER T
t (T)

Line 315442 looks like this:
><refpt id="bafn1"/><lk refid="afn1"><sup>â</sup></lk>
^

The element contains a single line from a Perl script, as mentioned above. The character 0xE2 is where
the parser stopped, at line 315442; it got quite far, almost halfway through the file.

It seems that the Perl scripts inside are my problem. I'm wondering why the parser treats this single character
as a non-UTF-8 code point. Could I somehow tell the parser to ignore it?

Thanks for your help!

Greetings, Todor

Re: UTF-8 problem

on 22.08.2007 21:08:44 by Martijn Lievaart

On Wed, 22 Aug 2007 17:55:50 +0200, Todor Vachkov wrote:

> Parsing the file with XML::LibXML gives:
>
> >Entity: line 315442: parser error : Input is not proper UTF-8,
> >indicate encoding !
> >Bytes: 0xE2 0x26 0x6C 0x74
>
> I've figured out that this are the characters :
>
> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
> â (Â)

U+00E2 is the Unicode code point. In UTF-8 encoding this would be a two-byte
sequence (0xC3 0xA2), so a lone 0xE2 byte means your input is not proper UTF-8.

HTH,
M4
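
The difference between the code point and its encoded bytes can be checked with the core Encode module. This is a small sketch, not part of the original exchange; the helper names `hex_bytes` and `is_valid_utf8` are invented here.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Render a byte string as space-separated hex, e.g. "C3 A2".
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02X', ord } split //, $bytes;
}

# Return 1 if $bytes is well-formed UTF-8, 0 otherwise.
sub is_valid_utf8 {
    my ($bytes) = @_;    # a copy, because FB_CROAK may modify the argument
    return eval { decode('UTF-8', $bytes, FB_CROAK); 1 } ? 1 : 0;
}

my $char = "\x{E2}";    # U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX

my $as_utf8   = hex_bytes(encode('UTF-8',      $char));
my $as_latin1 = hex_bytes(encode('ISO-8859-1', $char));
print "UTF-8:      $as_utf8\n";
print "ISO-8859-1: $as_latin1\n";

# The four bytes from the parser error: 0xE2 starts a three-byte UTF-8
# sequence, but 0x26 ('&') is not a continuation byte.
my $from_error = "\xE2\x26\x6C\x74";
print "error bytes valid UTF-8? ", (is_valid_utf8($from_error) ? 'yes' : 'no'), "\n";
```

A single 0xE2 byte is what ISO-8859-1 produces for this character, which is why the byte sequence in the error message fails the UTF-8 check.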

Re: UTF-8 problem

on 22.08.2007 21:52:16 by Todor Vachkov

Martijn Lievaart wrote:

>> Parsing the file with XML::LibXML gives:
>>
>> >Entity: line 315442: parser error : Input is not proper UTF-8,
>> >indicate encoding !
>> >Bytes: 0xE2 0x26 0x6C 0x74
>>
>> I've figured out that this are the characters :
>>
>> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
>> â (Â)
>
> U+00E2 is Unicode. In utf-8 encoding this would be a two character
> sequence. So your input is not proper utf-8.

Thanks for your posting!

The parser says:
>Bytes: 0xE2 0x26 0x6C 0x74
So 0xE2 is meant to be the problematic character.

U+00E2 was not in the error message; I just pasted the output of my check on Linux with:
user@timemashine:~$ unicode 0xe2
U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
UTF-8: c3 a2 UTF-16BE: 00e2 Decimal: &#226;
â (Â)
Uppercase: U+00C2
Category: Ll (Letter, Lowercase)
Bidi: L (Left-to-Right)
Decomposition: 0061 0302

Greetings Todor

Re: UTF-8 problem

on 22.08.2007 22:32:01 by Martijn Lievaart

On Wed, 22 Aug 2007 21:52:16 +0200, Todor Vachkov wrote:

> Martijn Lievaart wrote:
>
>>> Parsing the file with XML::LibXML gives:
>>>
>>> >Entity: line 315442: parser error : Input is not proper
>>> >UTF-8, indicate encoding !
>>> >Bytes: 0xE2 0x26 0x6C 0x74
>>>
>>> I've figured out that this are the characters :
>>>
>>> * U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX
>>> â (Â)
>>
>> U+00E2 is Unicode. In utf-8 encoding this would be a two character
>> sequence. So your input is not proper utf-8.
>
> Thanks for your posting!
>
> The parser says:
> >Bytes: 0xE2 0x26 0x6C 0x74
> So 0xE2 is meant to be the problematic character.
>
> U+00E2 was not in the error message, I've just pasted the output of my
> check on linux with:
> user@timemashine:~$ unicode 0xe2
> U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX UTF-8: c3 a2
> UTF-16BE: 00e2 Decimal: &#226; â (Â)
> Uppercase: U+00C2
> Category: Ll (Letter, Lowercase)
> Bidi: L (Left-to-Right)
> Decomposition: 0061 0302

But 0xE2 seems to be the problematic character. It is not UTF-8! Your
input file is most probably encoded in Latin-1 (ISO-8859-1) or Latin-9
(ISO-8859-15), not UTF-8.

M4
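
A rough way to test such a guess is to count how many high bytes belong to UTF-8 multi-byte sequences and how many stand alone. The sketch below is a heuristic only: the pattern is deliberately simplified, and the name `high_byte_stats` and the sample string are assumptions made for illustration.

```perl
use strict;
use warnings;

# Simplified pattern for one UTF-8 multi-byte sequence. It does not reject
# overlong forms or surrogates, which is acceptable for a rough count.
my $UTF8_SEQ = qr/[\xC2-\xDF][\x80-\xBF]|[\xE0-\xEF][\x80-\xBF]{2}|[\xF0-\xF4][\x80-\xBF]{3}/;

# Return (number of UTF-8 multi-byte sequences, number of stray high bytes).
sub high_byte_stats {
    my ($bytes) = @_;
    # Blank out every multi-byte sequence; s///g returns how many it replaced.
    my $seqs = ($bytes =~ s/$UTF8_SEQ/ /g) || 0;
    # Any byte >= 0x80 that is still present is not part of a UTF-8 sequence.
    my $stray = () = $bytes =~ /[\x80-\xFF]/g;
    return ($seqs, $stray);
}

# Hypothetical sample: one proper UTF-8 sequence (C3 A4) plus two lone
# high bytes (E2 and E4) of the kind a Latin-1 source would contain.
my ($seqs, $stray) = high_byte_stats("a\xC3\xA4b \xE2&lt; \xE4x");
print "UTF-8 sequences: $seqs, stray high bytes: $stray\n";
```

Many sequences and few stray bytes suggest a UTF-8 file with some foreign fragments; the opposite suggests a single-byte encoding throughout. The count cannot tell ISO-8859-1 from ISO-8859-15.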

Re: UTF-8 problem

on 24.08.2007 10:06:52 by Joe Smith

Todor Vachkov wrote:

> parser error : Input is not proper UTF-8, indicate encoding !
> I thought the solution would be:
>
> open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');

It's the other way around.

open (my $fh, "< :encoding(something_that_is_not_UTF-8)", ...);

Perl is assuming the file is UTF-8, but it's not; therefore you need
to tell Perl what encoding the file is actually using.

-Joe
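
For a plain text file, naming the real encoding in the `open` layer could look like the sketch below. It assumes the data is ISO-8859-1, which the thread only suspects; the sub name `slurp_as` and the temporary sample file are invented for the demonstration. For XML that is handed to a parser, a later reply in this thread recommends `:raw` instead.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a whole file, decoding it with the given encoding layer.
sub slurp_as {
    my ($path, $encoding) = @_;
    open my $fh, "<:encoding($encoding)", $path
        or die "Cannot open $path: $!";
    local $/;    # slurp mode: read everything in one go
    my $text = <$fh>;
    close $fh;
    return $text;
}

# Write a small Latin-1 sample; 0xE4 is "a with diaeresis" as a single byte.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;
print {$out} "M\xE4rz\n";
close $out or die "Cannot close $path: $!";

# Decoded with the matching layer, the byte becomes the character U+00E4.
my $text = slurp_as($path, 'iso-8859-1');
printf "second character: U+%04X\n", ord substr($text, 1, 1);
```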

Re: UTF-8 problem

on 25.08.2007 23:04:38 by hjp-usenet2

On 2007-08-21 22:23, Todor Vachkov wrote:
> Hello all,
>
> I'm trying to convert an exported xml file into a perl data structre with the XML::LibXML modul.
> Thus I got this error message:
>
>>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>>encoding !
>>Bytes: 0xE2 0x26 0x6C 0x74
>
> I thought the solution would be:
>
>>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');

Don't do this. XML files contain an indication of their encoding, so you
should treat them as binary files

open(my $fh, "< :raw" ,'/foodir/export.xml');

and let the XML parser do the rest.

If that doesn't work, the encoding stored in the file is probably
wrong, either because the generating software was buggy or because
someone already converted the file incorrectly. You may have luck by
fixing the encoding (it should be in the first line, which looks like
this:

<?xml version="1.0" encoding="..."?>

If the encoding is missing, UTF-8 is assumed).

--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Sysadmin WSR | with an emu on his shoulder.
| | | hjp@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"
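
The two steps described in this reply can be sketched as follows. The sub names `declared_encoding` and `parse_raw` are made up; the declaration check is a simplification that only handles ASCII-compatible encodings, and `parse_raw` assumes XML::LibXML is installed (it is loaded lazily so the first helper works without it).

```perl
use strict;
use warnings;

# Extract the encoding named in an XML declaration, given the first bytes
# of a document. Falls back to UTF-8 when no encoding is declared.
# Simplification: UTF-16/UTF-32 documents are not handled.
sub declared_encoding {
    my ($head) = @_;
    if ($head =~ /\A(?:\xEF\xBB\xBF)?<\?xml\s[^>]*?\bencoding\s*=\s*(["'])([A-Za-z][A-Za-z0-9._-]*)\1/) {
        return $2;
    }
    return 'UTF-8';
}

# Parse a file opened as raw bytes, leaving all decoding to the parser.
sub parse_raw {
    my ($path) = @_;
    require XML::LibXML;
    open my $fh, '<:raw', $path or die "Cannot open $path: $!";
    my $dom = XML::LibXML->load_xml(IO => $fh);
    close $fh;
    return $dom;
}

print declared_encoding(qq{<?xml version="1.0" encoding="ISO-8859-1"?>\n<r/>}), "\n";
print declared_encoding(qq{<?xml version="1.0"?>\n<r/>}), "\n";
```

If the declared encoding does not match the actual bytes, the parser will still fail; in that case the declaration (or the data) has to be corrected first.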

Re: UTF-8 problem

on 26.08.2007 00:59:10 by Todor Vachkov

Peter J. Holzer wrote:

> On 2007-08-21 22:23, Todor Vachkov wrote:
>> Hello all,
>>
>> I'm trying to convert an exported xml file into a perl data structre with
>> the XML::LibXML modul. Thus I got this error message:
>>
>>>Entity: line 315442: parser error : Input is not proper UTF-8, indicate
>>>encoding !
>>>Bytes: 0xE2 0x26 0x6C 0x74
>>
>> I thought the solution would be:
>>
>>>open(my $fh, "< :encoding(utf8)" ,'/foodir/export.xml');
>
> Don't do this. XML-files contain an indication of their encoding, you
> should treat them as binary files
>
> open(my $fh, "< :raw" ,'/foodir/export.xml');
>
> and let the XML parser do the rest.
>
> It that doesn't work, the encoding stored in the file is probably
> wrong, either because the generating software was buggy or because
> someone already incorrectly converted the file. You may have luck by
> fixing the encoding (it should be in the first line which looks like
> this:
>
>
>
> If the encoding is missing, UTF-8 is assumed).
>
Thanks for your reply Peter!

I'm now using XML::Smart, so I don't have the UTF-8 problem anymore.
The file has the declaration

As I already mentioned, it contains source code from Perl scripts, and I
found out that some of them are ISO-8859-1 encoded. The German umlauts in particular caused trouble, as you know ;)

Greetings,
Todor
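
For a file that mixes UTF-8 with ISO-8859-1 fragments, as described above, one possible repair before parsing is to normalise it line by line. This is a heuristic sketch, not something proposed in the thread: it assumes every line that is not well-formed UTF-8 is ISO-8859-1, and the name `line_to_utf8` and the sample data are invented.

```perl
use strict;
use warnings;
use Encode qw(decode encode FB_CROAK);

# Normalise one line of bytes to UTF-8: keep it if it already is
# well-formed UTF-8, otherwise assume it is ISO-8859-1 and transcode it.
sub line_to_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode() with FB_CROAK may modify its argument
    my $text = eval { decode('UTF-8', $copy, FB_CROAK) };
    $text = decode('ISO-8859-1', $bytes) unless defined $text;
    return encode('UTF-8', $text);
}

# Hypothetical input: the first line is UTF-8 (C3 A4), the second carries
# the same umlaut as a single Latin-1 byte (E4).
my $mixed = "ok: \xC3\xA4\n" . "legacy: \xE4\n";

# Split after each newline so the line endings are preserved.
my $fixed = join '', map { line_to_utf8($_) } split /(?<=\n)/, $mixed;
print $fixed eq "ok: \xC3\xA4\nlegacy: \xC3\xA4\n" ? "normalised\n" : "unexpected result\n";
```

A Latin-1 line whose high bytes happen to form valid UTF-8 would be left untouched by this approach, so the result is worth spot-checking, and the XML declaration has to name UTF-8 afterwards.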