Reading XML file - chars being dropped

on 17.11.2007 21:09:03 by mw

Hello people,

I have a PHP script parsing an XML file, and am having a problem when
the characterData read contains extended characters (such as é). The
ef_characterData function is the character data handler for the XML
parser, and when I feed it an XML file like the one below, the string
$ef['title'] only contains the string "é to the White House" - the first
few characters are lost for some reason.

If I try to echo the $data variable right at the place where the
assignment to $ef['title'] occurs, $data contains the entire string
("Attaché to the White House"). Seems like the assignment operator is
truncating the string.

I figure this is because of PHP's limitations with 256 chars, but does
anyone have a workaround?


function ef_characterData($parser, $data) {
    global $curTag, $ef;
    $titleKey = "^ROOT^TITLE";
    if ($curTag == $titleKey) $ef['title'] = $data;
}




The local Attaché to the white house



Thanks in advance!

MW

Re: Reading XML file - chars being dropped

on 17.11.2007 21:32:26 by Dikkie Dik

> If I try to echo the $data variable right at the place where the
> assignment to $ef['title'] occurs, $data contains the entire string
> ("Attaché to the White House"). Seems like the assignment operator is
> truncating the string.
>
> I figure this is because of PHP's limitations with 256 chars, but does
> anyone have a workaround?

What limitations? Strings can be "absurdly" long.

If you think the assignment is the problem, have you checked what
$ef['title'] contains directly before and after the assignment?

Re: Reading XML file - chars being dropped

on 17.11.2007 21:52:33 by mw

Dikkie Dik wrote:
>
> What limitations? Strings can be "absurdly" long.
>

By limitations I mean that the charset is 8-bit, with only 256 unique
chars. If my string has a character like the French accented "e", it can
lead to problems.

>
> If you think the assignment is the problem, have you tried what
> $ef['title'] is directly before and after the assignment?
>

$ef['title'] is empty before the assignment, and "é to the White House"
after assignment. The funny part is that if I echo both variables right
below the line where the assignment occurs, $data is "Attaché to the
White House" and $ef['title'] is "é to the White House"

MW

Re: Reading XML file - chars being dropped

on 18.11.2007 01:28:23 by Dikkie Dik

>> What limitations? Strings can be "absurdly" long.
> By limitations I mean that the charset is 8-bit, with only 256 unique
> chars. If my string has a character like the French accented "e", it can
> lead to problems.

Well, yes, but string handling should be binary-safe in recent versions
of PHP. I use UTF-8 a lot, and I have never run into that kind of problem.
The only thing I have to take care of is the fact that some characters
are represented by more than one byte.
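
To make the byte-versus-character point concrete, here is a small sketch. It assumes the text is UTF-8 and writes the "é" as explicit bytes so the result does not depend on how the script file itself is saved; the variable name is only for illustration.

```php
<?php
// "é" is written as its two UTF-8 bytes (0xC3 0xA9).
$word = "Attach\xC3\xA9";

// strlen() counts bytes, so the accented letter counts twice.
echo strlen($word), "\n";

// The last two bytes are the UTF-8 encoding of "é".
echo bin2hex(substr($word, -2)), "\n";

// mb_strlen() counts characters, but only exists if mbstring is loaded.
if (function_exists('mb_strlen')) {
    echo mb_strlen($word, 'UTF-8'), "\n";
}
```

The string itself is stored and assigned intact either way; only functions that count or index by byte see the difference.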

>> If you think the assignment is the problem, have you tried what
>> $ef['title'] is directly before and after the assignment?
> $ef['title'] is empty before the assignment, and "é to the White House"
> after assignment. The funny part is that if I echo both variables right
> below the line where the assignment occurs, $data is "Attaché to the
> White House" and $ef['title'] is "é to the White House"


That is really strange. I have never encountered anything like it. Does it
help (as an ugly workaround) to make it a reference assignment?
Like: $ef['title'] =& $data;

If so, it might help to "clone"-assign it to a non-array (local)
variable first and then "reference"-assign that local variable to
$ef['title'].

Just curious...

Re: Reading XML file - chars being dropped

on 18.11.2007 01:54:03 by mw

Problem solved with a strange workaround - I changed the assignment line
to $ef['title'].=$data. For some reason the assignment was happening in
two steps - the first step would transfer the part before the 'é' and
the second would transfer the rest. By adding the concatenate operator I
bypassed the issue.

MW
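
For anyone reading along, here is a minimal sketch of the append-style handler, not MW's actual script. The start and end handlers, the way $curTag is built, and the sample document are assumptions made up for illustration; only ef_characterData, $curTag, $ef and the "^ROOT^TITLE" key come from the post. The parser is allowed to deliver one run of text in several handler calls, so appending (and resetting the buffer when the element opens) collects the whole string however it is split.

```php
<?php
// Hypothetical setup: $curTag holds a "^"-separated path of open elements.
$curTag = '';
$ef = ['title' => ''];

function ef_startElement($parser, $name, $attrs) {
    global $curTag, $ef;
    // Case folding is on by default, so $name arrives in upper case.
    $curTag .= '^' . $name;
    if ($curTag === '^ROOT^TITLE') {
        $ef['title'] = '';      // start with an empty buffer for each title
    }
}

function ef_endElement($parser, $name) {
    global $curTag;
    // Drop the last "^NAME" segment from the path.
    $curTag = substr($curTag, 0, strrpos($curTag, '^'));
}

function ef_characterData($parser, $data) {
    global $curTag, $ef;
    if ($curTag === '^ROOT^TITLE') {
        $ef['title'] .= $data;  // append: the text may arrive in pieces
    }
}

// Assumed sample document; "é" is given as explicit UTF-8 bytes.
$xml = "<root><title>The local Attach\xC3\xA9 to the white house</title></root>";

$parser = xml_parser_create('UTF-8');
xml_set_element_handler($parser, 'ef_startElement', 'ef_endElement');
xml_set_character_data_handler($parser, 'ef_characterData');

if (!xml_parse($parser, $xml, true)) {
    die(xml_error_string(xml_get_error_code($parser)));
}

echo $ef['title'], "\n";
```

With a plain "=" in ef_characterData, each call overwrites the previous piece, which would leave only the last chunk in $ef['title'].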

Re: Reading XML file - chars being dropped

on 21.11.2007 00:32:09 by mw

Revisiting this problem, I have discovered that the ef_characterData
function is called twice by the parser, once with the part of the string
before the "é" and then again with the rest of the string (including the é).

While I investigate, I think the problem is because my XML file is
external. Before I feed it to the parser, I am reading it into a
variable using file_get_contents() - I think the assignment here is
creating the problem.

Will keep you guys posted, but if anybody has a similar problem can you
let me know?

MW

Re: Reading XML file - chars being dropped

on 21.11.2007 00:47:10 by mw

Followup:

Yes, the initial XML reading is the problem. The following code fixes
the problem:

$xml_data=file_get_contents($ef_url) or die("could not open XML input");
if (!mb_check_encoding($xml_data, "US-ASCII"))
    $data=mb_convert_encoding($xml_data, "US-ASCII");

The obvious problem being, of course, that all non-ASCII characters now
appear as "??" in the output. Using ISO-8859-1 instead doesn't help with
the duplicate call to the character data handler.

MW
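
A short sketch of why the conversion above loses characters, assuming the mbstring extension is loaded and the input is UTF-8. The variable names are hypothetical. US-ASCII has no "é", so mb_convert_encoding() substitutes a placeholder for anything it cannot represent, and how many placeholders appear depends on which source encoding it assumes.

```php
<?php
// "é" as explicit UTF-8 bytes.
$title = "Attach\xC3\xA9";

// False here: the string contains bytes outside the ASCII range.
$isAscii = mb_check_encoding($title, 'US-ASCII');

// Treating the input as UTF-8: "é" is one character with no ASCII equivalent.
$fromUtf8 = mb_convert_encoding($title, 'US-ASCII', 'UTF-8');

// Treating the same bytes as ISO-8859-1: each of the two bytes is its own
// character, and neither has an ASCII equivalent.
$fromLatin1 = mb_convert_encoding($title, 'US-ASCII', 'ISO-8859-1');

var_dump($isAscii);
echo $fromUtf8, "\n";
echo $fromLatin1, "\n";
```

The conversion makes the text plain ASCII before it reaches the parser, but the original characters cannot be recovered afterwards.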

