Byte Order Mark mucks up headers

am 07.10.2004 11:11:54 von phil.archer

Hi,

I've read Sean Burke's book, I've looked through the archives of this list
and done other searches but can't find an answer to a problem I have found
with LWP. If the character coding for a website has a byte order mark
(things like utf-16, all that "big endian/little endian" stuff) then LWP
can't interpret HTML headers in the usual way. Does anyone know a way around
this?

Background:

I work for an organisation called ICRA. We provide a self-labelling and
filtering system for the web, currently based on the old PICS standard but
soon to move to RDF. A couple of years ago I built a tool for our website
that visits a site, checks for PICS labels and parses them if found. Now, I
can strip out the BOM from the content where found and do other clunky
processing but that would mean I can't use LWP's efficient header commands.
For sites without a BOM I can just get header->('Pics-label') and process
that.

You can see the label tester at www.icra.org/label/tester/

An example of a site with a BOM that shows as unlabelled even though it is:
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.xtranslations.com&showHead=on&showContent=on

An example of a site with a label without a BOM (i.e. one that works as it
should) would be
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.yahoo.com&showHead=on&showContent=on

Any help gratefully accepted.

Phil.

Phil Archer
Chief Technical Officer
Internet Content Rating Association
Label your site today at http://www.icra.org

Re: Byte Order Mark mucks up headers

am 07.10.2004 11:26:51 von gisle

"Phil Archer" writes:

> I've read Sean Burke's book, I've looked through the archives of this
> list and done other searches but can't find an answer to a problem I
> have found with LWP. If the character coding for a website has a byte
> order mark (things like utf-16, all that "big endian/little endian"
> stuff) then LWP can't interpret HTML headers in the usual way. Does
> anyone know a way around this?

HTML::HeadParser needs to be fixed. It will assume that there is no
<head> section when it sees text before anything else. The part of
the code responsible for this currently allows whitespace, but needs
to be taught that a BOM is harmless too. Look at the 'text' method.

Do you want to try to provide a patch?

Regards,
Gisle
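
The check described above can be sketched as follows. This is only an illustration of the idea, not the actual HTML::HeadParser code: the helper name is_significant_text is hypothetical, and it assumes the BOM arrives either as the decoded character U+FEFF or as the raw UTF-8 bytes EF BB BF.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::HeadParser): returns 1 when $text
# holds something other than an optional leading BOM plus whitespace.
sub is_significant_text {
    my ($text) = @_;

    # Drop one leading BOM: the decoded character U+FEFF, or the raw
    # UTF-8 byte sequence EF BB BF.
    $text =~ s/\A(?:\x{FEFF}|\xEF\xBB\xBF)//;

    # Whatever is left counts as body text only if it is not all whitespace.
    return $text =~ /\S/ ? 1 : 0;
}

print is_significant_text("\xEF\xBB\xBF\n  "), "\n";   # BOM followed by whitespace
print is_significant_text("Hello"), "\n";              # ordinary text
```

A BOM is not matched by \s, which is why a plain /\S/ test treats it as the start of the body.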

Re: Byte Order Mark mucks up headers

am 07.10.2004 13:13:35 von phil.archer

Thanks for this Gisle.

Now I know where to start, I'll take a look at the code and see if it's
within my abilities to patch it. I notice that Doc types and XML
declarations are OK so it must be choosy about what it allows and disallows
(sensibly enough!)

Phil.

Re: Byte Order Mark mucks up headers

am 07.10.2004 15:42:29 von phil.archer

OK, I've done some digging and testing - and my "poking in the dark" hasn't
come up with the magic answer (mainly because I'm an amateur code-writer
who's a few fathoms out of his depth here).

In HTML::HeadParser I tried adding another regEx substitution to the
flush_text routine - no effect.

Eventually I went back to the main HTML::Parser and tried changing that. The
parse_file routine is the main guts of the module I believe? I tried
stripping out any BOMs found as the chunks of data were read in but again,
no effect.

Incidentally, I found a routine on the W3C site that I was able to quickly
adapt to detect/strip BOMs (see below). What I can't see, I'm sorry to say,
is where it should go in. Again, a cry for help!

Routine for detecting/stripping BOMs: (see
http://dev.w3.org/cvsweb/p3p-validator/20001215/xml.pl?rev=1.5)

sub check_bom {
    my $content = shift;

    # Numeric values of the first four bytes (-1 where the content is shorter)
    my ($top1, $top2, $top3, $top4) =
        map { length($content) > $_ ? ord(substr($content, $_, 1)) : -1 } 0 .. 3;

    # UTF-8
    if ($top1 == 239 && $top2 == 187 && $top3 == 191) {
        return substr($content, 3);
    }

    # UTF-32 must be tested before UTF-16: the UTF-32 little endian BOM
    # (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE).

    # UTF-32 little endian
    if ($top1 == 255 && $top2 == 254 && $top3 == 0 && $top4 == 0) {
        return substr($content, 4);
    }

    # UTF-32 big endian (00 00 FE FF)
    if ($top1 == 0 && $top2 == 0 && $top3 == 254 && $top4 == 255) {
        return substr($content, 4);
    }

    # UTF-16 little endian
    if ($top1 == 255 && $top2 == 254) {
        return substr($content, 2);
    }

    # UTF-16 big endian
    if ($top1 == 254 && $top2 == 255) {
        return substr($content, 2);
    }

    # No BOM found: return the content unchanged
    return $content;
}
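
On the question of where such stripping could go when data arrives in chunks: one option is to strip only the first non-empty chunk before it reaches the parser. The following is a minimal sketch under that assumption. The name make_bom_stripper is hypothetical, it is not part of HTML::Parser, and it does not handle a BOM that is split across two chunks.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a closure that removes a BOM from the first
# non-empty chunk it is given and passes every later chunk through unchanged.
sub make_bom_stripper {
    my $first = 1;
    return sub {
        my ($chunk) = @_;
        return $chunk unless $first && length $chunk;
        $first = 0;

        # Longest signatures first, so that UTF-32 little endian
        # (FF FE 00 00) is not mistaken for UTF-16 little endian (FF FE).
        $chunk =~ s/\A(?:\x00\x00\xFE\xFF|\xFF\xFE\x00\x00|\xEF\xBB\xBF|\xFE\xFF|\xFF\xFE)//;
        return $chunk;
    };
}

my $strip  = make_bom_stripper();
my @chunks = ("\xEF\xBB\xBF<html><head>", "<title>t</title>");
print $strip->($_) for @chunks;
print "\n";
```

Removing the BOM is only directly useful for UTF-8 input. UTF-16 and UTF-32 content would still need decoding before a byte-oriented parser could make sense of it.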

Phil.

Re: Byte Order Mark mucks up headers

am 07.10.2004 19:15:28 von ville.skytta

On Thu, 2004-10-07 at 16:42, Phil Archer wrote:

> Incidentally, I found a routine on the W3C site that I was able to quickly
> adapt to detect/strip BOMs (see below).

There's also http://search.cpan.org/dist/File-BOM/ which might or might
not be useful. If you're looking for even more alternatives, the W3C
Markup Validator also has some BOM stuff in the main "check" script...

Suggested resolution (was Re: Byte Order Mark mucks up headers)

am 18.10.2004 17:42:46 von phil.archer

Dear all,

A couple of weeks ago I raised an issue about Byte Order Marks effectively
disabling the header parsing functions. Thanks again to those who took the
trouble to reply. After a bit of poking around in the dark I've fixed it
for my own needs - I leave it to others to judge whether this is robust
enough for general usage, especially since I have only carried out cursory
testing.

Within HTML::HeadParser there is a routine called text like this:

sub text
{
    my($self, $text) = @_;
    print "TEXT[$text]\n" if $DEBUG;
    my $tag = $self->{tag};
    if (!$tag && $text =~ /\S/) {
        # Normal text means start of body
        $self->eof;
        return;
    }
    return if $tag ne 'title';
    $self->{'text'} .= $text;
}

This is where the byte order mark is detected and the process stops since
it's text outside any tag. So I've just added an extra term to the if
statement thus:

sub text
{
    my($self, $text) = @_;
    print "TEXT[$text]\n" if $DEBUG;
    my $tag = $self->{tag};
    if (!$tag && $text =~ /\S/ && !BOM($text)) {
        # Normal text means start of body
        $self->eof;
        return;
    }
    return if $tag ne 'title';
    $self->{'text'} .= $text;
}

And defined a little routine thus:

sub BOM {
    my $text = shift;

    # Numeric values of the first four bytes (-1 where the text is shorter)
    my ($top1, $top2, $top3, $top4) =
        map { length($text) > $_ ? ord(substr($text, $_, 1)) : -1 } 0 .. 3;

    # UTF-8
    if ($top1 == 239 && $top2 == 187 && $top3 == 191) {
        return 'UTF-8';
    }

    # UTF-32 must be tested before UTF-16: the UTF-32 little endian BOM
    # (FF FE 00 00) begins with the UTF-16 little endian BOM (FF FE).

    # UTF-32 little endian
    if ($top1 == 255 && $top2 == 254 && $top3 == 0 && $top4 == 0) {
        return 'UTF-32 little endian';
    }

    # UTF-32 big endian (00 00 FE FF)
    if ($top1 == 0 && $top2 == 0 && $top3 == 254 && $top4 == 255) {
        return 'UTF-32 big endian';
    }

    # UTF-16 little endian
    if ($top1 == 255 && $top2 == 254) {
        return 'UTF-16 little endian';
    }

    # UTF-16 big endian
    if ($top1 == 254 && $top2 == 255) {
        return 'UTF-16 big endian';
    }

    # No BOM found
    return 0;
}

This is an adaptation of a routine found at
http://dev.w3.org/cvsweb/p3p-validator/20001215/xml.pl?rev=1.5.

I have not been able to test this on any BOMs other than UTF-8. If you use a
BOM other than that, I'd be very pleased to hear of it.
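
Since only the UTF-8 case has been exercised against a live site, the other signatures could be checked with synthetic byte strings instead. A minimal sketch, assuming a table-driven helper named detect_bom (hypothetical, not part of LWP or HTML::HeadParser) and the standard Unicode BOM byte sequences:

```perl
use strict;
use warnings;

# BOM signatures as raw bytes, longest first so that UTF-32 little endian
# is tried before UTF-16 little endian, whose two bytes it also starts with.
my @BOMS = (
    [ 'UTF-32 big endian'    => "\x00\x00\xFE\xFF" ],
    [ 'UTF-32 little endian' => "\xFF\xFE\x00\x00" ],
    [ 'UTF-8'                => "\xEF\xBB\xBF"     ],
    [ 'UTF-16 big endian'    => "\xFE\xFF"         ],
    [ 'UTF-16 little endian' => "\xFF\xFE"         ],
);

# Hypothetical helper: returns the encoding name, or '' when no BOM is found.
sub detect_bom {
    my ($bytes) = @_;
    for my $entry (@BOMS) {
        my ($name, $sig) = @$entry;
        return $name if substr($bytes, 0, length $sig) eq $sig;
    }
    return '';
}

# Feed each signature, followed by some markup, back through the detector.
for my $entry (@BOMS) {
    my ($name, $sig) = @$entry;
    my $got = detect_bom($sig . "<html>");
    printf "%-22s %s\n", $name, $got eq $name ? 'ok' : "FAILED (got '$got')";
}
```

This only confirms that the byte comparisons agree with the signatures. It says nothing about how the parser copes with the UTF-16 or UTF-32 content that follows the BOM.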

The changes are in place on the ICRA label tester:
www.icra.org/label/tester/ (this looks for PICS labels in the headers, hence
the importance of this bit of LWP for me!). In the original e-mail for this
thread I gave the following two examples, both of which now work correctly:

An example of a site with a BOM that previously showed as having no headers
but is now OK:
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.xtranslations.com&showHead=on&showContent=on

An example of a site with a label without a BOM (that still works as it
should!)
http://www.icra.org/cgi-bin/labelTester.cgi?lang=EN&url=http%3A%2F%2Fwww.yahoo.com&showHead=on&showContent=on

Phil Archer
Chief Technical Officer
Internet Content Rating Association
Label your site today at http://www.icra.org
