LWP: Warning with utf8 data in HTML head section

on 03.08.2006 00:47:56 by libwww

There seems to be a bug in LWP that causes HTML::HeadParser to emit a
warning for fetched web documents that contain UTF-8 encoded data in
the HTML head section.

Example:

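use strict;
use warnings;
use 5.008;
use LWP::UserAgent;

# Test page whose HTML head contains UTF-8 encoded data
my $url = 'http://perlmeister.com/test/utf8.html';
my $ua = LWP::UserAgent->new();
# The warning is emitted while the response is being fetched
my $res = $ua->get($url);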

This snippet triggers the warning

Parsing of undecoded UTF-8 will give garbage when decoding
entities at /home/y/lib/perl5/site_perl/5.8/LWP/Protocol.pm line
114.

with LWP-5.805 and HTML-Parser-3.55.

HTML::HeadParser issues this warning if it finds UTF-8 encoded data
but the string handed in doesn't have Perl's internal UTF-8 flag set.
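
To make the distinction concrete, here is a minimal sketch using only
core modules. It does not call LWP or HTML::HeadParser; it just shows
the difference between raw octets and a decoded string. The sample
markup and the flag_state helper are made up for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw octets as they might arrive from a server: the "ue" umlaut
# encoded as UTF-8 (0xC3 0xBC). Sample markup is an assumption.
my $octets = "<title>M\xC3\xBCnchen</title>";

# Decoding turns the octets into a character string, which sets the flag.
my $chars = decode('UTF-8', $octets);

# Hypothetical helper, only used to print the state of the flag.
sub flag_state { return utf8::is_utf8($_[0]) ? 'on' : 'off' }

printf "octets: length %d, UTF-8 flag %s\n",
    length($octets), flag_state($octets);
printf "chars:  length %d, UTF-8 flag %s\n",
    length($chars), flag_state($chars);
```

The undecoded string is one element longer because the two octets of
the umlaut are counted separately; that undecoded form is what the
parser complains about.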

Setting the UTF-8 flag on web server responses whose HTTP Content-Type
header indicates UTF-8 content, as in 'text/html; charset=utf-8',
seems to be one possible solution. However, the same declaration might
instead appear in the HTML head section, which HTML::HeadParser is
supposed to parse:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

in which case the warning probably needs to be suppressed until
HTML::HeadParser is done and has determined whether the HTML head
contains such a setting.
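
The "check the HTTP header first, then the HTML head" order could be
sketched roughly as follows. This is not how LWP does it; all three
helper names are hypothetical, the meta tag is found with a simple
regex rather than a real parser, and only core modules are used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: pull a charset out of a string such as
# 'text/html; charset=utf-8'. Returns undef when none is declared.
sub charset_from_content_type {
    my ($value) = @_;
    return undef unless defined $value;
    return $value =~ /charset\s*=\s*["']?([\w.:-]+)/i ? lc $1 : undef;
}

# Hypothetical helper: prefer the HTTP header, then fall back to a
# <meta http-equiv="Content-Type"> tag found in the raw octets.
sub detect_charset {
    my ($http_content_type, $octets) = @_;
    my $charset = charset_from_content_type($http_content_type);
    return $charset if defined $charset;
    if ($octets =~ /(<meta\b[^>]*http-equiv\s*=\s*["']?content-type\b[^>]*>)/i) {
        return charset_from_content_type($1);
    }
    return undef;
}

# Hypothetical helper: decode only when UTF-8 was declared somewhere.
sub decode_if_utf8 {
    my ($http_content_type, $octets) = @_;
    my $charset = detect_charset($http_content_type, $octets);
    return $octets unless defined $charset && $charset =~ /^utf-?8$/;
    return decode('UTF-8', $octets);
}

# Sample document (made up): charset declared only in the HTML head.
my $body = qq{<html><head><meta http-equiv="Content-Type" }
         . qq{content="text/html; charset=utf-8">}
         . qq{<title>M\xC3\xBCnchen</title></head></html>};

my $text = decode_if_utf8('text/html', $body);
print utf8::is_utf8($text) ? "decoded\n" : "left as octets\n";
```

With the sample document above, the HTTP header carries no charset, so
the declaration in the head decides and the body gets decoded. A
document with no declaration in either place is returned unchanged.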

-- Mike

Mike Schilli
m@perlmeister.com