LWP & UTF-8.
on 05.06.2006 17:17:01 by podbelsky
Howdy,
While creating a bot for LJ I ran into the following error:
"Parsing of undecoded UTF-8 will give garbage when decoding entities at D:/perl.5.8.8/site/lib/LWP/Protocol.pm line 114."
Here's the piece of code, which reproduces the error:
###
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Cookies;

my $lj_url = 'http://www.livejournal.com/';
my $cj = HTTP::Cookies->new();
my $ua = LWP::UserAgent->new(agent => 'Howdy?', cookie_jar => $cj);
# Headers sent with every request made through this user agent.
$ua->default_header('Accept-Language' => 'ru, en',
                    'Accept-Charset' => 'utf-8;q=1, *;q=0.1',
                    'Referer' => $lj_url);
print "Getting the login form...\n";
my $res = $ua->get($lj_url . 'login.bml?nojs=1');   # the warning is emitted here
exit;
The POD of HTML::HeadParser says:
"Note that the HTML::HeadParser might get confused if raw undecoded
UTF-8 is passed to the parse() method. Make sure the strings are
properly decoded before passing them on."
And the error seems to come from this line in LWP's Protocol.pm:
114: $parser->parse($$content) or undef($parser);
If I change it so that the content gets decoded before being parsed:
$parser->parse(decode_utf8($$content)) or undef($parser);
the error message fades away.
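
To illustrate the idea outside of LWP, here is a minimal sketch that decodes UTF-8 octets into a character string before handing them to HTML::HeadParser. The sample page and the variable names are made up for illustration; this is not the code in Protocol.pm, just the same "decode first, then parse" step in isolation.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);
use HTML::HeadParser;

# Hypothetical sample page: raw UTF-8 octets as they would arrive off the wire.
# The title is a Cyrillic word followed by an entity and "LJ".
my $octets = "<html><head><title>"
           . "\xD0\x9F\xD1\x80\xD0\xB8\xD0\xB2\xD0\xB5\xD1\x82"
           . " &amp; LJ</title></head><body></body></html>";

my $parser = HTML::HeadParser->new;
$parser->parse(decode_utf8($octets));   # decode to characters before parsing
$parser->eof;

# The parsed <title> is exposed as a pseudo-header.
my $title = $parser->header('Title');

binmode(STDOUT, ':encoding(UTF-8)');    # the title contains wide characters
print "Title: $title\n";
```

The parser only ever sees a character string here, which is what the HTML::HeadParser documentation quoted above asks for.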
Is there something I'm doing wrong, or is it really a bug?
If it helps:
Perl version: 5.8.8
Binary build 817 [257965] provided by ActiveState http://www.ActiveState.com
Built Mar 20 2006 17:54:25
LWP version: $VERSION = "5.805"; # $Id: LWP.pm,v 1.149 2005/12/08 12:06:22 gisle Exp $
I got Perl from ActiveState; everything comes from that package.
Thanks in advance. (Shalom ebanats!)
Re: LWP & UTF-8.
on 06.06.2006 11:32:42 by gisle
"Подбельский В.В." writes:
> While creating a bot for LJ i've encountered with the following error:
> "Parsing of undecoded UTF-8 will give garbage when decoding entities at D:/perl.5.8.8/site/lib/LWP/Protocol.pm line 114."
>
> Here's the piece of code, which reproduces the error:
> ###
> my $lj_url = 'http://www.livejournal.com/';
> my $cj = HTTP::Cookies->new();
> my $ua = LWP::UserAgent->new(agent => 'Howdy?', cookie_jar => $cj);
> $ua->default_header('Accept-Language' => 'ru, en',
> 'Accept-Charset' => 'utf-8;q=1, *;q=0.1',
> 'Referer' => $lj_url);
> print "Getting the login form...\n";
> $res = $ua->get($lj_url . 'login.bml?nojs=1');
> exit;
>
> It's said in HTML::HeadParser's pod that:
> "Note that the HTML::HeadParser might get confused if raw undecoded
> UTF-8 is passed to the parse() method. Make sure the strings are
> properly decoded before passing them on."
>
> And error seems to be on this line in LWP's Protocol.pm:
> 114: $parser->parse($$content) or undef($parser);
> If i make a change, so the content gets decoded before being parsed:
> $parser->parse(decode_utf8($$content)) or undef($parser);
> the error message fades away.
>
> Is there's something i'm doing wrong or is it really a bug?
Yes, it's a bug. The data we feed the $parser here should really be
decoded in a similar way to what the 'decoded_content' method of
HTTP::Message provides. If you, for instance, send requests with
'Accept-Encoding: gzip' then LWP might end up feeding binary stuff to
the parser. What LWP needs is to set up some decoding pipeline that
can decode content as it is received in chunks. I have not gotten
around to it yet :)
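
For reference, here is a small sketch of the difference between the raw content and what 'decoded_content' returns. The response is built by hand (a hypothetical page, no network involved), so it only shows the charset step on a complete message; it does not show the chunk-wise pipeline described above, which LWP does not have yet.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);
use HTTP::Response;

# Hypothetical page with a six-character Cyrillic title, encoded as UTF-8 octets.
my $text  = "\x{41f}\x{440}\x{438}\x{432}\x{435}\x{442}";
my $bytes = encode_utf8("<html><head><title>$text</title></head></html>");

# Build the response by hand so the example needs no network access.
my $res = HTTP::Response->new(200, 'OK',
    [ 'Content-Type' => 'text/html; charset=utf-8' ], $bytes);

my $raw     = $res->content;           # undecoded octets, as received
my $decoded = $res->decoded_content;   # Content-Encoding and charset undone

printf "raw length: %d, decoded length: %d\n",
    length($raw), length($decoded);
```

The raw string is longer because each Cyrillic character takes two octets in UTF-8, while the decoded string counts characters.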
A workaround might be to just disable this head-parsing business, with:
$ua = LWP::UserAgent->new(...., parse_head => 0);
or by calling:
$ua->parse_head(0);
after the $ua object has been constructed. The most important
downside is that the $response->base might not be accurate, but you
might not care about that.
--Gisle