libwww throws warning when used with HTML::Parser 3.44

on 03.01.2005 23:22:31 by kdebisschop

I just plugged in HTML::Parser 3.44 on my FC2 servers in order to handle
utf-8 encoded content. (Boy was I glad to see that was available)

But when running a robot, LWP::Protocol emits a warning as it works
because the content stream is not decoded into perl's native character set.

One possible fix is:


@@ -109,8 +110,14 @@

if (!defined($arg) || !$response->is_success) {
+ my $encoding;
+ if ($response->header("Content-Type") && $response->header("Content-Type") =~ m/;\s*charset\s*=\s*(\S+)\s*$/i) {
+ $encoding = $1;
+ } else {
+ $encoding = "iso-8859-1";
+ }
# scalar
while ($content = &$collector, length $$content) {
if ($parser) {
- $parser->parse($$content) or undef($parser);
+ $parser->parse(decode($encoding,$$content)) or undef($parser);
}
LWP::Debug::debug("read " . length($$content) . " bytes");


I have attached this as a patch and a unified diff, which you may use as
you see fit.

--
K

Attachment: Protocol.diff

--- /usr/lib/perl5/vendor_perl/5.8.3/LWP/Protocol.pm 2004-04-09 11:36:52.000000000 -0400
+++ Protocol.pm 2005-01-03 16:58:23.000000000 -0500
@@ -9,4 +9,5 @@
use strict;
use Carp ();
+use Encode;
use HTTP::Status ();
use HTTP::Response;
@@ -109,8 +110,14 @@

if (!defined($arg) || !$response->is_success) {
+ my $encoding;
+ if ($response->header("Content-Type") && $response->header("Content-Type") =~ m/;\s*charset\s*=\s*(\S+)\s*$/i) {
+ $encoding = $1;
+ } else {
+ $encoding = "iso-8859-1";
+ }
# scalar
while ($content = &$collector, length $$content) {
if ($parser) {
- $parser->parse($$content) or undef($parser);
+ $parser->parse(decode($encoding,$$content)) or undef($parser);
}
LWP::Debug::debug("read " . length($$content) . " bytes");

Attachment: Protocol.patch

10a11
> use Encode;
110a112,117
> my $encoding;
> if ($response->header("Content-Type") && $response->header("Content-Type") =~ m/;\s*charset\s*=\s*(\S+)\s*$/i) {
> $encoding = $1;
> } else {
> $encoding = "iso-8859-1";
> }
114c121
< $parser->parse($$content) or undef($parser);
---
> $parser->parse(decode($encoding,$$content)) or undef($parser);

Re: libwww throws warning when used with HTML::Parser 3.44

on 03.01.2005 23:41:33 by perl

In <41D9C5A7.7000201@alert.infoplease.com> on 03 Jan 2005,
Karl DeBisschop wrote:
> I just plugged in HTML::Parser 3.44 on my FC2 servers in order to
> handle utf-8 encoded content. (Boy was I glad to see that was
> available)

This patch should be modified to allow people with older versions of
perl to keep using them with LWP. I believe that only 5.8 has the
Encode module (a welcome addition, in my opinion) and that the Encode
compatibility module only works with certain older perl versions.

--
Charles C. Fu
Founder
Web i18n, LLC
www.web-i18n.net

Re: libwww throws warning when used with HTML::Parser 3.44

on 04.01.2005 03:15:00 by derhoermi

* Karl DeBisschop wrote:
>I just plugged in HTML::Parser 3.44 on my FC2 servers in order to handle
>utf-8 encoded content. (Boy was I glad to see that was available)
>
>But when running a robot, LWP::Protocol emits a warning as it works
>because the content stream is not decoded into perl's native character set.

See http://www.nntp.perl.org/group/perl.libwww/6017 and the relevant
thread for a recent discussion on this. Your patch has a number of
problems. Parsing the encoding out of the charset parameter is a bit
more difficult than your regular expression allows (e.g., the encoding
name might be a quoted-string, as in charset="utf-8"). The routine would
now croak in common cases such as an unsupported character encoding. It
also fails to deal with encodings such as ISO-2022-JP that maintain
state (see Encode::PerlIO), or where characters might be longer than
one octet, such as UTF-8 (consider that one chunk has "Bj\xC3" and the
next chunk has "\xB6rn": you need to know the \xC3 when decoding the \xB6).
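
To make the chunk-boundary and quoted-string points concrete, here is a
minimal sketch, not a proposed patch. The helper names
charset_from_content_type and make_chunk_decoder are hypothetical and
are not part of LWP. It relies on the documented behaviour of Encode's
FB_QUIET check mode, which returns what could be decoded and leaves the
unprocessed octets in the source buffer. It does not handle stateful
encodings such as ISO-2022-JP, and genuinely malformed input would
simply stay in the pending buffer.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: extract the charset parameter from a Content-Type
# value, accepting both the token form and the quoted-string form.
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;
    if ($content_type =~ /;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i) {
        return defined $1 ? $1 : $2;
    }
    return;
}

# Hypothetical helper: build a closure that decodes successive chunks and
# carries an incomplete trailing sequence over to the next call.
sub make_chunk_decoder {
    my ($charset) = @_;
    $charset = 'iso-8859-1' unless defined $charset;

    # find_encoding returns undef for an unknown name instead of dying,
    # so an unsupported charset falls back to iso-8859-1 (an assumption
    # made for this sketch, mirroring the default in the patch).
    my $enc = Encode::find_encoding($charset)
           || Encode::find_encoding('iso-8859-1');

    my $pending = '';
    return sub {
        my ($chunk) = @_;
        $pending .= $chunk;
        # FB_QUIET: decode as far as possible; whatever could not be
        # decoded yet (e.g. a lone \xC3) is left behind in $pending.
        return $enc->decode($pending, Encode::FB_QUIET);
    };
}

# Usage with the two chunks from the example above.
my $decode = make_chunk_decoder(
    charset_from_content_type('text/html; charset="utf-8"'));
my $text = $decode->("Bj\xC3") . $decode->("\xB6rn");
```

In a real collector loop, each decoded piece would be handed to
$parser->parse instead of being concatenated.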
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Re: libwww throws warning when used with HTML::Parser 3.44

on 04.01.2005 10:34:00 by christophe

Charles C. Fu wrote:
> I believe that only 5.8 has the Encode module

IIRC the Encode module came with 5.7,
but you can use Encode::compat with 5.6
(e.g. on a Debian stable/woody system).
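
One way to keep older perls working is to load Encode at run time and
fall back to passing the octets through untouched when it is missing.
This is only a sketch: the helper name maybe_decode is hypothetical,
and whether Encode::compat should be tried first on 5.6 is left open.

```perl
use strict;
use warnings;

# True when the Encode module could be loaded on this perl.
# (On 5.6 one could try Encode::compat before this; not shown here.)
my $HAVE_ENCODE = eval { require Encode; 1 } ? 1 : 0;

# Hypothetical helper: decode when Encode is available and knows the
# charset; otherwise return the octets unchanged.
sub maybe_decode {
    my ($charset, $octets) = @_;
    return $octets unless $HAVE_ENCODE;
    my $enc = Encode::find_encoding($charset) or return $octets;
    return $enc->decode($octets);
}
```

With this guard, a "use Encode;" line at the top of the module is not
needed, so the module still compiles where Encode is not installed.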

Christophe