libwww throws warning when used with HTML::Parser 3.44
on 03.01.2005 23:22:31 by kdebisschop
I just installed HTML::Parser 3.44 on my FC2 servers in order to handle
UTF-8 encoded content. (Boy, was I glad to see that was available.)
But when running a robot, LWP::Protocol emits a warning as it works, because
the content stream is handed to the parser without being decoded into Perl's
native character set.
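To make the distinction concrete, here is a minimal sketch of the difference between the raw octets that arrive off the wire and the decoded character string. The sample data is made up for illustration; only Encode's decode() is used.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Raw octets as they would arrive off the wire: "é" encoded as UTF-8
# (illustrative sample data, not taken from a real response).
my $octets = "\xC3\xA9";

# Decoding turns the two octets into a single Perl character.
my $chars = decode("UTF-8", $octets);

printf "octets: %d, characters: %d\n", length($octets), length($chars);
```

Passing $octets straight to the parser is the situation described above; passing $chars is what the fix aims for.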
One possible fix is:
@@ -109,8 +110,14 @@
if (!defined($arg) || !$response->is_success) {
+ my $encoding;
+ if ($response->header("Content-Type") && $response->header("Content-Type") =~ m/;\s*charset\s*=\s*(\S+)\s*$/i) {
+ $encoding = $1;
+ } else {
+ $encoding = "iso-8859-1";
+ }
# scalar
while ($content = &$collector, length $$content) {
if ($parser) {
- $parser->parse($$content) or undef($parser);
+ $parser->parse(decode($encoding,$$content)) or undef($parser);
}
LWP::Debug::debug("read " . length($$content) . " bytes");
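The charset lookup in the hunk above can also be sketched as a standalone helper. This is only an illustration, not part of the patch: charset_from_content_type is a hypothetical name, and the looser regex (quoted values, parameters after the charset) and the find_encoding() check for unknown names are assumptions about what a more defensive version might want.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# Hypothetical helper: pick a charset out of a Content-Type value,
# falling back to ISO-8859-1 as the patch does.
sub charset_from_content_type {
    my ($content_type) = @_;
    my $fallback = "iso-8859-1";
    return $fallback unless defined $content_type;

    # Accept charset=utf-8, charset="utf-8", and a charset that is
    # followed by further parameters.
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s;"]+)"?/i) {
        my $name = $1;
        # find_encoding returns undef for names Encode does not know,
        # in which case we fall through to the default.
        return $name if find_encoding($name);
    }
    return $fallback;
}

my $encoding = charset_from_content_type('text/html; charset="UTF-8"');
my $text     = decode($encoding, "caf\xC3\xA9");
print "$encoding: ", length($text), " characters\n";
```

Checking the name first avoids having decode() die on a charset label that Encode does not recognise.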
I have attached this as a patch and a unified diff, which you may use as
you see fit.
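One caveat with decoding inside the read loop: a multi-byte character can be split across two chunks. A possible way to handle that, sketched here as an assumption rather than as part of the attached patch, is to keep the undecoded tail and prepend it to the next chunk, using the Encode::FB_QUIET check mode described in the Encode documentation. make_chunk_decoder is a hypothetical name.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns a closure that decodes successive chunks,
# holding back any incomplete trailing character until the next call.
sub make_chunk_decoder {
    my ($encoding) = @_;
    my $pending = "";
    return sub {
        my ($chunk) = @_;
        $pending .= $chunk;
        # FB_QUIET stops at the first undecodable point and leaves the
        # unprocessed octets in $pending. Genuinely malformed input
        # would also stay there, so real code needs a way to give up.
        return decode($encoding, $pending, Encode::FB_QUIET());
    };
}

my $decode = make_chunk_decoder("UTF-8");

# "é" (0xC3 0xA9) is deliberately split across the two chunks.
my $text = join "", map { $decode->($_) } "caf\xC3", "\xA9!";
print length($text), " characters\n";
```

In the loop from the patch, the closure would take the place of the direct decode($encoding,$$content) call.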
--
K
Attachment: Protocol.diff
--- /usr/lib/perl5/vendor_perl/5.8.3/LWP/Protocol.pm 2004-04-09 11:36:52.000000000 -0400
+++ Protocol.pm 2005-01-03 16:58:23.000000000 -0500
@@ -9,4 +9,5 @@
use strict;
use Carp ();
+use Encode;
use HTTP::Status ();
use HTTP::Response;
@@ -109,8 +110,14 @@
if (!defined($arg) || !$response->is_success) {
+ my $encoding;
+ if ($response->header("Content-Type") && $response->header("Content-Type") =~ m/;\s*charset\s*=\s*(\S+)\s*$/i) {
+ $encoding = $1;
+ } else {
+ $encoding = "iso-8859-1";
+ }
# scalar
while ($content = &$collector, length $$content) {
if ($parser) {
- $parser->parse($$content) or undef($parser);
+ $parser->parse(decode($encoding,$$content)) or undef($parser);
}
LWP::Debug::debug("read " . length($$content) . " bytes");
Attachment: Protocol.patch
10a11
> use Encode;
110a112,117
> my $encoding;
> if ($response->header("Content-Type") && $response->header("Content-Type") =~ m/;\s*charset\s*=\s*(\S+)\s*$/i) {
> $encoding = $1;
> } else {
> $encoding = "iso-8859-1";
> }
114c121
< $parser->parse($$content) or undef($parser);
---
> $parser->parse(decode($encoding,$$content)) or undef($parser);