Fetching the charset when set in meta.
on 05.01.2007 22:50:06 by moseley
$response->headers->as_string shows this:
[...]
Content-Length: 17604
Content-Type: text/html
Content-Type: text/html;charset=UTF-8
Last-Modified: Mon, 23 Oct 2006 14:32:14 GMT
[...]
Is there a better way to grab the charset than this?
    my $charset;
    for ( $response->header('content-type') ) {
        $charset = $1 if /\bcharset=([^;]+)/;
    }
    my $content = Encode::decode( $charset, $response->content )
        if $charset;
I'm assuming that the last Content-type header found is the charset to
use.
my ($ct, $other) = $response->content_type;
is only returning the first Content-Type header.
How do others decode content fetched with LWP? Do you parse out the
charset and then decode it like above?
Doesn't that seem like something that should be part of LWP (or some
sub-class)?
use LWP::Simple;
$content = get( $url );
Shouldn't $content be decoded there?
--
Bill Moseley
moseley@hank.org
Re: Fetching the charset when set in meta.
on 08.01.2007 15:57:20 by gisle
On 1/5/07, Bill Moseley wrote:
> $response->headers->as_string shows this:
>
> [...]
> Content-Length: 17604
> Content-Type: text/html
> Content-Type: text/html;charset=UTF-8
> Last-Modified: Mon, 23 Oct 2006 14:32:14 GMT
> [...]
>
> Is there a better way to grab the charset than this?
>
> my $charset;
> for ( $response->header('content-type') ){
> $charset = $1 if /\bcharset=([^;]+)/;
> }
> my $content = Encode::decode( $charset, $response->content )
> if $charset;
LWP already provides a method to decode the content for you:
my $content = $response->decoded_content;
This method will also undo various Content-Encodings for you. If you
look at the source for decoded_content you will notice that it parses
the Content-Type header using:
    if (my @ct =
        HTTP::Headers::Util::split_header_words($self->header("Content-Type")))
    {
        ($ct, undef, %ct_param) = @{$ct[-1]};
        $ct = lc($ct);
    }
> I'm assuming that the last Content-type header found is the charset to
> use.
The code I quoted from decoded_content picks the last one.
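
To make the "last header wins" behaviour concrete, here is a minimal offline sketch. The response object and its body are made up for illustration (they only mimic the duplicate headers shown above), and the manual Encode::decode call stands in for what decoded_content would do; it is not the library's actual code path.

```perl
use strict;
use warnings;
use Encode ();
use HTTP::Response;
use HTTP::Headers::Util qw(split_header_words);

# Hypothetical response built by hand, with the same duplicate
# Content-Type headers as in the original question.
my $response = HTTP::Response->new(200, 'OK');
$response->headers->push_header('Content-Type' => 'text/html');
$response->headers->push_header('Content-Type' => 'text/html;charset=UTF-8');
$response->content("caf\xC3\xA9");    # raw UTF-8 bytes for "cafe" with an acute e

# In list context header() returns every Content-Type value;
# split_header_words() parses each one into [type, undef, key, value, ...].
my ($type, %param);
my @ct = split_header_words($response->header('Content-Type'));
($type, undef, %param) = @{ $ct[-1] } if @ct;    # keep only the last header

# Decode by hand only if a charset parameter was found.
my $charset = $param{charset};
my $text = defined $charset
    ? Encode::decode($charset, $response->content)
    : $response->content;
```

Here $charset comes from the second header, and $text holds four characters instead of the five raw bytes.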
>
> my ($ct, $other) = $response->content_type;
>
> is only returning the first Content-Type header.
>
>
> How do others decode content fetched with LWP? Do you parse out the
> charset and then decode it like above?
>
>
> Doesn't that seem like something that should be part of LWP (or some
> sub-class)?
>
> use LWP::Simple;
> $content = get( $url );
>
> Shouldn't $content be decoded there?
That would probably be an improvement, but it might also break some
code that depends on receiving the raw binary content as is.
--
Gisle Aas
Re: Fetching the charset when set in meta.
on 08.01.2007 18:58:38 by moseley
On Mon, Jan 08, 2007 at 03:57:20PM +0100, Gisle Aas wrote:
> LWP already provides a method to decode the content for you:
>
> my $content = $response->decoded_content;
Ah, thanks.
> This method will also undo various Content-Encodings for you. If you
> look at the source for decoded_content you will notice that it parses
> the Content-Type header using:
>
> if (my @ct =
>     HTTP::Headers::Util::split_header_words($self->header("Content-Type")))
> {
>     ($ct, undef, %ct_param) = @{$ct[-1]};
>     $ct = lc($ct);
> }
It would be handy to have a charset method.
A few questions.
Since $res->decoded_content decodes encodings, is there any code to
help set Accept-Encoding based on what you use for uncompressing the
content?
I was doing the uncompressing manually, and setting my
Accept-Encoding by first seeing what modules could be loaded.
Do you think it's safe for my code to test for Compress::Zlib and then
add "gzip, x-gzip, deflate" as Accept-Encoding?
I need to look back at my code, too, as I'm using
IO::Uncompress::RawInflate now for "deflate". I was using
Compress::Zlib::uncompress() but it wasn't working, and switching to
IO::Uncompress solved it. It must have been my implementation and not
the module after all, as you are using Compress::Zlib for deflate.
--
Bill Moseley
moseley@hank.org
Re: Fetching the charset when set in meta.
on 08.01.2007 20:11:54 by gisle
On 1/8/07, Bill Moseley wrote:
> On Mon, Jan 08, 2007 at 03:57:20PM +0100, Gisle Aas wrote:
>
> > This method will also undo various Content-Encodings for you. If you
> > look at the source for decoded_content you will notice that it parses
> > the Content-Type header using:
> >
> > if (my @ct =
> >     HTTP::Headers::Util::split_header_words($self->header("Content-Type")))
> > {
> >     ($ct, undef, %ct_param) = @{$ct[-1]};
> >     $ct = lc($ct);
> > }
>
> It would be handy to have a charset method.
Noted.
> A few questions.
>
> Since $res->decoded_content decodes encodings, is there any code to
> help set Accept-Encoding based on what you use for uncompressing the
> content?
No. One possible way would be to create a test HTTP::Response
object with an empty gzip body and then try to decode it:
    use LWP::UserAgent;
    require HTTP::Response;

    my $ua = LWP::UserAgent->new;
    $ua->default_headers->push_header(Accept_Encoding => "gzip, x-gzip")
        if defined(HTTP::Response->new(200, "OK",
            [Content_Encoding => "gzip"],
            pack("H*", "1f8b08089194a2450003780003000000000000000000"),
        )->decoded_content);
> I was doing the uncompressing manually, and setting my
> Accept-Encoding by first seeing what modules could be loaded.
>
> Think it's safe for my code to test for Compress::Zlib and then add
> "gzip, x-gzip, deflate" as Accept-Encoding?
I think so too.
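
The module-probing approach from the question could be sketched roughly as follows. This is only an illustration of the idea discussed here, not tested against any particular server; the variable name $have_zlib is made up for the example.

```perl
use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;

# Probe for Compress::Zlib without dying if it is missing.
my $have_zlib = eval { require Compress::Zlib; 1 };

# Only advertise compressed transfer when the content can be
# uncompressed locally afterwards.
$ua->default_header('Accept-Encoding' => 'gzip, x-gzip, deflate')
    if $have_zlib;
```

Later libwww-perl releases also document HTTP::Message::decodable() for building this header value, if the installed version has it.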
--Gisle
Re: Fetching the charset when set in meta.
on 12.01.2007 22:09:17 by moseley
On Mon, Jan 08, 2007 at 08:11:54PM +0100, Gisle Aas wrote:
> >It would be handy to have a charset method.
>
> Noted.
Also for the wish list:
It would be nice if the uncompressing was separate from decoding
the charset.
I'm spidering and then piping the output to another program. So, I
need to get the uncompressed and decoded content into the spider for
processing (and extracting links). That's where
$response->decoded_content;
comes in handy.
But, I also need to print the content back out. The content
needs to be re-encoded at this point.
I have two choices for encoding the output: one is to re-encode into
the original charset (hence $response->charset would be handy), or I
could always output utf8, but then I'd need to replace the charset in
any http-equiv meta tags to reflect the utf8 encoding.
And if uncompressing was a separate method from decoding I could just
print $response->uncompressed_content;
and print the content uncompressed, but still encoded in the original
encoding.
--
Bill Moseley
moseley@hank.org
Re: Fetching the charset when set in meta.
on 12.01.2007 22:12:10 by miyagawa
I had the same problem and wrote a new module, HTTP::Response::Charset,
that does exactly that, i.e. it adds a charset() method to the $response object.
http://use.perl.org/~miyagawa/journal/31250
On 1/12/07, Bill Moseley wrote:
> On Mon, Jan 08, 2007 at 08:11:54PM +0100, Gisle Aas wrote:
> > >It would be handy to have a charset method.
> >
> > Noted.
>
> Also for the wish list:
>
> It would be nice if the uncompressing was separate from decoding
> the charset.
>
> I'm spidering and then piping the output to another program. So, I
> need to get the uncompressed and decoded content into the spider for
> processing (and extracting links). That's where
>
> $response->decoded_content;
>
> comes in handy.
>
> But, I also need to print the content back out. The content
> needs to be re-encoded at this point.
>
> I have two choices for encoding the output: one is to re-encode into
> the original charset (hence $response->charset would be handy), or I
> could always output utf8, but then I'd need to replace the charset in
> any http-equiv meta tags to reflect the utf8 encoding.
>
> And if uncompressing was a separate method from decoding I could just
>
> print $response->uncompressed_content;
>
> and print the content uncompressed, but still encoded in the original
> encoding.
>
>
>
> --
> Bill Moseley
> moseley@hank.org
>
>
--
Tatsuhiko Miyagawa
Re: Fetching the charset when set in meta.
on 12.01.2007 23:23:43 by gisle
On 1/12/07, Bill Moseley wrote:
> And if uncompressing was a separate method from decoding I could just
>
> print $response->uncompressed_content;
>
> and print the content uncompressed, but still encoded in the original
> encoding.
This would be the same as:
print $response->decoded_content(charset => "none");
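
A small offline sketch of the difference, assuming a hand-built gzip response (the body and headers are invented for the example):

```perl
use strict;
use warnings;
use Compress::Zlib ();
use HTTP::Response;

# Hypothetical gzip-compressed response carrying UTF-8 encoded text.
my $bytes    = "caf\xC3\xA9";
my $response = HTTP::Response->new(
    200, 'OK',
    [
        'Content-Type'     => 'text/html; charset=UTF-8',
        'Content-Encoding' => 'gzip',
    ],
    Compress::Zlib::memGzip($bytes),
);

# Uncompressed and decoded into Perl characters, for link extraction etc.
my $text = $response->decoded_content;

# Uncompressed only: still the original bytes in the original charset,
# suitable for printing back out unchanged.
my $raw = $response->decoded_content(charset => 'none');
```

With this input $raw is byte-for-byte the original body, while $text is one character shorter because the two-byte UTF-8 sequence has been decoded.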
--
Gisle Aas
Re: Fetching the charset when set in meta.
on 13.01.2007 02:30:17 by moseley
On Fri, Jan 12, 2007 at 11:23:43PM +0100, Gisle Aas wrote:
> This would be the same as:
>
> print $response->decoded_content(charset => "none");
You mean just like the docs say? ;)
--
Bill Moseley
moseley@hank.org