Doing character encoding/decoding within libwww?

on 21.09.2007 21:49:26 by david

For most uses of libwww, developers do little with character encoding.
Indeed, for general-case use of LWP::Simple, they can't, because that
information isn't even exposed. Has any thought gone into doing this
internally within libwww, so that when I fetch content, I get back text
instead of octets?

I'd be happy to help work on some of this, but the fact that I see no use of
character encodings within libwww makes me wonder if this is more of a
policy decision not to do it.

David

Re: Doing character encoding/decoding within libwww?

on 22.09.2007 20:40:53 by derhoermi

* David Nesting wrote:
>For most uses of libwww, developers do little with character encoding.
>Indeed, for general-case use of LWP::Simple, they can't, because that
>information isn't even exposed. Has any thought gone into doing this
>internally within libwww, so that when I fetch content, I get back text
>instead of octets?

Generally speaking, this is rather difficult as some content may not be
textual at all, and textual formats vary in how applications are to
detect the encoding (e.g., XML has different rules than HTML, text/plain
has no rules beyond looking at the charset parameter, and so on). If you
want a general-purpose solution, a good start would be a module taking a
HTTP::Response object and detecting the encoding, possibly decoding it
on request.
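
The simplest rule mentioned above, looking at the charset parameter, can be sketched in a few lines. This is only an illustration: charset_from_content_type is a hypothetical helper, not part of libwww or HTML::Encoding, and it ignores the format-specific rules for HTML and XML.

```perl
use strict;
use warnings;

# Hypothetical helper: pull the charset parameter out of a Content-Type value.
# Returns the lower-cased charset name, or undef when none is declared.
sub charset_from_content_type {
    my ($content_type) = @_;
    return unless defined $content_type;

    # Accepts charset=utf-8 as well as charset="utf-8".
    if ($content_type =~ /;\s*charset\s*=\s*"?([^\s";]+)"?/i) {
        return lc $1;
    }
    return;
}

for my $header ('text/html; charset=ISO-8859-1', 'image/png') {
    my $charset = charset_from_content_type($header);
    printf "%-32s => %s\n", $header, defined($charset) ? $charset : 'undeclared';
}
```

When no charset is declared, the caller still has to fall back on whatever rules the media type defines.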

>I'd be happy to help work on some of this, but the fact that I see no
>use of character encodings within libwww makes me wonder if this is more
>of a policy decision not to do it.

There was a bit of a discussion to somehow use HTML::Encoding for some
parts of it, which pretty much solves the problem for HTML and XML, cf
the list archives. Help on improving HTML::Encoding would be welcome,
I have little time to work on it at the moment.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Re: Doing character encoding/decoding within libwww?

on 22.09.2007 21:04:29 by moseley

On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
> For most uses of libwww, developers do little with character encoding.
> Indeed, for general-case use of LWP::Simple, they can't, because that
> information isn't even exposed. Has any thought gone into doing this
> internally within libwww, so that when I fetch content, I get back text
> instead of octets?

If you have the response object:

$response->decoded_content;

--
Bill Moseley
moseley@hank.org

Re: Doing character encoding/decoding within libwww?

on 22.09.2007 23:50:53 by derhoermi

* Bill Moseley wrote:
>On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
>> For most uses of libwww, developers do little with character encoding.
>> Indeed, for general-case use of LWP::Simple, they can't, because that
>> information isn't even exposed. Has any thought gone into doing this
>> internally within libwww, so that when I fetch content, I get back text
>> instead of octets?
>
>If you have the response object:
>
> $response->decoded_content;

That removes content encodings like gzip and deflate, but David is
asking about character encodings like utf-8 and iso-8859-1. Content
encodings are applied after character encodings.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/
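
The layering described above (a content encoding applied on top of a character encoding) can be illustrated with core modules only. This is a sketch of the two layers in isolation, not of what libwww does internally; the sample string is made up.

```perl
use strict;
use warnings;
use Encode qw(encode decode);
use IO::Compress::Gzip qw(gzip $GzipError);
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

my $text = "Bj\x{f6}rn";    # a character string, 5 characters

# Sender side: characters -> octets (character encoding),
# then octets -> compressed octets (content encoding).
my $octets = encode('UTF-8', $text);
gzip(\$octets => \my $body) or die "gzip failed: $GzipError";

# Receiver side: undo the layers in reverse order.
gunzip(\$body => \my $plain) or die "gunzip failed: $GunzipError";
my $decoded = decode('UTF-8', $plain);

printf "%d characters, %d octets, round trip %s\n",
    length($text), length($octets), ($decoded eq $text ? 'ok' : 'failed');
```

Removing only the gzip layer leaves UTF-8 octets, which is why undoing the content encoding alone does not yield text.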

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 01:07:46 by moseley

On Sat, Sep 22, 2007 at 11:50:53PM +0200, Bjoern Hoehrmann wrote:
> * Bill Moseley wrote:
> >On Fri, Sep 21, 2007 at 12:49:26PM -0700, David Nesting wrote:
> >> For most uses of libwww, developers do little with character encoding.
> >> Indeed, for general-case use of LWP::Simple, they can't, because that
> >> information isn't even exposed. Has any thought gone into doing this
> >> internally within libwww, so that when I fetch content, I get back text
> >> instead of octets?
> >
> >If you have the response object:
> >
> > $response->decoded_content;
>
> That removes content encodings like gzip and deflate, but David is
> asking about character encodings like utf-8 and iso-8859-1. Content
> encodings are applied after character encodings.

sub decoded_content {
    ....

    $content_ref = \Encode::decode($charset, $$content_ref,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());

--
Bill Moseley
moseley@hank.org

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 01:22:21 by derhoermi

* Bill Moseley wrote:
>sub decoded_content {
> ....
>
> $content_ref = \Encode::decode($charset, $$content_ref,
> Encode::FB_CROAK() | Encode::LEAVE_SRC());

The documentation I re-read earlier even says that... This is still a
far cry from being generally useful though, it only works for text/*
and only if the encoding is specified in the header, or the format does
not use some kind of inline label that is inconsistent with the default.
Most of the time this is not the case, however.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 02:57:17 by david

On 9/22/07, Bjoern Hoehrmann wrote:
>
> Generally speaking, this is rather difficult as some content may not be
> textual at all, and textual formats vary in how applications are to
> detect the encoding (e.g., XML has different rules than HTML, text/plain
> has no rules beyond looking at the charset parameter, and so on). If you
> want a general-purpose solution, a good start would be a module taking a
> HTTP::Response object and detecting the encoding, possibly decoding it
> on request.


Fortunately, we know the Content-Type at this point, so we can decide if
it's appropriate to decode it as text, and if so, how to go about doing it.

HTML::Encoding seems like it approaches the problem reasonably well, but
ideally, I'd like to be able some day to use LWP::Simple's get() and get
back a logical text string for text/* or application/*+xml. Similarly,
getprint() should do the Right Thing with respect to my locale. Users of
LWP::Simple can't invoke another layer of processing, even if they wanted
to. So, today, it's either "get back octets that may or may not be useful
as text" or "use the full blown LWP::UserAgent and add another layer
(perhaps too-specifically-named HTML::Encoding) to make sure you get text
right." It just seems like we can simplify that.

Thanks for the feedback.

David

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 03:05:24 by david

On 9/22/07, Bjoern Hoehrmann wrote:
>
> * Bill Moseley wrote:
> >If you have the response object:
> > $response->decoded_content;
>
> That removes content encodings like gzip and deflate, but David is
> asking about character encodings like utf-8 and iso-8859-1. Content
> encodings are applied after character encodings.
>

So after reading Bill's response, I thought to myself the same thing, but
added, "...though that sounds like it would be the perfect place to
implement this."

After checking the code, decoded_content does indeed decode character
encodings and returns text instead of octets! I don't think it used to do
that, but that's great.

It still doesn't help in the LWP::Simple case, though, and if someone is
actually using LWP::Simple for their application, they probably aren't going
to spend the time needed to ensure the octets they get back are meaningful
text either. But this certainly simplifies the problem.

What would people think about just changing LWP::Simple to use
decoded_content instead of content?
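
The difference between the two accessors can be seen without any network access. The following is a minimal sketch, assuming HTTP::Response (from the HTTP::Message distribution) is installed; the response is built by hand and its body is made-up sample data.

```perl
use strict;
use warnings;
use HTTP::Response;

# Hand-built response: the body is "café" encoded as UTF-8 octets.
my $response = HTTP::Response->new(
    200, 'OK',
    [ 'Content-Type' => 'text/plain; charset=UTF-8' ],
    "caf\xC3\xA9",
);

my $octets = $response->content;            # raw octets as received
my $text   = $response->decoded_content;    # character string

printf "content: %d octets, decoded_content: %d characters\n",
    length($octets), length($text);
```

A get() built on decoded_content would hand callers the second value rather than the first.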

David

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 03:12:13 by moseley

On Sun, Sep 23, 2007 at 01:22:21AM +0200, Bjoern Hoehrmann wrote:
> * Bill Moseley wrote:
> >sub decoded_content {
> > ....
> >
> > $content_ref = \Encode::decode($charset, $$content_ref,
> > Encode::FB_CROAK() | Encode::LEAVE_SRC());
>
> The documentation I re-read earlier even says that... This is still a
> far cry from being generally useful though, it only works for text/*
> and only if the encoding is specified in the header, or the format does
> not use some kind of inline label that is inconsistent with the default.
> Most of the time this is not the case, however.

It will also find content-type in the markup, IIRC.

It's been a long day. What other mime types are you thinking of other
than text/*?

--
Bill Moseley
moseley@hank.org

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 08:53:14 by david

On 9/22/07, Bill Moseley wrote:
>
> It's been a long day. What other mime types are you thinking of other
> than text/*?
>

The most complete implementation imaginable would start with at least these:

text/html (html-specific rules)
text/xml (xml-specific rules)
text/* (general-purpose text rules)
application/*+xml (xml-specific rules)

You'd probably also want this to be extensible, so that I can add my own
media types at run-time to guarantee my non-obvious textual media type is
handled properly.
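
One way to picture such an extensible mapping is an ordered table of patterns. This is a sketch only: the function names, the rule-set labels and the application/x-my-notes type are hypothetical, and nothing like this exists in libwww as far as this thread shows.

```perl
use strict;
use warnings;

# Hypothetical registry: ordered [pattern, rule-set] pairs, first match wins.
my @rules = (
    [ qr{^text/html$}i,           'html' ],
    [ qr{^text/xml$}i,            'xml'  ],
    [ qr{^application/.+\+xml$}i, 'xml'  ],
    [ qr{^text/}i,                'text' ],
);

# Register an extra media type at run time; new entries take precedence.
sub register_media_type {
    my ($pattern, $rule_set) = @_;
    unshift @rules, [ $pattern, $rule_set ];
    return;
}

# Return the rule-set name for a media type, or undef to leave it as octets.
sub rule_set_for {
    my ($media_type) = @_;
    for my $rule (@rules) {
        my ($pattern, $rule_set) = @$rule;
        return $rule_set if $media_type =~ $pattern;
    }
    return;
}

register_media_type(qr{^application/x-my-notes$}i, 'text');

for my $type ('text/html', 'application/atom+xml', 'application/x-my-notes', 'image/png') {
    my $rule_set = rule_set_for($type);
    printf "%-24s => %s\n", $type, defined($rule_set) ? $rule_set : 'octets';
}
```

Keeping the table ordered lets the specific text/html and text/xml entries win over the general text/* entry.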

On the other hand, I'm less convinced now that dipping into the HTML or XML
content to figure out the proper encoding is necessarily the proper thing to
do here. My complaint about LWP::Simple was that the HTTP Content-Type
(charset) information is lost by the time it gets to the caller. If the
data isn't in text at that point, it will never reliably get there. But for
HTML and XML, if the character encoding is actually specified in the
content rather than in the HTTP headers, then it isn't as important to deal
with it up front. I could see a case then for dealing with text/* only and
returning octets for everything else, since text/* is the only media type
that has character encoding details in the HTTP headers. That being said,
applications based on LWP::Simple are likely to work better with HTML and
XML "assistance" for the reason I gave earlier: users of LWP::Simple
probably aren't going to take the time to do the proper parsing and
decoding. Yes, it's still "their fault" for not coding a robust
application, but helping them do that is I think still a valid goal, if we
can do it safely.

David

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 16:51:25 by moseley

On Sat, Sep 22, 2007 at 11:53:14PM -0700, David Nesting wrote:
> On the other hand, I'm less convinced now that dipping into the HTML or XML
> content to figure out the proper encoding is necessarily the proper thing to
> do here.

Well, it's often needed since content providers may not have the
ability to alter the server's Content-Type header to add the correct
charset.

On the other hand, it probably depends on what you plan to do with the
content. Passing off to a parser (e.g. libxml2) would also figure out
the encoding.

I have a program that uses LWP and used decoded_content but then I
re-encode it before passing it on to the next tool in the chain that
also will decode. But, I've also considered parsing the content and
removing any content-specified charsets and returning utf8 at all
times.

> My complaint about LWP::Simple was that the HTTP Content-Type
> (charset) information is lost by the time it gets to the caller. If
> the data isn't in text at that point, it will never reliably get
> there. But for HTML and XML, if the character encoding is actually
> specified in the content rather than in the HTTP headers, then it
> isn't as important to deal with it up front. I could see a case
> then for dealing with text/* only and returning octets for
> everything else, since text/* is the only media type that has
> character encoding details in the HTTP headers. That being said,
> applications based on LWP::Simple are likely to work better with
> HTML and XML "assistance" for the reason I gave earlier: users of
> LWP::Simple probably aren't going to take the time to do the proper
> parsing and decoding. Yes, it's still "their fault" for not coding
> a robust application, but helping them do that is I think still a
> valid goal, if we can do it safely.

I'd tend to agree. Make LWP::Simple return decoded content and if you
need more control don't use LWP::Simple.

--
Bill Moseley
moseley@hank.org

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 18:12:19 by derhoermi

* David Nesting wrote:
>The most complete implementation imaginable would start with at least these:
>
>text/html (html-specific rules)
>text/xml (xml-specific rules)
>text/* (general-purpose text rules)
>application/*+xml (xml-specific rules)

HTML::Encoding does all of these, except text/* (for which there are no
rules beyond checking the charset parameter, though you might also try
to check for a Unicode signature at the beginning, which almost always
indicates the Unicode encoding form, HTML::Encoding can do both but is
not designed to do that for arbitrary types).
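
Checking for a Unicode signature at the start of the content can be sketched as follows. encoding_from_bom is a hypothetical helper written for illustration, not HTML::Encoding's interface, and it covers only the common byte order marks.

```perl
use strict;
use warnings;

# Hypothetical helper: guess the Unicode encoding form from a leading
# byte order mark. UTF-32LE is tested before UTF-16LE because its mark
# (FF FE 00 00) begins with the UTF-16LE mark (FF FE).
sub encoding_from_bom {
    my ($octets) = @_;
    return 'UTF-8'    if $octets =~ /^\xEF\xBB\xBF/;
    return 'UTF-32LE' if $octets =~ /^\xFF\xFE\x00\x00/;
    return 'UTF-32BE' if $octets =~ /^\x00\x00\xFE\xFF/;
    return 'UTF-16LE' if $octets =~ /^\xFF\xFE/;
    return 'UTF-16BE' if $octets =~ /^\xFE\xFF/;
    return;    # no signature found
}

my $guess = encoding_from_bom("\xEF\xBB\xBFhello");
print defined($guess) ? $guess : 'no signature', "\n";
```

A signature is only a hint: content without one may still be in any encoding, so the charset parameter and format-specific rules remain necessary.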

>On the other hand, I'm less convinced now that dipping into the HTML or XML
>content to figure out the proper encoding is necessarily the proper thing to
>do here. My complaint about LWP::Simple was that the HTTP Content-Type
>(charset) information is lost by the time it gets to the caller.

Well that is necessarily so to keep the interface simple. Going from
LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough to not
warrant adding functionality to LWP::Simple.

>I could see a case then for dealing with text/* only and returning octets
>for everything else, since text/* is the only media type that has character
>encoding details in the HTTP headers.

Actually that is not the case, there are plenty of, say, application/*
formats, like the XML types, that carry encoding information in the
header, without replicating it in the content (likewise, information in
the content may not be replicated in the header, and the two may
contradict each other).

>Yes, it's still "their fault" for not coding a robust application, but
>helping them do that is I think still a valid goal, if we can do it safely.

Well, automagic decoding of content cannot be added to LWP::Simple
without some opt-in switch as that would break a lot of programs, and if you
require some opt-in, you might as well require switching the module.
--
Björn Höhrmann · mailto:bjoern@hoehrmann.de · http://bjoern.hoehrmann.de
Weinh. Str. 22 · Telefon: +49(0)621/4309674 · http://www.bjoernsworld.de
68309 Mannheim · PGP Pub. KeyID: 0xA4357E78 · http://www.websitedev.de/

Re: Doing character encoding/decoding within libwww?

on 23.09.2007 21:29:36 by david

On 9/23/07, Bjoern Hoehrmann wrote:
>
>
> Well that is necessarily so to keep the interface simple. Going from
> LWP::Simple::get to LWP::UserAgent->new->get(...) is easy enough to not
> warrant adding functionality to LWP::Simple.


My concern, though, is that with this approach, LWP::Simple isn't just
lacking features: it's harmful. Users of LWP::Simple today cannot guarantee
that the octets they get are usable as text. So long as applications use
it, these applications will never be properly internationalizable and we
will continue seeing new applications written that don't properly handle
character encodings.

> Actually that is not the case, there are plenty of, say, application/*
> formats, like the XML types, that carry encoding information in the
> header, without replicating it in the content (likewise, information in
> the content may not be replicated in the header, and the two may
> contradict each other).


I didn't notice that application/xml and +xml media types also made the HTTP
charset authoritative. Basically, my thought is that if it follows these
rules (by placing it in the HTTP headers), it seems appropriate to decode it
as text. Otherwise, the charset information will require some closer
inspection, but could easily be done by the caller even if they use
LWP::Simple.


> Well, automagic decoding of content cannot be added to LWP::Simple
> without some opt-in switch as that would break a lot of programs, and if you
> require some opt-in, you might as well require switching the module.


That's certainly a good argument. You could also just supplement its
methods with variants that attempt to return text instead of octets, and
deprecate or at least discourage the use of the other methods when you're
expecting text. (It might be appropriate to print out a warning when an
octet-based method is used to fetch a textual media type.)

If LWP::Simple can't be easily changed to manage character encodings
cluefully, reasonably completely, and transparently to the caller, the
responsible thing to do would be to add some verbiage to its documentation
making this clear and discouraging its use altogether for retrieving text.

David
