HTTP::Request::Common::POST and UTF-8

on 27.09.2005 13:21:28 by Stephen Collyer

I'm passing a string containing UTF-8 to HTTP::Request::Common::POST
and the UTF-8 seems to be destroyed during the encoding required
for application/x-www-form-urlencoded. (I get 4 UTF-8 chars encoded
to two spaces, AFAICS).

Is this a known problem with LWP ?

Any suggestions for a quick fix ?

Steve Collyer

Re: HTTP::Request::Common::POST and UTF-8

on 27.09.2005 13:56:21 by flavell

On Tue, 27 Sep 2005, Stephen Collyer wrote:

> I'm passing a string containing UTF-8 to HTTP::Request::Common::POST
> and the UTF-8 seems to be destroyed during the encoding required
> for application/x-www-form-urlencoded.

I don't know the answer to your question, but, in principle the web
specifications say that application/x-www-form-urlencoded is only
guaranteed to support us-ascii. You and I know, in practical terms,
that it may not be as bad as that, and when executing a GET we'd have
no other choice; but if you're using POST rather than GET, then you
might be advised to use multipart/form-data instead.
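
A minimal sketch of the multipart/form-data suggestion, assuming HTTP::Request::Common is installed and that the receiving script expects UTF-8. The URL and the field name are placeholders borrowed from elsewhere in this thread, not a known service. The value is encoded to octets first, because the request body carries bytes rather than Perl character strings.

```perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Request::Common qw(POST);

# Character string holding Greek alpha and beta (code points above 255).
my $text = "\x{3b1}\x{3b2}";

# Assumption: the server wants UTF-8, so encode explicitly before
# building the request.
my $octets = encode('UTF-8', $text);

# Content_Type => 'form-data' asks POST to build a multipart/form-data
# body instead of application/x-www-form-urlencoded.
my $request = POST('http://192.168.0.1/test',
    Content_Type => 'form-data',
    Content      => [ data => $octets ],
);

my $content_type = $request->header('Content-Type');
print "$content_type\n";
```

This sketch does not label the part with a charset, so the receiving side still has to know that the octets are UTF-8.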

Have you proved that the transaction that you're trying to carry out
can be successfully initiated "by hand" or from a web browser, before
you try to implement it from LWP? Just to be sure you're looking in
the right place for the problem, I mean.

> Is this a known problem with LWP ?

If I was confronted with this problem, I'd write a short test-case to
investigate what was happening. If the test didn't reveal the problem
to me, I'd consider posting the complete code here.

Does Perl know that this is a utf-8 text string i.e. in the sense of
the Unicode support that is in Perl 5.8+ versions? Or are you handing
it around as binary, or what?

Sorry I can't be of more help "off the top of my head".

Re: HTTP::Request::Common::POST and UTF-8

on 27.09.2005 20:08:36 by Stephen Collyer

Alan J. Flavell wrote:

> I don't know the answer to your question, but, in principle the web
> specifications say that application/x-www-form-urlencoded is only
> guaranteed to support us-ascii. You and I know, in practical terms,
> that it may not be as bad as that, and when executing a GET we'd have
> no other choice; but if you're using POST rather than GET, then you
> might be advised to use multipart/form-data instead.

1. Yes, I've read your nice web page on the matter, so AFAICS it
should be possible

2. I'm currently constrained to application/x-www-form-urlencoded
but, yes, it may make more sense to use multipart/form-data.

> Have you proved that the transaction that you're trying to carry out
> can be successfully initiated "by hand" or from a web browser, before
> you try to implement it from LWP? Just to be sure you're looking in
> the right place for the problem, I mean.

Yes, this is working code that I'm reworking for UTF-8 support.
So I know precisely where the problem is.

> If the test didn't reveal the problem
> to me, I'd consider posting the complete code here.

I've investigated to the point that I can see that the problem
seems to occur at line 53 of URI::_query::query_form:

53:b $self->query(@query ? join('&', @query) : undef);

This routine escapes the data in the POST content array, and
all seems well up to line 53 where it sets the content of the
query. When I look at $self->query(), all UTF-8 chars seem to have
been converted to +. This looks bizarre as it's only doing a join.

I need to investigate this further - it should be easy enough to
cook up a small example to reproduce if it is indeed a bug.

This is using perl, v5.8.3

>
> Does Perl know that this is a utf-8 text string i.e. in the sense of
> the Unicode support that is in Perl 5.8+ versions? Or are you handing
> it around as binary, or what?

Yes, these are marked as UTF-8 according to Encode::is_utf8.

Steve Collyer

Re: HTTP::Request::Common::POST and UTF-8

on 27.09.2005 21:03:31 by Stephen Collyer

Alan J. Flavell wrote:
> On Tue, 27 Sep 2005, Stephen Collyer wrote:

>>Is this a known problem with LWP ?
>
> If I was confronted with this problem, I'd write a short test-case to
> investigate what was happening. If the test didn't reveal the problem
> to me, I'd consider posting the complete code here.

Here's a simple program that demonstrates something similar, if not
the precise behaviour that I'm seeing:

##############################################################

#!/usr/bin/perl

use LWP;
use HTTP::Request::Common;
use Encode;
use charnames qw(greek);

binmode(STDOUT, ":utf8");

my $utf8_data = "<\N{alpha}\N{beta}\N{gamma}\N{delta}>";

print $utf8_data, "\n\n";

print Encode::is_utf8($utf8_data)
    ? "\$utf8_data marked as UTF-8\n\n"
    : "\$utf8_data not marked as UTF-8\n\n";

my $request = POST("http://192.168.0.1/test",
    Content => [
        data      => $utf8_data,
        more_data => "some more data",
    ]
);

my $req_string = $request->as_string();

print Encode::is_utf8($req_string)
    ? "\$req_string marked as UTF-8\n\n"
    : "\$req_string not marked as UTF-8\n\n";

print $req_string, "\n";

##############################################################

The UTF-8 chars seem to have disappeared totally, though in
the code I am debugging, they are converted to encoded spaces
(i.e. +)

Steve Collyer
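
One possible quick fix, sketched under the assumption that the server expects UTF-8: convert the character string to UTF-8 octets with Encode::encode before handing it to POST, so that the URL-escaping code only ever sees values in the range 0-255. The URL and field names are the ones from the test program above. Whether this matches what the real application needs is not something the thread confirms.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);
use HTTP::Request::Common qw(POST);

# Same test data as above, written with explicit code points.
my $text = "<\x{3b1}\x{3b2}\x{3b3}\x{3b4}>";

# Encode to UTF-8 octets first; every element is now a byte 0-255.
my $octets = encode('UTF-8', $text);

my $request = POST('http://192.168.0.1/test',
    Content => [
        data      => $octets,
        more_data => 'some more data',
    ],
);

# The body is the application/x-www-form-urlencoded string.
my $body = $request->content;
print $body, "\n";
```

With the value encoded up front, the four Greek letters should show up in the body as the percent-escaped byte pairs %CE%B1 %CE%B2 %CE%B3 %CE%B4 rather than vanishing.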

Re: HTTP::Request::Common::POST and UTF-8

on 27.09.2005 22:31:08 by flavell

On Tue, 27 Sep 2005, Stephen Collyer wrote:

> I've investigated to the point that I can see that the problem
> seems to occur at line 53 of URI::_query::query_form:
>
> 53:b $self->query(@query ? join('&', @query) : undef);

Please understand that I'm thinking aloud here: I don't have the
answer, but, as no-one else has stepped in, I thought my ponderings
might just be helpful.

Hmmm, the version of _query.pm that I'm looking at here (which
might be old) does its escaping via $URI::Escape::escapes{$1}

Looking at http://search.cpan.org/~gaas/URI-1.35/URI/Escape.pm
it appears there are two different functions, for escaping in an
8-bit context and for escaping in a utf8 context. As it says, they
produce different results, even for the characters from 128-255.

However, if I look at the URI/Escape.pm that's installed hereabouts,
it describes itself as Revision 3.21, and shows no sign of being
capable of escaping any character above 255.

> This routine escapes the data in the POST content array, and
> all seems well up to line 53 where it sets the content of the
> query.

Seems to me that one needs to take a look whether there's any
machinery, in the version that you're using, for invoking the
utf8-context escapes, and, if so, how to trigger it. I'm not by any
means certain that the mere utf8-ness of a string would be the right
lever to trigger this, to be honest.

> When I look at $self->query(), all UTF-8 chars seem to have
> been converted to +. This looks bizarre as it's only doing a join.

My hunch is that they've been offered to a routine that can only
escape the characters 0-255.

hope this is vaguely useful at least.
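
The hunch above can be illustrated with core modules only. The sketch below builds a hypothetical 0-255 lookup table (a stand-in for %URI::Escape::escapes, not the module's actual code) and a simplified escaping routine whose set of unescaped characters is an assumption made for the example. It shows that characters above 255 have no table entry and silently drop out, while the same text encoded to UTF-8 octets escapes cleanly.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical stand-in for a table that only knows characters 0-255.
my %escapes = map { chr($_) => sprintf('%%%02X', $_) } 0 .. 255;

sub escape_with_table {
    my ($str) = @_;
    # A missing table entry is undef and interpolates as the empty string.
    no warnings 'uninitialized';
    $str =~ s/([^A-Za-z0-9\-_.~])/$escapes{$1}/g;
    return $str;
}

my $chars = "<\x{3b1}\x{3b2}>";    # "<", alpha, beta, ">"

# Character string: alpha and beta are not in the table and disappear.
my $lost = escape_with_table($chars);

# UTF-8 octets: every byte is in the table.
my $kept = escape_with_table(encode('UTF-8', $chars));

print "$lost\n";    # %3C%3E
print "$kept\n";    # %3C%CE%B1%CE%B2%3E
```

If the installed URI code behaves like this stand-in, that would account for the Greek letters vanishing in the test program. It does not by itself explain the "+" characters reported from the real application.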

Re: HTTP::Request::Common::POST and UTF-8

on 28.09.2005 14:02:51 by Stephen Collyer

Alan J. Flavell wrote:

>>When I look at $self->query(), all UTF-8 chars seem to have
>>been converted to +. This looks bizarre as it's only doing a join.
>
>
> My hunch is that they've been offered to a routine that can only
> escape the characters 0-255.

AFAICS this isn't the case, but I may be wrong.

For what it's worth, the following regex on line 16 of package
URI::_query seems to be causing the problem.

$q =~ s/([^$URI::uric])/$URI::Escape::escapes{$1}/go;

I've no idea what's going on exactly but it looks to me
like the escaping is occurring twice for some reason, with
the UTF8 intact the first time, but destroyed the second time,
after passing through the substitution above.

Steve Collyer