CGI and character encoding

CGI and character encoding

on 24.02.2011 22:31:07 by aw

Hi.

I wonder if someone here can give me a clue as to where to look...

I am using
Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_jk/1.2.26 PHP/5.2.6-1+lenny9 with Suhosin-Patch
mod_ssl/2.2.9 OpenSSL/0.9.8g mod_apreq2-20051231/2.6.0 mod_perl/2.0.4 Perl/v5.10.0

perl -MCGI -e 'print $CGI::VERSION'
3.52

A perl cgi-bin script running under mod_perl, receives posted form parameters from a form
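defined as such :

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
        "http://www.w3.org/TR/html4/loose.dtd">
<html>
	<head>
         <meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
....
  <body>
	<form action="/litfdm/litfdm.pl" name="form"
		enctype="multipart/form-data" charset="UTF-8" method="POST">
...
<input name="de-utf8" type="hidden" value="ÄäÖöÜü">
...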

(Note: the html page itself has been saved as UTF-8 by an UTF-8 aware editor)


When I retrieve the above hidden field using

my $chars = $cgi->param('de-utf8');

the variable $chars does contain the proper UTF-8 encoded *bytes* for the above string
(i.e. 2 bytes per character here), but it arrives in the script /without/ the
perl "utf8" flag set.

If I then use this value to print to a filehandle opened as such :

open(FH,'>:utf8',"myfile");
print FH $chars,"\n";

It comes out of course as .. well, I cannot type this on my keyboard, but anyone aware of
double-encoding issues can imagine the "A-tilde Copyright A-tilde squiggle.. " result.

I can of course convert it, by using

$chars = Encode::decode('utf8',$cgi->param('de-utf8'));

but it is a p.i.t.a. and I would like to know if there is a way to retrieve the posted
value directly as UTF-8, and if yes what this depends on.
(I cannot find a setting for instance in the CGI.pm module documentation.)
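
A minimal sketch of the double-encoding effect described above, assuming the field arrives as the raw UTF-8 bytes for "ÄäÖöÜü". The helper written_length() is hypothetical and exists only for this illustration; it prints through a :utf8 layer into an in-memory file so that the number of bytes written can be compared with and without Encode::decode.

```perl
use strict;
use warnings;
use Encode qw(decode);

# "ÄäÖöÜü" as the raw UTF-8 bytes that arrive in the POST body (12 bytes).
my $bytes = "\xC3\x84\xC3\xA4\xC3\x96\xC3\xB6\xC3\x9C\xC3\xBC";

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory file and return how many bytes were actually written.
sub written_length {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer) or die "cannot open in-memory file: $!";
    print {$fh} $string;
    close($fh) or die "cannot close in-memory file: $!";
    return length($buffer);
}

# Undecoded: each of the 12 bytes is treated as a Latin-1 character and
# encoded again, so the output doubles in size.
print "undecoded: ", written_length($bytes), " bytes\n";

# Decoded first: 6 characters, each written once as 2 bytes.
print "decoded:   ", written_length(decode('UTF-8', $bytes)), " bytes\n";
```

The sketch uses the strict 'UTF-8' encoding name; the 'utf8' name used in the post is Perl's more lenient variant.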


Thanks.
André

P.S.
Unfortunately, when the browser (Firefox 3.5.3) is posting this data to the server, it is
posting it as something like

....
Content-Type multipart/form-data; boundary=---------------------------326972172326727
....

-----------------------------326972172326727
Content-Disposition: form-data; name="de-utf8"

ÄäÖöÜü
-----------------------------326972172326727

which means that there is no charset header to the parts either.

Re: CGI and character encoding

on 24.02.2011 22:36:07 by mpeters

On 02/24/2011 04:31 PM, André Warnier wrote:

> I wonder if someone here can give me a clue as to where to look...

The CGI.pm documentation talks about the -utf8 import flag which is
probably what you're looking for. But it does caution not to use it for
anything that needs to do file uploads.

--
Michael Peters
Plus Three, LP

RE: CGI and character encoding

on 24.02.2011 23:32:43 by Lloyd Richardson

FWIW, with CGI.pm I always iterate through the params and Encode::decode with the
appropriate encoding with an exception for anything binary. (file uploads etc)
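
One way to sketch the "decode every parameter except binary ones" approach is shown here. The function name decoded_params() is hypothetical. It assumes a query object with a CGI.pm-style param() method, and it assumes that upload fields come back as references (filehandle objects) while ordinary text fields come back as plain byte strings.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns { name => [values] } with text values decoded
# from UTF-8 octets into Perl character strings.
# Assumption: $query->param() with no arguments lists the field names, and
# upload values are references, which are passed through untouched.
sub decoded_params {
    my ($query) = @_;
    my %decoded;
    for my $name ($query->param) {
        my @values = $query->param($name);    # list context: all values
        $decoded{$name} = [ map { ref($_) ? $_ : decode('UTF-8', $_) } @values ];
    }
    return \%decoded;
}
```

Recent CGI.pm releases warn when param() is called in list context and offer multi_param() for that purpose; the version quoted in this thread (3.52) predates it, so the sketch keeps param().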

Re: CGI and character encoding

on 24.02.2011 23:33:56 by Cees Hek

Hi André,

There is a perlmonks post from a few years ago that explains one way
of automating this with CGI.pm. I've used this for several years now
without problems.

http://www.perlmonks.org/?node_id=651574

Just remember that decoding params is just one part of dealing with
utf-8. You need to worry about any data coming into or going out of
your app (reading files, retrieving from DB, send HTML out to the
browser, etc...). The following wiki book has some great information
on how to deal with utf-8 in your perl applications (and it also
includes the CGI.pm hack from Rhesa that I linked to above in the
perlmonks link).

http://en.wikibooks.org/wiki/Perl_Programming/Unicode_UTF-8

Cheers,

Cees Hek



Re: CGI and character encoding

on 24.02.2011 23:41:42 by aw

Michael Peters wrote:
> On 02/24/2011 04:31 PM, André Warnier wrote:
>
>> I wonder if someone here can give me a clue as to where to look...
>
> The CGI.pm documentation talks about the -utf8 import flag which is
> probably what you're looking for. But it does caution not to use it for
> anything that needs to do file uploads.
>

Thanks. My workstation version of the CGI documentation is apparently outdated, and did
not mention that "pragma". The CPAN version does.
But yes, I will need file uploads too, and since there is no telling how exactly the -utf8
flag interferes with them, I think I'll stick with the p.i.t.a. method for now.

I wonder why browsers do not put a charset parameter in the multipart/form-data parts..
It would seem like a logical and MIME-conformant thing to do.

Re: CGI and character encoding

on 25.02.2011 05:17:48 by Michael Schout

On 02/24/2011 03:31 PM, André Warnier wrote:
> Hi.
>
> I wonder if someone here can give me a clue as to where to look...

If you have a fairly recent CGI.pm, it will decode utf-8 properly for
you (even avoiding double-decoding), but there are some caveats. In
addition to what others have already said, if you are running under
mod_perl (which obviously you are), CGI.pm adds a cleanup handler (via
register_cleanup) which resets CGI.pm's global variables. One of the
variables that gets reset is the PARAM_UTF8 variable (which the -utf8
import controls). Because of this, once the cleanup handler gets
called, UTF-8 decoding gets turned off.

You have to work around this by manually making sure $CGI::PARAM_UTF8 =
1 before calling CGI->new.
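
A minimal sketch of that workaround, assuming a CGI.pm recent enough to support the -utf8 import flag. The wrapper name new_utf8_query() and the sample query string are made up for illustration; a real handler would construct the object from the incoming request instead. The caution about file uploads mentioned earlier in the thread still applies.

```perl
use strict;
use warnings;
use CGI qw(-utf8);    # import flag discussed above: decode parameter values

# Hypothetical wrapper: re-assert the global before every CGI->new, since
# the mod_perl cleanup handler is reported to reset it between requests.
sub new_utf8_query {
    my @args = @_;
    $CGI::PARAM_UTF8 = 1;
    return CGI->new(@args);
}

# Stand-in for a real request: a made-up query string carrying "Ää" as
# percent-encoded UTF-8 bytes.
my $cgi   = new_utf8_query('de-utf8=%C3%84%C3%A4');
my $chars = $cgi->param('de-utf8');    # scalar context, single value

printf "%d character(s), utf8 flag %s\n",
    length($chars), utf8::is_utf8($chars) ? 'set' : 'not set';
```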

Regards,
Michael Schout

Re: CGI and character encoding

on 25.02.2011 21:48:43 by aw

Thanks to Michael, Michael, Lloyd, Cees,

your answers and insights have made things clearer for me.
I think I'll use a combination of all of that for this new application we're writing.

In other words, to program "defensively", I propose to do this :

when sending the html page with the <form> :
- create the page and save it as UTF-8
- have the proper charset indications in it
- include a hidden test field with some known UTF-8 sequence (e.g. "ÄÖÜ")
- make sure that the application and the webserver send out the page with the proper
Content-type and charset (HTTP headers)

But since we still don't know what the browser (and the user) will actually do with this,

upon reception of the POST :
- get the test field and check how it was received
  a) check if it has the "is_utf8()" flag set (probably not)
  b) if not (a), check if at least it has the correct UTF-8 bytes in it (6, not 3)
  c) if neither (a) nor (b), reject with an error (don't know what it is then)
  d) if not (a), but (b), then set a flag 'must_decode'

- get the other parameters, and
  - if the 'must_decode' flag is not set, leave them 'as is'
  - if the flag is set, Encode::decode('utf8',..) all received
    parameters, except for file uploads (*)
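
The check on the test field could be sketched roughly as follows, assuming the hidden field carries "ÄÖÜ". The function name probe_state() and its return values are hypothetical; they only mirror steps (a) to (d) above.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $PROBE_CHARS = "\x{C4}\x{D6}\x{DC}";            # "ÄÖÜ" as 3 characters
my $PROBE_BYTES = "\xC3\x84\xC3\x96\xC3\x9C";      # the same text as 6 UTF-8 bytes

# Hypothetical classifier for the received test field:
#   'decoded'     -> (a) flag set and the characters match
#   'must_decode' -> (d) no flag, but the 6 expected UTF-8 bytes are there
#   'reject'      -> (c) anything else, e.g. 3 Latin-1 bytes
sub probe_state {
    my ($received) = @_;
    return 'reject' unless defined $received;
    my $flagged = utf8::is_utf8($received);
    return 'decoded'     if  $flagged && $received eq $PROBE_CHARS;
    return 'must_decode' if !$flagged && $received eq $PROBE_BYTES;
    return 'reject';
}

# Usage: decide once per request, then decode text fields only if needed.
my $state = probe_state($PROBE_BYTES);
my $text  = $state eq 'must_decode' ? decode('UTF-8', $PROBE_BYTES) : $PROBE_BYTES;
print "state: $state, length after handling: ", length($text), "\n";
```

Keeping the decision in one place means the rest of the parameter handling only has to consult a single flag.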

That's of course in the hope that, some day, browsers will send multipart data with the
proper charset indication, and that CGI.pm will take it into account and do the right thing.



(*) although a question then is how a Polish browser would send the filename attribute,
assuming it is originally something like "Qualitätsübersicht.pdf"