I wonder if someone here can give me a clue as to where to look...
I am using
Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_jk/1.2.26 PHP/5.2.6-1+lenny9 with Suhosin-Patch
mod_ssl/2.2.9 OpenSSL/0.9.8g mod_apreq2-20051231/2.6.0 mod_perl/2.0.4 Perl/v5.10.0
perl -MCGI -e 'print $CGI::VERSION'
3.52
A perl cgi-bin script running under mod_perl, receives posted form parameters from a form
defined as such :
(Note: the html page itself has been saved as UTF-8 by an UTF-8 aware editor)
When I retrieve the above hidden field using
my $chars = $cgi->param('de-utf8');
the variable $chars does contain the proper UTF-8 encoded *bytes* for the above string (in
other words, 2 bytes per character e.g.), but it arrives into the script /without/ the
perl "utf8" flag set.
If I then use this value to print to a filehandle opened as such :
open(FH,'>:utf8',"myfile");
print FH $chars,"\n";
It comes out of course as .. well, I cannot type this on my keyboard, but anyone aware of
double-encoding issues can imagine the "A-tilde Copyright A-tilde squiggle.. " result.
I can of course convert it, by using
$chars = Encode::decode('utf8',$cgi->param('de-utf8'));
but it is a p.i.t.a. and I would like to know if there is a way to retrieve the posted
value directly as UTF-8, and if yes what this depends on.
(I cannot find a setting for instance in the CGI.pm module documentation.)
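To make the double-encoding effect described above concrete, here is a minimal sketch using only core modules. The sample value (the UTF-8 bytes for a single "a-umlaut") and the helper name written_bytes are assumptions for illustration; they are not taken from the original form.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumed sample value: the UTF-8 *bytes* for U+00E4, as a form
# parameter would arrive in the script without the utf8 flag.
my $bytes = "\xC3\xA4";

# Hypothetical helper: print a string through a :utf8 layer into an
# in-memory file and return the raw bytes that were written.
sub written_bytes {
    my ($string) = @_;
    my $buffer = '';
    open(my $fh, '>:utf8', \$buffer) or die "open failed: $!";
    print {$fh} $string;
    close($fh) or die "close failed: $!";
    return $buffer;
}

# Undecoded: each byte is treated as a Latin-1 character and encoded again.
my $double = written_bytes($bytes);

# Decoded first: the single character is encoded exactly once.
my $correct = written_bytes(decode('UTF-8', $bytes));

printf "undecoded: %d bytes (%vX)\n", length $double,  $double;
printf "decoded:   %d bytes (%vX)\n", length $correct, $correct;
```

The undecoded value comes out as four bytes (the "A-tilde ..." pattern), the decoded one as the original two.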
> I wonder if someone here can give me a clue as to where to look...
The CGI.pm documentation talks about the -utf8 import flag which is
probably what you're looking for. But it does caution not to use it for
anything that needs to do file uploads.
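As a rough sketch of what the -utf8 import looks like in use (assuming CGI.pm is installed and recent enough to support it): the query string passed to the constructor is only a stand-in for the real POST data, and the sample value is made up for illustration.

```perl
use strict;
use warnings;
use CGI qw(-utf8);   # ask CGI.pm to decode parameters as UTF-8

# For illustration only: parameters come from a query string handed to
# the constructor; in the real script they would come from the request.
my $cgi   = CGI->new('de-utf8=%C3%A4%C3%B6');
my $chars = $cgi->param('de-utf8');   # scalar context: a single value

printf "%d characters, utf8 flag %s\n",
    length $chars, utf8::is_utf8($chars) ? 'on' : 'off';
```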
There is a perlmonks post from a few years ago that explains one way
of automating this with CGI.pm. I've used this for several years now
without problems.
http://www.perlmonks.org/?node_id=651574
Just remember that decoding params is just one part of dealing with
utf-8. You need to worry about any data coming into or going out of
your app (reading files, retrieving from DB, send HTML out to the
browser, etc...). The following wiki book has some great information
on how to deal with utf-8 in your perl applications (and it also
includes the CGI.pm hack from Rhesa that I linked to above in the
perlmonks link).
On Fri, Feb 25, 2011 at 8:31 AM, André Warnier wrote:
> Hi.
>
> I wonder if someone here can give me a clue as to where to look...
>
> I am using
> Apache/2.2.9 (Debian) DAV/2 SVN/1.5.1 mod_jk/1.2.26 PHP/5.2.6-1+lenny9 with
> Suhosin-Patch mod_ssl/2.2.9 OpenSSL/0.9.8g mod_apreq2-20051231/2.6.0
> mod_perl/2.0.4 Perl/v5.10.0
>
> perl -MCGI -e 'print $CGI::VERSION'
> 3.52
>
> A perl cgi-bin script running under mod_perl, receives posted form
> parameters from a form defined as such :
>
>
>       "http://www.w3.org/TR/html4/loose.dtd">
>
> rset=UTF-8">
> ....
>
>         enctype="multipart/form-data" charset="UTF-8" method="POST">
> ...
>
> ...
>
> (Note: the html page itself has been saved as UTF-8 by an UTF-8 aware
> editor)
>
>
> When I retrieve the above hidden field using
>
> my $chars = $cgi->param('de-utf8');
>
> the variable $chars does contain the proper UTF-8 encoded *bytes* for the
> above string (in other words, 2 bytes per character e.g.), but it arrives
> into the script /without/ the perl "utf8" flag set.
>
> If I then use this value to print to a filehandle opened as such :
>
> open(FH,'>:utf8',"myfile");
> print FH $chars,"\n";
>
> It comes out of course as .. well, I cannot type this on my keyboard, but
> anyone aware of double-encoding issues can imagine the "A-tilde Copyright
> A-tilde squiggle.. " result.
>
> I can of course convert it, by using
>
> $chars = Encode::decode('utf8',$cgi->param('de-utf8'));
>
> but it is a p.i.t.a. and I would like to know if there is a way to retrieve
> the posted value directly as UTF-8, and if yes what this depends on.
> (I cannot find a setting for instance in the CGI.pm module documentation.)
>
>
> Thanks.
> André
>
> P.S.
> Unfortunately, when the browser (Firefox 3.5.3) is posting this data to the
> server, it is posting it as something like
>
> ...
> Content-Type    multipart/form-data;
> boundary=---------------------------326972172326727
> ...
>
> -----------------------------326972172326727
> Content-Disposition: form-data; name="de-utf8"
>
> ̟̊̚
> -----------------------------326972172326727
>
> which means that there is no charset header to the parts either.
>
Thanks. My workstation version of the CGI documentation is apparently outdated, and did
not mention that "pragma". The CPAN version does.
But yes, I will need file uploads too, and since there is no telling how exactly the -utf8
flag interferes with them, I think I'll stick with the p.i.t.a. method for now.
I wonder why browsers do not put a charset parameter in the multipart/form-data parts..
It would seem like a logical and MIME-conformant thing to do.
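For the manual route, one possible shape is a small helper that decodes only the named text fields and leaves everything else (upload data, for instance) as raw bytes. This is only a sketch under assumptions: decode_text_params, the field name 'upload' and the sample values are all hypothetical, and the plain hash stands in for values fetched one at a time with $cgi->param(...).

```perl
use strict;
use warnings;
use Encode qw(decode is_utf8);

# Hypothetical helper: return a copy of the parameters in which the
# named text fields are decoded from UTF-8 bytes to character strings.
# Fields not named (e.g. upload data) are left untouched, and values
# that already carry the utf8 flag are not decoded a second time.
sub decode_text_params {
    my ($params, @text_fields) = @_;
    my %out = %$params;
    for my $name (@text_fields) {
        next unless defined $out{$name};
        next if is_utf8($out{$name});
        $out{$name} = decode('UTF-8', $out{$name});
    }
    return \%out;
}

# Stand-in for the raw values as they arrive in the script.
my %raw = (
    'de-utf8' => "\xC3\xA4\xC3\xB6",   # UTF-8 bytes for two characters
    'upload'  => "\x89PNG\r\n",        # binary data, must stay as bytes
);

my $clean = decode_text_params(\%raw, 'de-utf8');

printf "text field: %d characters, upload: %d bytes\n",
    length $clean->{'de-utf8'}, length $clean->{'upload'};
```

Keeping the list of text fields explicit means a file upload can never be decoded by accident, which is the concern with the global flag.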
If you have a fairly recent CGI.pm, it will decode utf-8 properly for
you (even avoiding double-decoding), but there are some caveats. In
addition to what others have already said, if you are running under
mod_perl (which obviously you are), CGI.pm adds a cleanup handler (via
register_cleanup) which resets CGI.pm's global variables. One of the
variables that gets reset is the PARAM_UTF8 variable (which the -utf8
import controls). Because of this, once the cleanup handler gets
called, UTF-8 decoding gets turned off.
You have to work around this by manually making sure that
$CGI::PARAM_UTF8 = 1 before calling CGI->new.
Regards,
Michael Schout
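A minimal sketch of that workaround, assuming CGI.pm is installed: the wrapper name new_utf8_cgi is hypothetical, the query string only stands in for real request data, and setting the flag to 0 by hand merely simulates what the cleanup handler would do between requests.

```perl
use strict;
use warnings;
use CGI ();

# Hypothetical wrapper: force the flag on every time an object is
# created, so an earlier reset of CGI.pm's globals cannot disable
# UTF-8 decoding for this request.
sub new_utf8_cgi {
    $CGI::PARAM_UTF8 = 1;
    return CGI->new(@_);
}

# Simulate the state left behind after the globals were reset.
$CGI::PARAM_UTF8 = 0;

# Query string for illustration only; normally the request supplies it.
my $cgi   = new_utf8_cgi('de-utf8=%C3%A4%C3%B6');
my $chars = $cgi->param('de-utf8');

printf "%d characters, utf8 flag %s\n",
    length $chars, utf8::is_utf8($chars) ? 'on' : 'off';
```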
Re: CGI and character encoding
on 25.02.2011 21:48:43 by aw
Thanks to Michael, Michael, Lloyd, Cees,
your answers and insights have made things clearer for me.
I think I'll use a combination of all of that for this new application we're writing.
In other words, to program "defensively", I propose to do this :