CGI.pm: encoding problems

on 09.06.2006 16:53:08 by benkasminbullock

I have a problem with inputting UTF-8 via a text window using CGI.pm. This
problem concerns UTF-8, so apologies for posting something with Chinese
characters in it.

The following code is a minimal working example of the problem with a lot of
extraneous material removed. It needs to be run under a web server to see
the problem. When the text is submitted using the form, the default text of
Chinese characters (the numbers from one to four) is munged into
gibberish, and the test of the input, which checks whether the
input consists of valid Chinese numerals, fails:

Input text:

一二三四

Output of program:

Input 一二三四 was not a valid number

Thank you very much for any assistance, suggestions or advice about this
problem.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Begin script (to end of message)

#!/usr/bin/perl
use warnings;
use strict;
use CGI;
use utf8;
binmode (STDOUT, ":utf8");
my $query = CGI->new();
$query->charset('UTF-8');
print $query->header();
my $kanji;
if ($query->param('kanji')) {
    my $inputnumber = $query->param('kanji');
    if ($inputnumber =~ /^([一二三四五六七八九十]+)$/) {
        $kanji = $1;
    } else {
        print "

Input $inputnumber was not a valid number

";
        $kanji = "";
    }
} else {
    $kanji = "一二三四";
}
print $query->start_form(-method => 'POST',-action => $query->url());
print $query->textarea(-name => 'kanji',
                       -default => $kanji);
print $query->submit();
print $query->endform();
print "\n\n", "
Value",
    $kanji, "
\n\n

\n";
print $query->end_html();

Re: CGI.pm: encoding problems

on 09.06.2006 23:16:23 by rvtol+news

Ben Bullock wrote:

> use warnings;
> use strict;
> use CGI;
> use utf8;
> binmode (STDOUT, ":utf8");

Try to replace those 5 lines with these (reordered) 4:

use strict;
use warnings;
use encoding 'utf8' ;
use CGI;

This would also set the PerlIO layer of STDIN to ':utf8'.

See perldoc encoding.
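
For illustration, here is a minimal sketch of what a decoding layer on an input handle changes. It does not touch CGI.pm at all: it reads simulated POST bytes from an in-memory filehandle, once raw and once through a layer. The names $raw, $in_raw, $in, $bytes and $line are made up for this example.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(encode);

# Simulated request body: the UTF-8 bytes of the four numerals.
my $raw = encode('UTF-8', "一二三四");

# Without a layer, the handle returns bytes.
open(my $in_raw, '<', \$raw) or die "open failed: $!";
my $bytes = <$in_raw>;
close($in_raw);

# With a decoding layer, the handle returns characters.
open(my $in, '<:encoding(UTF-8)', \$raw) or die "open failed: $!";
my $line = <$in>;
close($in);

printf "raw read: %d units, layered read: %d units\n",
    length($bytes), length($line);
```

Whether CGI.pm actually reads its input through such a layer seems to depend on the CGI.pm version, so treat this only as a picture of the mechanism.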

--
Affijn, Ruud

"Gewoon is een tijger."

Re: CGI.pm: encoding problems

on 10.06.2006 00:00:31 by mumia.w.18.spam+nospam.usenet

Ben Bullock wrote:
> I have a problem with inputing utf-8 via a text window using CGI.pm.
> This problem concerns UTF8 so apologies for posting something with
> Chinese characters in it.
>
> The following code is a minimal working example of the problem with a
> lot of extraneous material removed. It needs to be run under a web
> server to see the problem. When the text is submitted using the form,
> the default text of Chinese characters (they are the numbers from one to
> four) are munged into some gibberish stuff, and the test of the input,
> which checks whether the input is valid Chinese numerals, fails:
>
> Input text:
>
> 一二三四
>
> Output of program:
>
> Input 一二三四 was not a valid number
>
> Thank you very much for any assistance, suggestions or advice about this
> problem.
>
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> Begin script (to end of message)
>
> #!/usr/bin/perl
> use warnings;
> use strict;
> use CGI;
> use utf8;
> binmode (STDOUT, ":utf8");
> my $query = CGI->new();
> $query->charset('UTF-8');
> print $query->header();
> my $kanji;
> if ($query->param('kanji')) {
> my $inputnumber = $query->param('kanji');
> if ($inputnumber =~ /^([一二三四五六七八九十]+)$/) {
> $kanji = $1;
> } else {
> print "
>
> Input $inputnumber was not a valid number
>
> ";
> $kanji = "";
> }
> } else {
> $kanji = "一二三四";
> }
> print $query->start_form(-method => 'POST',-action => $query->url());
> print $query->textarea(-name => 'kanji',
> -default => $kanji);
> print $query->submit();
> print $query->endform();
> print "\n\n", "
> Value",
> $kanji, "
> \n\n
>
> \n";
> print $query->end_html();
>

I made a few changes to your program. I don't know exactly what the
problem is, but I hope that this sheds some light on it:

#!/usr/bin/perl
use warnings;
use strict;
use CGI;
use utf8;
use Encode (); # changed
binmode (STDOUT, ":utf8");
my $query = CGI->new();
$query->charset('UTF-8');
print $query->header('-cache-control' => 'no-cache'); # changed

my $kanji;
if ($query->param('kanji')) {
    my $inputnumber = $query->param('kanji');

    print <<EOF;

Interesting decodings of
"$inputnumber"

UTF-8: @{[ Encode::decode('utf8', $inputnumber) ]}





EOF

    # Add this to decode the number:
    $inputnumber = Encode::decode('utf8', $inputnumber);

    if ($inputnumber =~ /^([一二三四五六七八九十]+)$/) {
        $kanji = $1;
    } else {
        print "

Input $inputnumber was not a valid number

";
        $kanji = "";
    }
} else {
    $kanji = "一二三四";
}

print <<EOF;

The value of \$kanji is: $kanji



EOF

print $query->start_form(
    -method => 'POST',
    -action => $query->url()
);
print $query->textarea(-name => 'kanji',
                       -default => $kanji);

print <<EOF;
EOF

print $query->submit();
print $query->endform();
print "\n\n", "
Value",
    $kanji, "
\n\n

\n";
print $query->end_html();

Re: CGI.pm: encoding problems

on 10.06.2006 02:13:35 by mumia.w.18.spam+nospam.usenet

Dr.Ruud wrote:
> Ben Bullock wrote:
>
>> use warnings;
>> use strict;
>> use CGI;
>> use utf8;
>> binmode (STDOUT, ":utf8");
>
> Try to replace those 5 lines with these (reordered) 4:
>
> use strict;
> use warnings;
> use encoding 'utf8' ;
> use CGI;
>
> This would also set the PerlIO layer of STDIN to ':utf8'.
>
> See perldoc encoding.
>

I still get the problem when running Ben's program. The problem is that
using the CGI module to initialize the textarea works the first time and
not the second; however, bypassing CGI.pm and writing the textarea
directly using print seems to work consistently.

The bug might be logic related, but it's more likely CGI.pm-related.

There is a "hint" that the CGI.pm on my Sarge system is not UTF-8 ready.
This appears at the top of every page of output:


This happens even when the HTTP header says utf8.

Re: CGI.pm: encoding problems

on 10.06.2006 07:41:24 by benkasminbullock

Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I was
able to get this working, but I also noticed a couple of interesting
phenomena in debugging this program. As Mumia W. says the text in the box is
done incorrectly. Also, if I use my own "<textarea>" box the input is
mangled, and if I use the "straight" function calls of CGI.pm rather than the
object-oriented ones, things stop working again, so it does look rather like
there is something wrong inside CGI.pm. If anyone is interested, let me know
and I'll post example code.

Thanks again.

Re: CGI.pm: encoding problems

on 10.06.2006 15:20:19 by mumia.w.18.spam+nospam.usenet

Ben Bullock wrote:
> Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I
> was able to get this working, but I also noticed a couple of interesting
> phenomena in debugging this program. As Mumia W. says the text in the
> box is done incorrectly. Also, if I use my own "<textarea>" box the input is
> mangled, and if I use the "straight" function calls of CGI.pm rather
> than the object-oriented ones, things stop working again, so it does
> look rather like there is something wrong inside CGI.pm. If anyone is
> interested, let me know and I'll post example code.
>
> Thanks again.
>

How were you able to get it working? Re-ordering the prologue and using
utf8 didn't work for me.

Re: CGI.pm: encoding problems

on 10.06.2006 15:20:20 by mumia.w.18.spam+nospam.usenet

Ben Bullock wrote:
> Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud I
> was able to get this working, but I also noticed a couple of interesting
> phenomena in debugging this program. As Mumia W. says the text in the
> box is done incorrectly. Also, if I use my own "<textarea>" box the input is
> mangled, and if I use the "straight" function calls of CGI.pm rather
> than the object-oriented ones, things stop working again, so it does
> look rather like there is something wrong inside CGI.pm. If anyone is
> interested, let me know and I'll post example code.
>
> Thanks again.
>

It's not a bug; it's a feature ;)

For whatever reason, on my system, CGI.pm always interprets the STDIN
data in raw mode, regardless of the script encoding, so form elements
have to be explicitly decoded.

And CGI.pm has a nifty feature that allows the programmer to
automatically create forms with the same values that were in the posted
data.

These two behaviors combine to create the problems you had. The
workarounds are to explicitly decode the form elements and to delete the
old form element before creating another one with the same name.

This program should demonstrate the issue and workarounds:

#!/usr/bin/perl
# kanji-2.cgi
use strict;
use warnings;
use encoding 'utf8';
use CGI ();
use CGI::Carp 'fatalsToBrowser';

$\ = "\n";

# Invoke this script without a query string to
# get the default (broken) behavior.
#
# Invoke this script with a query string of 'recode'
# to get the 'kanji' form element recoded into
# utf8. Example:
#
# http://server.com/kanji-2.cgi?recode
#
# Or, if you want the old textarea data deleted
# upon successive invocations of the form, add
# a query string of 'delete' like so:
#
# http://server.com/kanji-2.cgi?delete
my $RECODE_QUERY = 0;
my $DELETE_QUERY = 0;
$RECODE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/recode/;
$DELETE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/delete/;

my $kanji;
my $text;
my $query = new CGI;

print $query->header(
-type => 'text/html',
-charset => 'utf8',
);

print $query->start_html(
-title => 'Kanji Test',
-head => CGI::meta ({-http_equiv => 'Content-Type',
-content => 'text/html; charset=utf8' ,
}),
),
$query->h1('Kanji Test');

print <<EOF;

Let's see if it's possible to send
and receive kanji numeric characters.


EOF

if (! defined $query->param('kanji')) {

$kanji = "一二三四";

} else {

$kanji = $query->param('kanji');
$kanji = Encode::decode('utf8', $kanji);
my $old_kanji = $query->param('kanji');

if ($RECODE_QUERY) {
$query->param('kanji', $kanji);
}

if ($DELETE_QUERY) {
$query->delete('kanji');
}

($text = <<EOF);
 The data received was:
ORIGINAL: $old_kanji
DECODED: $kanji

EOF


print $text;
}

my $qs = '' eq $ENV{QUERY_STRING} ? '' :
"?$ENV{QUERY_STRING}" ;

print $query->start_form(
-method => 'POST',
-action => $query->url() . $qs );

print $query->textarea(
-name => 'kanji',
-default => $kanji,
);

print $query->submit();

print $query->end_form();


print $query->end_html;
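
The decoding half of the workaround can also be seen in isolation, without a web server. This is only a sketch under the assumption that the parameter arrives as raw UTF-8 bytes (simulated here with Encode::encode); the variable names are invented for the example.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Simulate the raw bytes a browser would POST for the four numerals.
my $raw = encode('UTF-8', "一二三四");

# Same character class as in the original script.
my $numerals = qr/^([一二三四五六七八九十]+)$/;

# The byte string cannot match: every element is below 0x100,
# while the class contains only characters above that range.
my $raw_matches = $raw =~ $numerals ? 1 : 0;

# After decoding, the string holds four characters and matches.
my $decoded         = decode('UTF-8', $raw);
my $decoded_matches = $decoded =~ $numerals ? 1 : 0;

print "raw bytes match: $raw_matches\n";
print "decoded match:   $decoded_matches\n";
```

This mirrors why the "was not a valid number" branch fires in the original script whenever the parameter is still undecoded.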

Re: CGI.pm: encoding problems

on 10.06.2006 18:27:39 by unknown

Mumia W. wrote:
> Ben Bullock wrote:
>
>> Thanks to Dr. Ruud and Mumia W. for their replies. Thanks to Dr. Ruud
>> I was able to get this working, but I also noticed a couple of
>> interesting phenomena in debugging this program. As Mumia W. says the
>> text in the box is done incorrectly. Also, if I use my own "<textarea>"
>> box the input is mangled, and if I use the "straight" function calls
>> of CGI.pm rather than the object-oriented ones, things stop working
>> again, so it does look rather like there is something wrong inside
>> CGI.pm. If anyone is interested, let me know and I'll post example code.
>>
>> Thanks again.
>>
>
> It's not a bug; it's a feature ;)
>
> For whatever reason, on my system, CGI.pm always interprets the STDIN
> data in raw mode, regardless of the script encoding, so form elements
> have to be explicitly decoded.
>
> And CGI.pm has a nifty feature that allows the programmer to
> automatically create forms with the same values that were in the posted
> data.
>
> These two behaviors combine to create the problems you had. The
> workarounds are to explicitly decode the form elements and to delete the
> old form element before creating another one with the same name.
>
> This program should demonstrate the issue and workarounds:

Interesting. I found that the following program blew up on the
Encode::decode, but that $old_kanji appeared to display correctly.
Also, the 'kanji' element displayed correctly even if I did not specify
a query string. Do we have a version problem? I'm running:

Perl 5.8.6
CGI.pm 3.20
OS: Darwin 7.9.0 (a.k.a. Mac OS X)
Server: Apache 1.3.33
Browser: Firefox 1.5.0.4 (though I doubt this has anything to do with it).

>
#!/usr/local/bin/perl
> # kanji-2.cgi
> use strict;
> use warnings;
> use encoding 'utf8';
> use CGI ();
> use CGI::Carp 'fatalsToBrowser';
>
> $\ = "\n";
>
> # Invoke this script without a query string to
> # get the default (broken) behavior.
> #
> # Invoke this script with a query string of 'recode'
> # to get the 'kanji' form element recoded into
> # utf8. Example:
> #
> # http://server.com/kanji-2.cgi?recode
> #
> # Or, if you want the old textarea data deleted
> # upon successive invocations of the form, add
> # a query string of 'delete' like so:
> #
> # http://server.com/kanji-2.cgi?delete
> my $RECODE_QUERY = 0;
> my $DELETE_QUERY = 0;
> $RECODE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/recode/;
> $DELETE_QUERY = 1 if $ENV{QUERY_STRING} =~ m/delete/;
>
> my $kanji;
> my $text;
> my $query = new CGI;
>
> print $query->header(
> -type => 'text/html',
> -charset => 'utf8',
> );
>
# I found I got redundant meta headers with the original
# script, so:
> print $query->start_html(
> -title => 'Kanji Test',
## -head => CGI::meta ({-http_equiv => 'Content-Type',
## -content => 'text/html; charset=utf8' ,
## }),
> ),
> $query->h1('Kanji Test');
>
> print <<EOF;
>
> Let's see if it's possible to send
> and receive kanji numeric characters.
>
>
> EOF
>
> if (! defined $query->param('kanji')) {
>
> $kanji = "一二三四";
>
> } else {
>
> $kanji = $query->param('kanji');
eval {$kanji = Encode::decode('utf8', $kanji)};
$@ and $kanji = $@;
> my $old_kanji = $query->param('kanji');
>
> if ($RECODE_QUERY) {
> $query->param('kanji', $kanji);
> }
>
> if ($DELETE_QUERY) {
> $query->delete('kanji');
> }
>
> ($text = <<EOF);
> The data received was:
> ORIGINAL: $old_kanji
> DECODED: $kanji
>
> EOF
>
>
> print $text;
> }
>
> my $qs = '' eq $ENV{QUERY_STRING} ? '' :
> "?$ENV{QUERY_STRING}" ;
>
> print $query->start_form(
> -method => 'POST',
> -action => $query->url() . $qs );
>
> print $query->textarea(
> -name => 'kanji',
> -default => $kanji,
> );
>
> print $query->submit();
>
> print $query->end_form();
>
>
> print $query->end_html;
>

Tom Wyant

Re: CGI.pm: encoding problems

on 10.06.2006 18:51:42 by mumia.w.18.spam+nospam.usenet

harryfmudd [AT] comcast [DOT] net wrote:
> Mumia W. wrote:
>> [...]
>> This program should demonstrate the issue and workarounds:
>
> Interesting. I found that the following program blew up on the
> Encode::decode, but that $old_kanji appeared to display correctly.
> Also, the 'kanji' element displayed correctly even if I did not specify
> a query string. Do we have a version problem? [...]

Quite likely. I have perl 5.8.4 and CGI.pm 3.04 (old). That's probably
why Dr. Ruud's advice of moving the "use" statements around didn't work
for me.

So it seems that re-decoding the data is a bad idea with newer versions
of the module. As you were, everybody.
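
One way to write code that survives both situations is to decode only when the value still looks like raw bytes. The sketch below is an assumption-laden illustration, not something CGI.pm provides: decode_once is a made-up helper, and utf8::is_utf8 only reports Perl's internal flag, which is a heuristic for "already decoded" rather than a guarantee.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

# Hypothetical helper, not part of CGI.pm: decode a parameter only if it
# still looks like raw bytes (internal UTF-8 flag off).
sub decode_once {
    my ($value) = @_;
    return $value if utf8::is_utf8($value);
    return decode('UTF-8', $value);
}

my $chars = "一二三四";                 # what an already-decoding setup returns
my $raw   = encode('UTF-8', $chars);    # what a raw-mode setup returns

my $from_raw     = decode_once($raw);    # bytes in, characters out
my $from_decoded = decode_once($chars);  # characters in, returned unchanged

binmode(STDOUT, ':encoding(UTF-8)');
print "from raw:     $from_raw\n";
print "from decoded: $from_decoded\n";
```

Both calls end up with the same four-character string, so the numeral check can run on either result.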

Re: CGI.pm: encoding problems

on 14.06.2006 09:26:10 by benkasminbullock

If anyone cares, the original program is on the web as follows:

http://www.sljfaq.org/cgi/numbers.cgi
http://www.sljfaq.org/cgi/kanjinumbers.cgi

The bottom one was the one with the problems.

Ordering the statements correctly solved the problem with the encoding, but
some problems remained.

Thanks for the help.

Re: CGI.pm: encoding problems

on 14.06.2006 23:45:19 by mumia.w.18.spam+nospam.usenet

Ben Bullock wrote:
> If anyone cares, the original program is on the web as follows:
> [...]
> http://www.sljfaq.org/cgi/kanjinumbers.cgi
>
> [...]

I'm not having any problems with it. Am I supposed to?

Re: CGI.pm: encoding problems

on 15.06.2006 01:39:21 by benkasminbullock

"Mumia W." wrote in message
news:Pr%jg.13048$921.9261@newsread4.news.pas.earthlink.net...
> Ben Bullock wrote:
>> If anyone cares, the original program is on the web as follows:
>> [...]
>> http://www.sljfaq.org/cgi/kanjinumbers.cgi
>>
>> [...]
>
> I'm not having any problems with it. Am I supposed to?

No, not really. But one interesting problem occurs if you type in numbers
like this:

一ニ三四五xyz

then the xyz is preserved after you convert. If you go the other way round,

12345xyz

then the xyz disappears. The code is exactly the same going either way, so
you tell me why that should be.

Re: CGI.pm: encoding problems

on 15.06.2006 14:28:08 by mumia.w.18.spam+nospam.usenet

Ben Bullock wrote:
> "Mumia W." wrote in
> message news:Pr%jg.13048$921.9261@newsread4.news.pas.earthlink.net...
>> Ben Bullock wrote:
>>> If anyone cares, the original program is on the web as follows:
>>> [...]
>>> http://www.sljfaq.org/cgi/kanjinumbers.cgi
>>>
>>> [...]
>>
>> I'm not having any problems with it. Am I supposed to?
>
> No, not really. But one interesting problem occurs if you type in
> numbers like this:
>
> 一ニ三四五xyz
>
> then the xyz is preserved after you convert. If you go the other way round,
>
> 12345xyz
>
> then the xyz disappears. The code is exactly the same going either way,
> so you tell me why that should be.

I don't know, but perhaps you can create your own character class that
matches only numbers from the various languages you're using.
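
A rough sketch of that idea, assuming the only two scripts involved are ASCII digits and the kanji numerals from the original program. The names $kanji_numeral, $valid_number and is_valid_number are invented here, and I have not seen the code behind the live page, so this is not a fix for it.

```perl
use strict;
use warnings;
use utf8;

# Accept either a run of ASCII digits or a run of kanji numerals,
# anchored at both ends so trailing text such as "xyz" is rejected.
my $kanji_numeral = qr/[一二三四五六七八九十]/;
my $valid_number  = qr/\A(?:[0-9]+|$kanji_numeral+)\z/;

# Hypothetical helper: returns 1 for a well-formed number, 0 otherwise.
sub is_valid_number {
    my ($input) = @_;
    return $input =~ $valid_number ? 1 : 0;
}

binmode(STDOUT, ':encoding(UTF-8)');
for my $input ("12345", "一二三四五", "12345xyz", "一二三四五xyz") {
    printf "%s: %s\n", $input,
        is_valid_number($input) ? "valid" : "not valid";
}
```

Anchoring with \A and \z treats both directions the same way, so trailing characters are rejected whichever script the number is written in.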