Re: join("") somehow changes characters after "z"

on 09.10.2007 23:13:29 by Ben Morrow

[I have re-wrapped Paul's text to fit in 79 real columns (bytes as
opposed to characters): as a general rule, it's a lot easier when
dealing with charset problems to run your output through od so we can
see what's actually happening]

Quoth Paul Lalli:
> I understand diddlysquat about "wide" characters, characters vs bytes,
> etc. And for the first time, it's biting me. Can someone please
> point me to some resource that will help me to understand why these
> two one liners produce such drastically different output?
>
> This is the only time I can ever remember seeing a list of characters
> being printed differently than a string comprised of those same
> characters, joined by the empty string.
>
> $ perl -le'print map { chr($_) } grep { (chr($_) =~ /\p{IsAlpha}/) }
> (1..256);'
> Wide character in print at -e line 1.

This is your first problem; solving this also incidentally removes the
problem you were asking about... :)

Perl doesn't know what character encoding you are expecting on STDOUT.
As a result, it is printing the raw bytes of its own internal
representation, which changes if you do something requiring Unicode.
It's all rather a mess, mostly as a result of trying to support both
Unicode-aware and non-Unicode-aware programs without breaking backwards
compatibility.

To tell Perl what encoding you are expecting on STDOUT, you push an
:encoding layer with binmode:

binmode STDOUT, ':encoding(utf8)';

for instance, which says to use Perl's slightly lax form of UTF-8. (The
principal difference is that Perl allows you to use codepoints with no
assigned Unicode characters: if you ask instead for :encoding(UTF-8)
then you get the strict version.)

BTW, don't be tempted to use the :utf8 layer directly. While it works
well enough for output, on input it causes a real mess if it gets fed
invalid data.
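
As a rough illustration of the lax/strict distinction and of rejecting
bad input, here is a minimal sketch using the Encode module rather than
a PerlIO layer. The helper name decode_strict is made up for this
example; it is one way to fail loudly on malformed bytes, not the only
one.

```perl
use strict;
use warnings;
use Encode qw(decode find_encoding);

# The two spellings resolve to different encoding objects.
print find_encoding('utf8')->name,  "\n";   # the lax encoding
print find_encoding('UTF-8')->name, "\n";   # the strict encoding

# Hypothetical helper: decode strictly and return undef for malformed
# input instead of quietly accepting it.
sub decode_strict {
    my ($bytes) = @_;    # a copy, since FB_CROAK may modify its argument
    my $text = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    return $text;
}

print defined decode_strict("\xC3\xA9") ? "valid\n" : "rejected\n";
print defined decode_strict("\xFF")     ? "valid\n" : "rejected\n";
```

The first call is given the two-byte UTF-8 sequence for e-acute and
decodes to one character; the second is given a byte that can never
occur in UTF-8 and is rejected.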

> ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÂªÂµÂºÃ ÃÃÃÃÃ
> ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ Ã¡Ã¢Ã£Ã¤ Ã¥Ã¦Ã§Ã¨Ã©ÃªÃ«
> Ã¬ÃÃ®Ã¯Ã°Ã±Ã²Ã³Ã´ÃµÃ¶Ã¸Ã¹ÃºÃ»Ã¼Ã½Ã¾Ã¿Ã

Here every character except the last can be represented by a single
ISO8859-1 byte, so internally it is. Since Perl is dumping its raw
representation on STDOUT, every character except the last is output as
one byte.

(Something somewhere must have converted Perl's output to UTF8, as
that's what arrived here: probably your newsreader and/or your
terminal...)

> $ perl -le'print join "", map { chr($_) } grep { (chr($_) =~
> /\p{IsAlpha}/) } (1..256);'
> Wide character in print at -e line 1.
> ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzÃÂªÃÂµ ÃÂºÃÃÃÃÃ
>
> ÃÃÃÃâÃâÃâÃâÃâ¢ÃâÃËÃâ¢ÃÅ¡ÃâºÃÃÃ ÃÃÂ ÃÂ¡ÃÂ¢
> ÃÂ£ÃÂ¤ÃÂ¥ÃÂ¦ÃÂ§ÃÂ¨ÃÂ©ÃÂªÃÂ«ÃÂ¬ÃÂÃÂ®ÃÂ¯ÃÂ°Ã
> Â±ÃÂ²ÃÂ³ÃÂ´ÃÂµÃÂ¶ÃÂ¸ÃÂ¹ÃÂºÃÂ»ÃÂ¼ÃÂ½ÃÂ¾ÃÂ¿Ã

Here you join the characters into a single string before printing; this
forces Perl to 'upgrade' the whole lot to UTF8 so it can represent the
last character. Thus, the output here is actually valid UTF8, it's just
that your terminal (or whatever) is interpreting it as ISO8859-1 and
so seeing a whole lot of extra characters.

If you set an :encoding you get the same behaviour in either case:
output in the charset you asked for, with a (configurable) error if you
try to print something that can't be represented (only possible with
non-UTF charsets).
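
A minimal sketch to check this without involving a terminal: print to
an in-memory filehandle through an :encoding layer and compare the
bytes. The helper name bytes_written is invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: print the arguments to an in-memory filehandle
# opened with the given layer and return the bytes that were written.
sub bytes_written {
    my ($layer, @items) = @_;
    my $buffer = '';
    open my $fh, ">$layer", \$buffer
        or die "cannot open in-memory handle: $!";
    print {$fh} @items;
    close $fh or die "close failed: $!";
    return $buffer;
}

# The same characters as in the two one-liners.
my @chars = map { chr($_) } grep { chr($_) =~ /\p{IsAlpha}/ } 1 .. 256;

my $as_list   = bytes_written(':encoding(UTF-8)', @chars);
my $as_string = bytes_written(':encoding(UTF-8)', join '', @chars);

print $as_list eq $as_string ? "same bytes\n" : "different bytes\n";
```

With the layer in place both forms go through the same encoding step,
so this should print "same bytes" and no "Wide character" warning.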

> Before anyone asks, yes I did see this notation in `perldoc -f chr`:
> Note that characters from 128
> to 255 (inclusive) are by default not encoded in
> UTF-8 Unicode for backward compatibility reasons
> but I don't really grok what it means, or why it would make the two
> prints different.

The Unicode documentation in 5.8 is rather patchy: it's written from the
point of view of someone who understands the implementation (which is
slightly weird) and wants to list the various gotchas. If and when you
have a good grasp of all this from a user's point of view, a doc patch
or two would certainly be appreciated... :)

What this is saying is that the string returned by chr(129) is
internally represented as one byte, but will need to be changed into two
bytes if it is joined to a UTF8-encoded string: exactly as happened to
you. This is all something that you as a user should not need to be
concerned with, but occasionally imperfections in the implementation
allow it to show through.
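
For the curious, a small sketch that makes the internal change visible.
It uses utf8::is_utf8 and bytes::length purely as debugging aids; the
variable names are just for illustration.

```perl
use strict;
use warnings;
use bytes ();    # load bytes::length without enabling the pragma

my $narrow = chr(129);      # fits in one byte internally
my $wide   = chr(0x100);    # forces the UTF-8 internal form

my $joined = join '', $narrow, $wide;
my $first  = substr $joined, 0, 1;    # the same character as $narrow

printf "narrow: utf8 flag %d, %d byte(s)\n",
    utf8::is_utf8($narrow) ? 1 : 0, bytes::length($narrow);
printf "first:  utf8 flag %d, %d byte(s)\n",
    utf8::is_utf8($first) ? 1 : 0, bytes::length($first);
print "still equal\n" if $narrow eq $first;
```

This prints:

narrow: utf8 flag 0, 1 byte(s)
first:  utf8 flag 1, 2 byte(s)
still equal

The two strings compare equal as characters; only the storage differs.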

Ben

Re: join("") somehow changes characters after "z"

on 10.10.2007 20:52:17 by hjp-usenet2

On 2007-10-09 21:13, Ben Morrow wrote:
> Perl doesn't know what character encoding you are expecting on STDOUT.
> As a result, it is printing the raw bytes of its own internal
> representation,

No, it isn't. It prints strings which contain only characters
in the range [0 .. 255] as 1 byte per character, and strings which
contain characters outside of this range as 1 utf8 sequence per
character. This is independent of how the strings are represented
internally. Consider this:

#!/usr/bin/perl
use utf8;

my $x = "\x{B0}";
utf8::upgrade($x);

print STDERR utf8::is_utf8($x) ? "wide\n" : "byte\n";

print $x;
__END__

% ./foo | od -tx1
wide
0000000 b0
0000001

After the upgrade, $x is internally represented as a wide string (as can
be seen from the output "wide" on STDERR), but it still prints only one
byte to STDOUT.
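
The other half of the rule, and the list-versus-join difference from
the original question, can be sketched the same way. This assumes an
in-memory filehandle with no encoding layer behaves like STDOUT here;
the helper name raw_bytes is made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper: print to an in-memory filehandle that has no
# encoding layer and return the bytes, with "Wide character" silenced.
sub raw_bytes {
    my @items = @_;
    my $buffer = '';
    open my $fh, '>', \$buffer
        or die "cannot open in-memory handle: $!";
    {
        no warnings 'utf8';
        print {$fh} @items;
    }
    close $fh or die "close failed: $!";
    return $buffer;
}

my @chars = (chr(0xB0), chr(0x100));

# Printed as a list, each string is handled on its own.
printf "list:   %s\n", unpack 'H*', raw_bytes(@chars);

# Joined, the one string contains a character above 255.
printf "joined: %s\n", unpack 'H*', raw_bytes(join '', @chars);
```

This prints:

list:   b0c480
joined: c2b0c480

In the list case U+00B0 goes out as the single byte b0; once joined to
U+0100 it goes out as the UTF-8 sequence c2 b0.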

hp

--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Sysadmin WSR | with an emu on his shoulder.
| | | hjp@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"

Re: join("") somehow changes characters after "z"

on 10.10.2007 22:22:36 by Ben Morrow

Quoth "Peter J. Holzer":
> On 2007-10-09 21:13, Ben Morrow wrote:
> > Perl doesn't know what character encoding you are expecting on STDOUT.
> > As a result, it is printing the raw bytes of its own internal
> > representation,
>
> No, it isn't. It prints strings which contain only characters
> in the range [0 .. 255] as 1 byte per character, and strings which
> contain characters outside of this range as 1 utf8 sequence per
> character. This is independent of how the strings are represented
> internally.

Thanks for the correction. Hmm, that's just... *broken*. The results are
guaranteed to be absolutely useless... it would be better to print ?s,
or print nothing; especially as characters over 255 were new in 5.8 so
there were no back-compat issues.

Ben