UTF8 strings and filesystem access
on 11.10.2007 01:11:11 by ansok
One way to access the files in a directory is
opendir DH, $dir or die "opendir: $!";
while (my $file = readdir DH) {
    next unless -f "$dir/$file";
    # do whatever needs to be done with "$dir/$file";
}
However, this fails given the combination of two facts:
1) $dir is encoded internally in UTF8 (even if $dir doesn't
contain any non-ASCII characters)
2) $file contains non-ASCII characters
The string "$dir/$file" becomes UTF8-encoded, and while it
prints correctly, and compares equal to the same string not
UTF8-encoded, apparently the internal encoding is used
in a stat() (or open()) call, which then fails with $! being
"No such file".
Is there a way to work around this without needing to
transcode all strings that might be UTF8-encoded? $dir is
being read in from a config file using a module (XML::Simple),
so I don't have a lot of control over how it's initialized.
I know I could recast the code to chdir() to $dir, but that
would be a significant change given the current code structure.
This is on Solaris, using 5.8.0, though I've verified
similar behavior on Windows with 5.8.7. I've tried different
settings for LC_ALL, and it doesn't seem to make a difference.
Below is a more complete program to demonstrate the bug. It
assumes that a directory "t2" already exists, with
a suitably-named file in it (I used "fil\351.txt").
Thanks,
Gary Ansok
#! /opt/perl/5.8.0/bin/perl
use strict;
use warnings;

my $show_bug = 1;
my $dir = 't2';

if ($show_bug) {    # force $dir to be UTF8-encoded
    $dir .= "\x{100}";
    chop $dir;
}

print "Opening dir '$dir'\n";
opendir DH, $dir or die "opendir: $!";
while (my $file = readdir DH) {
    print "Checking file '$dir/$file'\n";
    next unless -f "$dir/$file";
    print "Found file '$dir/$file'\n";
}
Re: UTF8 strings and filesystem access
on 11.10.2007 02:31:13 by Ben Morrow
Quoth ansok@alumni.caltech.edu (Gary E. Ansok):
> One way to access the files in a directory is
>
> opendir DH, $dir or die "opendir: $!";
> while (my $file = readdir DH) {
> next unless -f "$dir/$file";
> # do whatever needs to be done with "$dir/$file";
> }
>
> However, this fails given the combination of two facts:
> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
> contain any non-ASCII characters)
> 2) $file contains non-ASCII characters
>
> The string "$dir/$file" becomes UTF8-encoded, and while it
> prints correctly, and compares equal to the same string not
> UTF8-encoded, apparently the internal encoding is used
> in a stat() (or open()) call, which then fails with $! being
> "No such file".
>
> Is there a way to work around this without needing to
> transcode all strings that might be UTF8-encoded?
No, not with current versions of perl. All interactions with the system
use raw byte-strings[1], so you will need to encode them correctly in
your local character set for open, and decode them from readdir.
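For example, a minimal (untested) sketch of that pattern, assuming the
local character set is Latin-1 (substitute whatever your locale
actually uses):

use Encode qw(encode decode);

my $dir_bytes = encode('iso-8859-1', $dir);        # character string -> locale bytes
opendir my $dh, $dir_bytes or die "opendir: $!";
while (my $file_bytes = readdir $dh) {
    next unless -f "$dir_bytes/$file_bytes";       # stay in bytes for filesystem calls
    my $file = decode('iso-8859-1', $file_bytes);  # decode only for text handling/display
    print "found $file\n";
}
closedir $dh;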
Ben
[1] The -C switch used to switch to the Unicode API on Win32, but no one
used it and the switch was removed in 5.8.1.
Re: UTF8 strings and filesystem access
on 11.10.2007 22:28:16 by hjp-usenet2
On 2007-10-11 00:31, Ben Morrow wrote:
>
> Quoth ansok@alumni.caltech.edu (Gary E. Ansok):
>> One way to access the files in a directory is
>>
>> opendir DH, $dir or die "opendir: $!";
>> while (my $file = readdir DH) {
>> next unless -f "$dir/$file";
>> # do whatever needs to be done with "$dir/$file";
>> }
>>
>> However, this fails given the combination of two facts:
>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
>> contain any non-ASCII characters)
Then why is it a wide string?
>> 2) $file contains non-ASCII characters
>>
>> The string "$dir/$file" becomes UTF8-encoded, and while it
>> prints correctly, and compares equal to the same string not
>> UTF8-encoded, apparently the internal encoding is used
>> in a stat() (or open()) call, which then fails with $! being
>> "No such file".
>>
>> Is there a way to work around this without needing to
>> transcode all strings that might be UTF8-encoded?
>
> No, not with current versions of perl. All interactions with the system
> use raw byte-strings[1], so you will need to encode them correctly in
> your local character set for open, and decode them from readdir.
or alternatively, treat file names as opaque byte strings.
> [1] The -C switch used to switch to the Unicode API on Win32, but no one
> used it and the switch was removed in 5.8.1.
The switch is still there but it does something different now: It
controls whether I/O streams and command line parameters are in UTF-8.
I use
#!/usr/bin/perl -CSAL
quite often.
hp
--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Sysadmin WSR | with an emu on his shoulder.
| | | hjp@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"
Re: UTF8 strings and filesystem access
on 12.10.2007 00:22:22 by ansok
In article ,
Peter J. Holzer wrote:
>> Quoth ansok@alumni.caltech.edu (Gary E. Ansok):
>>>
>>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
>>> contain any non-ASCII characters)
>
>Then why is it a wide string?
It's read in using XML::Simple from a config file that does not
contain any non-ASCII characters, or any encoding specification in
the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).
Now that I've dug a little deeper, I think upgrading some of our
module versions may help avoid this problem -- a recent change to
XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
independent of document encoding".
The module versions we're using:
XML::Simple 2.16, XML::SAX 0.12, XML::LibXML 1.52, libxml2.so.2.6.26
Gary
Re: UTF8 strings and filesystem access
on 14.10.2007 15:33:24 by hjp-usenet2
On 2007-10-11 22:22, Gary E. Ansok wrote:
> In article ,
> Peter J. Holzer wrote:
>>> Quoth ansok@alumni.caltech.edu (Gary E. Ansok):
>>>>
>>>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
>>>> contain any non-ASCII characters)
>>
>>Then why is it a wide string?
>
> It's read in using XML::Simple from a config file that does not
> contain any non-ASCII characters, or any encoding specification in
> the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).
The prolog really can't (or at least shouldn't) make any difference: It
specifies how the file is encoded, but the result of parsing the file is
always text which possibly contains wide characters.
You should decide whether you want to treat filenames as text or as
byte strings within your script.
If you want to treat them as text (e.g. because you want to do
operations like case-mapping, substrings, etc. on them), explicitly
encode them with the local character set just before using them in open,
stat, etc.
use Encode qw(encode);   # for encode()

$dir_as_text = $xml_simple->{foo}{dir};
$filename_as_text = $xml_simple->{foo}{bar}[42]{title};
$filename_as_text = lc(substr($filename_as_text, 0, 20));
$filename_as_text = "$dir_as_text/$filename_as_text.pdf";
$filename_as_bytes = encode('us-ascii', $filename_as_text);  # bytes only at the last moment
open($fh, '<', $filename_as_bytes);
If you want to treat them as byte strings, explicitly encode any text
string you get from a different source (in your case, from an XML file)
as early as possible.
$dir_as_bytes = encode('us-ascii', $xml_simple->{foo}{dir});
$basename_as_bytes = encode('us-ascii', $xml_simple->{foo}{bar}[42]{title});
$filename_as_bytes = "$dir_as_bytes/$basename_as_bytes.pdf";
open($fh, '<', $filename_as_bytes);
> Now that I've dug a little deeper, I think upgrading some of our
> module versions may help avoid this problem -- a recent change to
> XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
> independent of document encoding".
You omitted an important piece here: The entry reads
"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
$node->toString returns a piece of XML, which should always be a series
of bytes, not characters. I haven't looked at the source code of
XML::Simple, but it probably uses $text->data or $node->nodeValue.
hp
--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Sysadmin WSR | with an emu on his shoulder.
| | | hjp@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"
Re: UTF8 strings and filesystem access
on 15.10.2007 19:03:25 by ansok
In article ,
Peter J. Holzer wrote:
>On 2007-10-11 22:22, Gary E. Ansok wrote:
>> In article ,
>> Peter J. Holzer wrote:
>>>> Quoth ansok@alumni.caltech.edu (Gary E. Ansok):
>>>>>
>>>>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
>>>>> contain any non-ASCII characters)
>>>
>>>Then why is it a wide string?
>>
>> It's read in using XML::Simple from a config file that does not
>> contain any non-ASCII characters, or any encoding specification in
>> the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).
>
>> Now that I've dug a little deeper, I think upgrading some of our
>> module versions may help avoid this problem -- a recent change to
>> XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
>> independent of document encoding".
>
>You omitted an important piece here: The entry reads
>"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
>$node->toString returns a piece of XML, which always should be a series
>of bytes, not characters. I haven't looked at the source code of
>XML::Simple, but it probably uses $text->data or $node->nodeValue.
I've worked around the problem by switching from XML::LibXML to
XML::SAX::PurePerl as the underlying parser -- now, the string
read in from the configuration file no longer has the UTF8 flag
set, and the problem does not appear.
I still think it's a bug that a string that can successfully opendir()
a directory, combined (including the appropriate separator) with a
file name read in by readdir(), does not result in a string that can
be used to open() or stat() the file. Especially since the path appears
correct when printed as part of an error message, and it's difficult
to diagnose the problem without resorting to something like Devel::Peek.
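For anyone else who hits this, a minimal way to make the hidden
difference visible (with $dir and $file as in the demo script above):

use Devel::Peek qw(Dump);
use Encode qw(is_utf8);

my $path = "$dir/$file";
print "path: $path\n";                                   # prints correctly
print "UTF8 flag: ", (is_utf8($path) ? "on" : "off"), "\n";
Dump($path);   # FLAGS shows UTF8; PV shows the raw bytes that get passed to the OS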
Thanks for the assistance,
Gary Ansok
Re: UTF8 strings and filesystem access
on 27.10.2007 11:32:50 by hjp-usenet2
On 2007-10-15 17:03, Gary E. Ansok wrote:
> In article ,
> Peter J. Holzer wrote:
>>On 2007-10-11 22:22, Gary E. Ansok wrote:
>>> In article ,
>>> Peter J. Holzer wrote:
>>>>> Quoth ansok@alumni.caltech.edu (Gary E. Ansok):
>>>>>>
>>>>>> 1) $dir is encoded internally in UTF8 (even if $dir doesn't
>>>>>> contain any non-ASCII characters)
>>>>
>>>>Then why is it a wide string?
>>>
>>> It's read in using XML::Simple from a config file that does not
>>> contain any non-ASCII characters, or any encoding specification in
>>> the XML prolog (though adding "encoding='ISO-8859-1'" didn't help).
>>
>>> Now that I've dug a little deeper, I think upgrading some of our
>>> module versions may help avoid this problem -- a recent change to
>>> XML::LibXML mentioned "strip-off UTF8 flag for consistent behavior
>>> independent of document encoding".
>>
>>You omitted an important piece here: The entry reads
>>"strip-off UTF8 flag with $node->toString($format,1) for consistent ..."
>>$node->toString returns a piece of XML, which always should be a series
>>of bytes, not characters. I haven't looked at the source code of
>>XML::Simple, but it probably uses $text->data or $node->nodeValue.
Why did you quote this paragraph? You don't seem to reply to it.
> I've worked around the problem by switching from XML::LibXML to
> XML::SAX::PurePerl as the underlying parser -- now, the string
> read in from the configuration file no longer has the UTF8 flag
> set, and the problem does not appear.
Probably because you now have two bugs which cancel each other out.
The charset handling of XML::SAX::PurePerl is severely broken[0] - don't
use it.
> I still think it's a bug that a string that can successfully opendir()
> a directory, combined (including the appropriate separator) with a
> file name read in by readdir(), does not result in a string that can
> by used to open() or stat() the file.
I agree. However, the opendir() only worked accidentally in your code
because the directory name just happened to contain only characters <=
0x7F. If it had contained a character >= 0x80 (like the file name you
read) it would have failed, too. It is the nature of buggy code that it
appears to work sometimes. The real fix is to explicitly encode/decode
strings as required.
> Especially since the path appears correct when printed as part of an
> error message, and it's difficult to diagnose the problem without
> resorting to something like Devel::Peek.
I think that open should work the same whether the filename argument
is a wide or narrow string. But I'm not sure how it should behave: There
are arguments for viewing a file name as a sequence of bytes and for
viewing it as a sequence of characters. The latter is usually more
convenient, but it makes some tasks impossible (e.g., renaming files
with "illegal" byte sequences). Maybe we need the equivalent of IO
layers for filenames, too. Or at least a flag "take filename encoding
from the locale".
hp
[0] Actually just outdated: The current release is older than perl 5.8,
so it doesn't know about perl 5.8 Unicode support.
--
_ | Peter J. Holzer | I know I'd be respectful of a pirate
|_|_) | Sysadmin WSR | with an emu on his shoulder.
| | | hjp@hjp.at |
__/ | http://www.hjp.at/ | -- Sam in "Freefall"