potential changes to Locale-PO
on 26.02.2006 15:45:36 by Kalle Olavi Niemitalo

I have been using Locale-PO-0.16 for two scripts that are now in
the source tree of ELinks. I have patched PO.pm to add new features
and fix bugs, and mailed the patches to the maintainer, Alan Schwartz.
New features:
- Locale::PO supports obsolete entries.
- The PO file parser no longer requires newlines between entries.
- The PO file parser tries to preserve even semantically
  insignificant newlines in strings.
- The PO file parser remembers the line number where each msgid
  or msgstr begins.
- The save_file method returns undef and remembers $! if print
  fails.
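The save_file behaviour in the last item can be sketched as follows.
This is a minimal illustration, not the code from the patched PO.pm:
save_chunks is a hypothetical helper, and the only point is that a
failed print leads to an undef return with the reason still in $!.

```perl
use strict;
use warnings;

# Hypothetical helper, not the real save_file: writes the given chunks
# to $path. Returns 1 on success; on failure returns undef (in scalar
# context) and leaves the reason in $!.
sub save_chunks {
    my ($path, @chunks) = @_;
    open(my $out, '>', $path) or return;    # $! was set by open
    for my $chunk (@chunks) {
        next if print {$out} $chunk;
        my $errno = $! + 0;                 # close() may overwrite $!
        close($out);
        $! = $errno;                        # restore the print error
        return;
    }
    close($out) or return;                  # buffered write errors show up here
    return 1;
}
```

A caller would then test the return value and report $!, for example
with: save_chunks($path, @text) or warn "cannot save $path: $!";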
Bug fixes:
- Locale::PO preserves the complete set of flags in each entry,
  even those flags that it does not directly support.
- The PO file parser compares names of flags exactly and
  case-sensitively, like GNU Gettext does. It no longer
  truncates e.g. objc-format to c-format.
- The php-format flag is now tristate, like c-format.
- The PO file parser binds $/ and $_ dynamically, thus insulating
  itself from the caller.
- The dump method dumps comments even if they are eq "0".
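To illustrate the $/ and $_ item above, here is a minimal sketch of a
reader that localizes both variables. It is not taken from PO.pm;
count_lines is a made-up function that only shows how "local" keeps
the caller's values intact.

```perl
use strict;
use warnings;

# Hypothetical reader, not taken from PO.pm: counts the lines of a
# filehandle while leaving the caller's $/ and $_ untouched.
sub count_lines {
    my ($fh) = @_;
    local $/ = "\n";    # read line by line, whatever the caller has set
    local $_;           # while (<$fh>) assigns to the global $_
    my $count = 0;
    while (<$fh>) {
        $count++;
    }
    return $count;
}

# Usage: the caller is in slurp mode and keeps its own value in $_.
my $text = "msgid \"a\"\nmsgstr \"b\"\n";
open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
{
    local $/ = undef;
    local $_ = "caller's value";
    my $lines = count_lines($fh);
    print "counted $lines lines\n";
    print "caller's \$_ survived\n" if $_ eq "caller's value";
    print "caller's \$/ is still undef\n" unless defined $/;
}
close($fh);
```

Without the two "local" lines, the while loop would overwrite the
caller's $_, and the function would read the whole file as one
record because the caller had undefined $/.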
Documentation changes:
- Copied the copyright notice from README to PO.pm itself.
- Documented quoting and newlines in strings passed to/from methods.
- Documented the php_format, load_file, and save_file methods.
- Documented error handling in load_file_asarray and load_file_ashash.
- Documented the bugs that I know of.
- Separated getter and setter synopses from each other. Also
  repeated the synopsis above the description of each method.
Other changes:
- "use fields" and "my Locale::PO".
- Renamed normalize_str to _normalize_str, and dump_multi_comment
  to _dump_multi_comment.
- Locale::PO objects store the flags in a different format.
- Flag-setting functions silently map unsupported values
  (e.g. 42) to supported ones (e.g. 1), which they also return.
- It is possible to get a Locale::PO object without a msgid by
  loading an invalid PO file. Writing such an entry back out
  does not generate a msgid, either.
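The flag handling described in the lists above (exact comparison,
tristate flags, unknown flags preserved, setters that normalize their
argument) could look roughly like this. It is a sketch under
assumptions, not the patched PO.pm: the storage format (an array of
flag names under the key "flags") and the names get_tristate and
set_tristate are invented for illustration, and mapping every true
value to 1 and every defined false value to 0 is my guess at what
"supported values" means.

```perl
use strict;
use warnings;

# Assumed storage format: every flag name seen in the "#," comment,
# in order, as an array reference under the key 'flags'.

# Getter: 1 if NAME is present, 0 if no-NAME is present, undef if
# neither. Names are compared exactly and case-sensitively, so
# "objc-format" never counts as "c-format".
sub get_tristate {
    my ($entry, $name) = @_;
    for my $flag (@{ $entry->{flags} }) {
        return 1 if $flag eq $name;
        return 0 if $flag eq "no-$name";
    }
    return undef;
}

# Setter: normalizes the value to 1, 0, or undef, updates the list
# without touching unrelated flags, and returns the normalized value.
sub set_tristate {
    my ($entry, $name, $value) = @_;
    $value = $value ? 1 : 0 if defined $value;
    @{ $entry->{flags} } =
        grep { $_ ne $name && $_ ne "no-$name" } @{ $entry->{flags} };
    push @{ $entry->{flags} }, ($value ? $name : "no-$name")
        if defined $value;
    return $value;
}
```

With an entry whose flags are "fuzzy" and "objc-format",
get_tristate($entry, "c-format") is undef, and
set_tristate($entry, "c-format", 42) stores "c-format" and returns 1
while leaving the other two flags in place.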
I have not yet updated the tests, primarily because I've been
working in the ELinks source tree and importing the tests there
did not seem right. (ELinks uses Locale::PO at build time only;
it doesn't install the patched version to the user's system.)
I intend to rectify this after I install some scripts to help
propagate changes between version control systems.
Now, Alan Schwartz has suggested that I take over the Locale-PO
module. I am afraid of doing that: I don't know how long my
interest in this module will last, and I don't want to become
trapped in supporting it. I don't currently have a PAUSE
account, either. So, I'd like to know the c.l.p.m opinions on
such a change.
In any case, whether I become the maintainer or just submit
patches, I think it would be good to get in touch with the users
of the module, so that I could be sure that the changes are going
in the right direction and don't gratuitously break people's
programs. Specifically:
- How important is it to run fast and use little memory?
- Is it necessary to support anything older than Perl 5.6.0?
- If a malformed PO file is being loaded, do you want warnings
  during the load, afterwards, or not at all?
- Do you access the hash of Locale::PO directly?
- Do you define subclasses of Locale::PO?
- Do you define any variables as 'my Locale::PO $foo' or check that
  '$foo->isa("Locale::PO")' or that 'ref($foo) eq "Locale::PO"'?
- The msgstr_n method returns a reference to a hash. Do you
  modify that hash? If so, do you expect the modifications
  to affect the Locale::PO object?
- The msgid, msgid_plural, msgstr, and msgstr_n accessor methods
  return strings in one format and want new strings in a
  different format. I'd like to straighten this out so that the
  same format can be used in both directions. Also, I'd like to
  make it possible and hopefully even easy to get the string with
  all \n etc. backslash sequences expanded out. How should these
  things be done in a compatible way?
  (a) Keep the inconsistency:
        $po->msgid($po->dequote($po->msgid()))
  (b) New methods for different formats:
        $po->msgid_quoted($po->msgid_quoted())
  (c) First arg is a hash of options:
        $po->msgid({-quote=>"full"}, $po->msgid({-quote=>"full"}))
      This would require extra trickery with msgstr_n, which
      already takes a hash; and it might be too easy to mistype
      an option.
- PO files normally declare their charset. In Unicode-capable
  Perls, it should be easy for users of Locale::PO to get the
  strings converted to Perl's internal Unicode representation.
  This applies both to the actual strings and to any comments.
  However, for the sake of applications that don't call
  bindtextdomain(), Locale::PO should preserve the exact bytes
  (including redundant shift sequences) as far as possible.
  Note also that some encodings can use the backslash ASCII code
  0x5C as part of a multibyte character, which may affect the
  quote and dequote methods. How should the Unicode strings be
  accessible?
  (a) Each Locale::PO object holds the byte strings and the
      name of the charset. Methods convert from/to Unicode
      when necessary. There are two methods for changing the
      charset: one preserves the byte strings, and the other
      recodes them.
      Con: If you build a Locale::PO object from scratch (as
           opposed to loading it from a file), you need to select
           the charset before you set any Unicode strings.
      Con: If you change the charset in the Content-Type of the
           header entry and then save_file_fromarray, the other
           entries will keep their previous encoding. To avoid
           that, one must loop over the entries and change the
           charset of each.
  (b) Each Locale::PO object holds a mix of byte strings and
      Unicode strings and remembers which is which (or perhaps it
      just tests with Encode::is_utf8). It also holds the name
      of the charset of the byte strings. Loading from a file
      stores byte strings and copies the name of the charset from
      the header entry. Saving a string to a file recodes it to
      match the charset listed in the Content-Type of the header
      entry, unless it is a byte string and the charset is
      already correct.
      Pro: If one changes neither the Content-Type nor the
           strings, then loading a file and writing it back out
           does not alter the bytes.
      Pro: To recode the file, one only has to change the
           Content-Type; one need not know all the fields of
           Locale::PO that may hold strings.
      ???: To change the Content-Type without recoding the
           strings, one must loop over the entries and change
           their stored charset to match the Content-Type.
           But this is unusual to want, so it's OK if it is
           cumbersome.
  (c) As in (b) but entries hold a pointer to the header entry,
      instead of the name of the charset. That way, they can
      also access Plural-Forms and whatever.
      Con: Too easy to get miscoded strings by changing the
           Content-Type.
  (d) Locale::PO objects hold byte strings only. There are
      separate methods for converting strings to/from Unicode.
      Con: Recoding the whole file becomes difficult.
- Should changing msgid and msgstr wipe out the saved line numbers?
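To make option (b) of the charset question more concrete, the saving
step could be sketched like this. It is only an illustration under
assumptions, not code from PO.pm: string_for_output is a hypothetical
helper, and it uses the Encode::is_utf8 test that option (b) mentions
merely as a possibility. That test looks at Perl's internal flag, so
a character string that Perl happens to store without the flag would
be treated as a byte string.

```perl
use strict;
use warnings;
use Encode qw(decode encode find_encoding is_utf8);

# Hypothetical helper for option (b): returns the bytes to write for
# one stored string.
#   $str            - the stored string (bytes or Unicode)
#   $stored_charset - charset of the entry's byte strings
#   $target_charset - charset from the header entry's Content-Type
sub string_for_output {
    my ($str, $stored_charset, $target_charset) = @_;
    my $from = find_encoding($stored_charset)
        or die "unknown charset: $stored_charset";
    my $to = find_encoding($target_charset)
        or die "unknown charset: $target_charset";

    # Unicode string: always encode it for the target charset.
    return encode($target_charset, $str) if is_utf8($str);

    # Byte string whose charset already matches: keep the exact bytes.
    return $str if $from->name eq $to->name;

    # Byte string in some other charset: recode it.
    return encode($target_charset, decode($stored_charset, $str));
}
```

For example, the ISO-8859-1 byte "\xE4" is written unchanged when the
header still says ISO-8859-1, and is recoded to "\xC3\xA4" once the
header says UTF-8. Comparing the canonical names from find_encoding
lets aliases of the same charset count as "already correct".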
Interested persons may find a patched version as po/perl/Locale/PO.pm
in the ELinks GIT repository. An even newer version is temporarily
at
I am not yet the maintainer of Locale-PO. This is not a formal
release, and future releases may be incompatible with these versions.