Advice on how to approach character translation

on 23.04.2008 11:34:22 by Chandra

Dear Folks,

A scheme called ITRANS uses sequences of one to three printable ASCII
characters to represent, unambiguously, each character of an Indic script or
of a Romanized script called IAST. Since the characters in these scripts
have Unicode code points, it should be possible to automate the transliteration
of words in the ASCII source text into the desired Unicode output text.

I am trying to write a Perl script to do this and would appreciate advice on how
best to proceed before I start.

To give a better picture of what I am trying to do, I have given some examples
below for ASCII to IAST characters:

--------
1. Transliteration of between one and three ASCII printing characters to one
Unicode character.

2. Many characters are unchanged by the transliteration.

3. Some transliteration examples are shown below:

a       a   U+0061   LATIN SMALL LETTER A
aa      ā   U+0101   LATIN SMALL LETTER A WITH MACRON
A       ā   U+0101   LATIN SMALL LETTER A WITH MACRON
.a      '   U+0027   APOSTROPHE
~N      ṅ   U+1E45   LATIN SMALL LETTER N WITH DOT ABOVE
RRI     ṝ   U+1E5D   LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
R^I     ṝ   U+1E5D   LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
--------
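
One way to sanity-check rows of such a table is to let Perl report the code point and the Unicode name of each target character. The sketch below is only an illustration: `describe` is a hypothetical helper name, and the three sample pairs are copied from the examples in this post. `charnames::viacode` is the core function that returns a character name for a given code point.

```perl
use strict;
use warnings;
use charnames ();                       # provides charnames::viacode
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: describe one character as "U+XXXX NAME".
sub describe {
    my ($char) = @_;
    my $cp = ord $char;
    return sprintf 'U+%04X %s', $cp, charnames::viacode($cp);
}

# ASCII sequence and target character, taken from the examples in the post.
for my $pair ( [ 'aa', "\x{101}" ], [ '~N', "\x{1E45}" ], [ 'RRI', "\x{1E5D}" ] ) {
    my ($ascii, $unicode) = @$pair;
    printf "%-4s %s  %s\n", $ascii, $unicode, describe($unicode);
}
```

If a name printed this way differs from the name recorded in the table, the table entry probably carries the wrong character.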

Many thanks.

Chandra

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: Advice on how to approach character translation

on 23.04.2008 17:30:30 by chas.owens

On Wed, Apr 23, 2008 at 5:34 AM, R (Chandra) Chandrasekhar wrote:
....
> I am trying to write a Perl script to do this and would appreciate advice
> on how best to proceed before I start.
....

The easiest way I can think of is to build a (UTF-8) file named
itrans2unicode.table that looks like this

a   => a
aa => ā
~N => ṅ

Then read that file into a hash at startup and then process the file
line by line using a regex like

$line =~ s/(.)/$table{$1}/g;

There is supposedly a full table at
http://www.aczoom.com/itrans/#itransencoding but I was unable to load
that page.

-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

Re: Advice on how to approach character translation

on 24.04.2008 00:50:48 by Jenda Krynicky

From: "R (Chandra) Chandrasekhar"
> 3. Some transliteration examples are shown below:
>
> a       a   U+0061   LATIN SMALL LETTER A
> aa      ā   U+0101   LATIN SMALL LETTER A WITH MACRON
> A       ā   U+0101   LATIN SMALL LETTER A WITH MACRON
> .a      '   U+0027   APOSTROPHE
> ~N      ṅ   U+1E45   LATIN SMALL LETTER N WITH DOT ABOVE
> RRI     ṝ   U+1E5D   LATIN SMALL LETTER R WITH DOT BELOW AND MACRON
> R^I     ṝ   U+1E5D   LATIN SMALL LETTER R WITH DOT BELOW AND MACRON

Put the transliteration rules into a hash like this:

%trans = (
    'aa' => 'ā',
    'A'  => 'ā',
    '.a' => "'",
    ...
);

and build a regexp to match the 1-3 characters to replace:

@signs = sort {length($b) <=> length($a)} keys %trans;
@signs = map quotemeta($_) @signs;
$re = join '|', @signs, '.';

and use the regexp to split the text into pieces and transliterate
them.

$text =~ s/($re)/exists($trans{$1}) ? $trans{$1} : $1/geo;
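
Put together, the steps above might look like the following self-contained sketch. It is not the author's script: the table holds only the entries from the examples in this thread, `transliterate` is a hypothetical helper name, and the catch-all `.` alternative is left out, since characters that match no key are simply not touched by the substitution.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # this source contains literal UTF-8 characters
binmode STDOUT, ':encoding(UTF-8)';

# Partial table, taken from the examples quoted in this thread.
my %trans = (
    'aa'  => 'ā',
    'A'   => 'ā',
    '.a'  => "'",
    '~N'  => 'ṅ',
    'RRI' => 'ṝ',
    'R^I' => 'ṝ',
);

# Longest keys first, so that a long sequence is tried before a shorter
# one; quotemeta protects '.', '^' and similar regex metacharacters.
my $alternation = join '|',
    map  { quotemeta($_) }
    sort { length($b) <=> length($a) } keys %trans;
my $re = qr/$alternation/;

# Hypothetical helper: characters that match no key pass through unchanged.
sub transliterate {
    my ($text) = @_;
    $text =~ s/($re)/$trans{$1}/g;
    return $text;
}

print transliterate('raama'), "\n";     # prints "rāma"
```

Compiling the pattern once with `qr//` plays the same role as the `o` flag in the one-liner above.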

HTH, Jenda
===== Jenda@Krynicky.cz === http://Jenda.Krynicky.cz =====
When it comes to wine, women and song, wizards are allowed
to get drunk and croon as much as they like.
-- Terry Pratchett in Sourcery



Re: Advice on how to approach character translation

on 24.04.2008 17:40:12 by Chandra

Chas. Owens wrote:

> The easiest way I can think of is to build a (UTF-8) file named
> itrans2unicode.table that looks like this
>
> a => a
> aa => ā
> ~N => ṅ
>

I have successfully created the file lookup.table containing lines as suggested
above with ASCII and Unicode characters separated by ' => '.

> Then read that file into a hash at startup

Is there an easy way to do this directly?

When I read the file into a hash, I used ' => ' as a separator pattern for split
and key value assignments as shown below:

-----------
#!/usr/bin/perl -C24
use warnings;
use diagnostics;
use strict;
use utf8;

# Read the whole lookup table, decoding it as UTF-8.
open my $fh, "<:utf8", "lookup.table"
    or die "Cannot open lookup.table: $!";
my @lookup = <$fh>;
close $fh;
binmode STDOUT, ':utf8';

my %lookup = ();
foreach my $line (@lookup)
    {
    chomp $line;    # drop the newline so it does not end up in the value
    my ($key, $value) = split / => /, $line;
    $lookup{$key} = $value;
    print "$key => $lookup{$key}\n";
    }
-----------

Is there another, easier way to load the file into a hash, using the already
existing => symbol in the file?

Otherwise, inserting the ' => ' seems a wasted effort. One could just as well
have used the original two column space or tab separated file and read it in
using the -a option and @F array to assign the ASCII symbol in column one to the
key and the Unicode symbol in column two to the value.
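
For the two-column variant described here, a loader could be sketched as below. This is an assumption-laden illustration rather than a recommended design: `load_table` is a hypothetical helper name, and the table is read from an in-memory string so that the example runs without a lookup.table file on disk.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: read "key value" lines from an open filehandle.
sub load_table {
    my ($fh) = @_;
    my %table;
    while ( my $line = <$fh> ) {
        chomp $line;
        next unless $line =~ /\S/;                  # skip blank lines
        my ($key, $value) = split ' ', $line, 2;    # split on runs of whitespace
        $table{$key} = $value;
    }
    return %table;
}

# Stand-in for the file: UTF-8 bytes held in a scalar and opened as a
# filehandle. A real script would open lookup.table with the same layer.
my $bytes = encode( 'UTF-8', "aa \x{101}\n~N \x{1E45}\n" );
open my $fh, '<:encoding(UTF-8)', \$bytes
    or die "Cannot open in-memory table: $!";
my %lookup = load_table($fh);
close $fh;

binmode STDOUT, ':encoding(UTF-8)';
print "$_ => $lookup{$_}\n" for sort keys %lookup;
```

Splitting on `' '` accepts spaces or tabs between the columns, so the same loader handles either layout.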

Thank you.

Chandra


Re: Advice on how to approach character translation

on 24.04.2008 18:22:56 by chas.owens

On Thu, Apr 24, 2008 at 11:40 AM, R (Chandra) Chandrasekhar wrote:
....
> Is there another, easier way to load the file into a hash, using the
> already existing => symbol in the file?
....

There is no great benefit to using => as the separator.  I used it
because of its implied meaning in Perl (key on the left, value on the
right).

Also the substitution I mentioned in my email won't work for you.  Your
patterns are between one and three characters long (and the regex
dealt with a character at a time).  You will probably need something
more like

my $pattern = join "|", sort keys %lookup;
$pattern = qr/$pattern/;

while (<>) {
    s/($pattern)/$lookup{$1}/ge;
    print;
}


-- 
Chas. Owens
wonkden.net
The most important skill a programmer can have is the ability to read.

Re: Advice on how to approach character translation

on 25.04.2008 11:41:47 by Chandra

Jenda Krynicky wrote:
....
> and build a regexp to match the 1-3 characters to replace:
>
> @signs = sort {length($b) <=> length($a)} keys %trans;

Thanks for this priceless construct. It was very helpful indeed.

> @signs = map quotemeta($_) @signs;

@signs = map quotemeta($_), @signs; # needed a comma here

> $re = join '|', @signs, '.';
>
> and use the regexp to split the text into pieces and transliterate
> them.
>
> $text =~ s/($re)/exists($trans{$1}) ? $trans{$1} : $1/geo;

I have confirmed that it works as intended.

> HTH, Jenda

Thank you very much.

Chandra


Re: Advice on how to approach character translation

on 26.04.2008 11:41:42 by rvtol+news

"Jenda Krynicky" wrote:

> @signs = map quotemeta($_) @signs;

@signs = map quotemeta($_), @signs;

(there was a comma missing)

which you could even write as

@signs = map quotemeta, @signs;

--
Affijn, Ruud

"Gewoon is een tijger."

sub uniq {
my $prev;
map $_ eq ($prev || "") ? () : ($prev = $_), @_;
}



Re: Advice on how to approach character translation

on 26.04.2008 13:39:16 by peng.kyo

On Sat, Apr 26, 2008 at 5:41 PM, Dr.Ruud wrote:
> "Jenda Krynicky" wrote:
>
>
> > @signs = map quotemeta($_) @signs;
>
> @signs = map quotemeta($_), @signs;
>
> (there was a comma missing)
>
> which you could even write as
>
> @signs = map quotemeta, @signs;
>

or:
@signs = map { quotemeta } @signs;

--
J. Peng - QQMail Operation Team
eMail: peng.kyo@gmail.com AIM: JeffHua


Re: Advice on how to approach character translation

on 26.04.2008 13:54:51 by rvtol+news

J. Peng wrote:
> Dr.Ruud:
>> Jenda Krynicky:

>>> @signs = map quotemeta($_) @signs;
>>
>> @signs = map quotemeta($_), @signs;
>> (there was a comma missing)
>> which you could even write as
>> @signs = map quotemeta, @signs;
>
> or:
> @signs = map { quotemeta } @signs;

That is not the same, because it sets up a codeblock.
See `perldoc -f map`.

--
Affijn, Ruud

"Gewoon is een tijger."


Re: Advice on how to approach character translation

on 26.04.2008 13:59:08 by peng.kyo

On Sat, Apr 26, 2008 at 7:54 PM, Dr.Ruud wrote:
> J. Peng wrote:
> > Dr.Ruud:
> >> Jenda Krynicky:
>
>
> >>> @signs = map quotemeta($_) @signs;
> >>
> >> @signs = map quotemeta($_), @signs;
> >> (there was a comma missing)
> >> which you could even write as
> >> @signs = map quotemeta, @signs;
> >
> > or:
> > @signs = map { quotemeta } @signs;
>
> That is not the same, because it sets up a codeblock.
> See `perldoc -f map`.
>

I mean the results are the same, not the process.
See also `perldoc -f map'.

--
J. Peng - QQMail Operation Team
eMail: peng.kyo@gmail.com AIM: JeffHua


Re: Advice on how to approach character translation

on 26.04.2008 15:28:59 by rvtol+news

J. Peng wrote:
> Dr.Ruud:
>> J. Peng:
>>> Dr.Ruud:
>>>> Jenda Krynicky:

>>>>> @signs = map quotemeta($_) @signs;
>>>>
>>>> @signs = map quotemeta($_), @signs;
>>>> (there was a comma missing)
>>>> which you could even write as
>>>> @signs = map quotemeta, @signs;
>>>
>>> or:
>>> @signs = map { quotemeta } @signs;
>>
>> That is not the same, because it sets up a codeblock.
>> See `perldoc -f map`.
>
> I mean the results are the same, not the process.
> See also `perldoc -f map'.

One of the results of writing it the way you added, is that a codeblock
is set up.
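
The two spellings discussed in this exchange can be compared directly. The sketch below uses three sample strings taken from the transliteration examples in the thread and only shows that both forms of `map` return the same list; it makes no claim about how perl executes them internally.

```perl
use strict;
use warnings;

my @signs = ( '.a', 'R^I', '~N' );

my @with_expr  = map quotemeta, @signs;       # map EXPR, LIST
my @with_block = map { quotemeta } @signs;    # map BLOCK LIST

print "@with_expr\n";                         # prints: \.a R\^I \~N
print "same list\n" if "@with_expr" eq "@with_block";
```

In both forms `quotemeta` works on `$_`, which `map` sets to each element in turn.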

--
Affijn, Ruud

"Gewoon is een tijger."

