#1: Matching Greek letters in UTF-8 file

Posted on 2011-09-29 13:42:49 by hamann

Hi,

I need to write a regex that matches any single Greek letter followed by a hyphen in a UTF-8 text file that is otherwise in English.

How can I match the Greek alphabet (lower and upper case)?

Thanks,
Thomas

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/


#2: Re: Matching Greek letters in UTF-8 file

Posted on 2011-09-29 15:58:09 by John Delacour

At 11:42 +0000 29/9/11, Hamann, T.D. (Thomas) wrote:

>I need to write a regex that matches any single Greek letter
>followed by a hyphen in a UTF-8 text file that is otherwise in
>English.
>
>How can I match the Greek alphabet (lower and upper case)?

#!/usr/local/bin/perl
use strict;
use utf8;
use encoding 'utf-8';
$_ = "[...] ὧ-δε μεθήμονα [...]";
print $1 if /(\p{Greek}-)/;

This will match polytonic Greek as well, as you will see if you put a
hyphen after any letter in $_

JD


#3: Re: Matching Greek letters in UTF-8 file

Posted on 2011-09-29 16:59:10 by Brian Fraser


On Thu, Sep 29, 2011 at 10:58 AM, John Delacour <johndelacour@gmail.com> wrote:

> use encoding 'utf-8';

Nitpick: Please don't use this, as encoding is broken. use utf8; and use
open qw< :std :encoding(UTF-8) >; should make do for a replacement.
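
For illustration, here is a minimal sketch of the match with that replacement in place. The sample sentence is made up, not taken from the original poster's file, and the script assumes it is itself saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # the source below contains literal Greek
use open qw< :std :encoding(UTF-8) >;   # decode STDIN, encode STDOUT and STDERR

# Hypothetical sample text, not from the original poster's document.
my $text = "The α-helix and the β-sheet are common; Ω-loops are rarer.";

# \p{Greek} matches one character of the Greek script, upper or lower case.
my @hits = $text =~ /(\p{Greek}-)/g;

print "$_\n" for @hits;
```

Each captured item is one Greek letter plus the hyphen that follows it; the English words around them are left alone.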

To the original poster, please note that there's a bit of a difference in
case-insensitive matching (i.e. using /i) -- newer versions of Perl do full
casefolding (so \N{GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI}
matches \N{GREEK SMALL LETTER ALPHA WITH PSILI}\N{GREEK SMALL LETTER IOTA}),
whereas older versions don't. So if you need to do that, I'd recommend
giving the docs a thorough read. Also this:
http://98.245.80.27/tcpc/OSCON2011/upr.html
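
A small sketch of that casefolding case, assuming Perl 5.16 or later so that the fc() built-in is available. The variable names are only for illustration.

```perl
use v5.16;      # enables strict, say, and the fc() built-in
use warnings;

# GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI
my $upper = "\N{U+1F88}";
# GREEK SMALL LETTER ALPHA WITH PSILI followed by GREEK SMALL LETTER IOTA
my $lower = "\N{U+1F00}\N{U+03B9}";

# Full case folding maps the single capital to the two-character sequence.
my $folds_equal = fc($upper) eq fc($lower) ? 1 : 0;

# Under /i the regex engine is expected to bridge the same length difference.
my $regex_match = $upper =~ /\A$lower\z/i ? 1 : 0;

say "fc() comparison: $folds_equal";
say "regex with /i:   $regex_match";
```

On an older Perl without full casefolding the regex line is the one that would behave differently, which is why the version matters here.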


#4: Re: Matching Greek letters in UTF-8 file

Posted on 2011-09-29 21:03:16 by John Delacour

At 11:59 -0300 29/9/11, Brian Fraser wrote:


>On Thu, Sep 29, 2011 at 10:58 AM, John Delacour
><johndelacour@gmail.com> wrote:
>
>use encoding 'utf-8';
>
>Nitpick: Please don't use this, as encoding is broken. use utf8; and
>use open qw< :std :encoding(UTF-8) >; should make do for a
>replacement.

Nitpick: Why the upper-case charset name?

Interesting to hear that encoding is broken. I came across a problem
the other day which I couldn't work out at all. But if you include
'qw< :std :encoding(utf-8) >', is that not also using encoding?

Another thing I realized the other day is that a problem that
existed in Perl 5.10 and 5.12, even when I updated Encode for these
versions, seemed to have been solved in 5.14.

JD


#5: Re: Matching Greek letters in UTF-8 file

Posted on 2011-09-29 22:29:35 by Brian Fraser


On Thu, Sep 29, 2011 at 4:03 PM, John Delacour <johndelacour@gmail.com> wrote:

> Nitpick: Why the upper-case charset name?

Uppercase is UTF-8-strict, while lowercase is the lax version that perl uses
internally. Unless you are passing data from one perl program to another,
and you are using illegal-UTF8-but-legal-UTFX (like if you define your own
new characters beyond 10FFFF, which Perl allows but strict UTF-8 shouldn't),
there's basically no good reason to use the lax version.
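
To make the strict/lax difference concrete, here is a rough sketch using the Encode module directly. It assumes the names 'UTF-8' and 'utf8' resolve the way the Encode documentation describes, and it builds a code point just past 10FFFF purely as a test value.

```perl
use strict;
use warnings;
no warnings 'utf8';    # a non-Unicode code point is created on purpose below
use Encode qw(encode find_encoding FB_CROAK LEAVE_SRC);

# The two names resolve to different encoding objects.
my $strict_name = find_encoding('UTF-8')->name;
my $lax_name    = find_encoding('utf8')->name;
print "UTF-8 resolves to $strict_name, utf8 resolves to $lax_name\n";

my $beyond = chr(0x110000);    # one past the last Unicode code point

# The lax encoding accepts the value and produces a four-byte sequence.
my $lax_bytes = encode('utf8', $beyond, FB_CROAK | LEAVE_SRC);

# The strict encoding refuses it when asked to croak on errors.
my $strict_ok = eval { encode('UTF-8', $beyond, FB_CROAK | LEAVE_SRC); 1 } ? 1 : 0;

printf "lax produced %d bytes, strict %s\n",
    length($lax_bytes), $strict_ok ? "accepted it" : "refused it";
```

For ordinary text such as the Greek in this thread both encodings give the same bytes, so the strict one costs nothing.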


> Interesting to hear that encoding is broken. I came across a problem the
> other day which I couldn't work out at all. But if you include 'qw< :std
> :encoding(utf-8) >', is that not also using encoding?

Sorta, but not quite. use encoding ...; does a couple of different things,
and most of them not that well. Foremost it sets the source encoding to some
arbitrary value, but also the default encodings for IO streams -- whereas
use open ...; only sets the default encodings, with :std setting them for
STD(IN|ERR|OUT).
The :encoding() part actually refers to a layer provided by a module, not
the encoding pragma.
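
As a sketch of what the :encoding() layer does on its own, without either pragma setting defaults, the following writes a short string through the layer and reads it back both raw and decoded. The temporary file and the sample string are assumptions made for the example.

```perl
use strict;
use warnings;
use utf8;                          # the literal below is Greek
use File::Temp qw(tempfile);

my ($tmp_fh, $path) = tempfile(UNLINK => 1);
close $tmp_fh or die "close: $!";

my $word = "αβγ-";                 # hypothetical sample: three letters and a hyphen

# Explicit layer on one handle: characters in, UTF-8 bytes on disk.
open my $out, '>:encoding(UTF-8)', $path or die "open for write: $!";
print {$out} $word;
close $out or die "close: $!";

# Without a decoding layer we see the bytes as stored.
open my $raw, '<:raw', $path or die "open raw: $!";
my $bytes = do { local $/; <$raw> };
close $raw;

# With the layer the bytes are decoded back into characters.
open my $in, '<:encoding(UTF-8)', $path or die "open for read: $!";
my $chars = do { local $/; <$in> };
close $in;

printf "%d bytes on disk, %d characters after decoding\n",
    length($bytes), length($chars);
```

This prints "7 bytes on disk, 4 characters after decoding". With use open in effect, the same layer is applied by default, so plain two-argument modes like '<' and '>' behave like the explicit forms above.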

Yeah, it's a mess. :)


#6: Re: Matching Greek letters in UTF-8 file

Posted on 2011-09-30 00:26:01 by John Delacour

At 17:29 -0300 29/9/11, Brian Fraser wrote:

>On Thu, Sep 29, 2011 at 4:03 PM, John Delacour <johndelacour@gmail.com> wrote:
>
>>Nitpick: Why the upper-case charset name?
>
>Uppercase is UTF-8-strict, while lowercase is the lax version that
>perl uses internally. Unless you are passing data from one perl
>program to another, and you are using illegal-UTF8-but-legal-UTFX
>(like if you define your own new characters beyond 10FFFF, which
>Perl allows but strict UTF-8 shouldn't), there's basically no good
>reason to use the lax version.

Right. Thank you for the explanation. Perl has so affected my
thinking that I'd forgotten that RFC 3629 names it upper case 'UTF-8'.

>>But if you include 'qw< :std :encoding(utf-8) >', is that not also
>>using encoding?
>
>...The :encoding() part actually refers to a layer provided by a
>module, not the encoding pragma.

Gotcha.

>Yeah, it's a mess. :)

Yes, but thank goodness for Dan Kogai et al. I'd sooner have them
sort it out than do it myself!

JD




#7: RE: Matching Greek letters in UTF-8 file

Posted on 2011-10-10 13:51:07 by hamann

Many thanks for the replies. Reading the documentation, it looks like it's a bit more complicated than I had hoped.

On the other hand, I realized that for my purpose (removing unwanted hyphens from an OCR'ed document), I don't actually need to match the Greek letters, because they occur in two unique formats throughout the whole document (which should match \w- and -\w- ).
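
One way such a cleanup could look, as a sketch only: the rule below (drop a hyphen only when at least two word characters sit on each side of it) and the sample sentence are assumptions, not the rule actually used on the document. It relies on \w matching Greek letters once the text is decoded.

```perl
use strict;
use warnings;
use utf8;                               # literal Greek in the sample below
use open qw< :std :encoding(UTF-8) >;

# Hypothetical OCR output with stray hyphens inside ordinary words.
my $ocr = "The α-helix is a sec-ondary struc-ture; poly-β-alanine is not.";

# Remove a hyphen only when two or more word characters stand on both sides.
# A single letter before or after the hyphen (the \w- and -\w- forms) blocks
# the match, so the Greek prefixes keep their hyphens.
(my $clean = $ocr) =~ s/(?<=\w\w)-(?=\w\w)//g;

print "$clean\n";
```

This prints "The α-helix is a secondary structure; poly-β-alanine is not." A rule this blunt would also join genuinely hyphenated words, so it only fits a document where those do not occur.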

Thomas

