Matching Greek letters in UTF-8 file

on 29.09.2011 13:42:49 by hamann

Hi,

I need to write a regex that matches any single Greek letter followed by a hyphen in a UTF-8 text file that is otherwise in English.

How can I match the Greek alphabet (lower and upper case)?

Thanks,
Thomas

--
To unsubscribe, e-mail: beginners-unsubscribe@perl.org
For additional commands, e-mail: beginners-help@perl.org
http://learn.perl.org/

Re: Matching Greek letters in UTF-8 file

on 29.09.2011 15:58:09 by John Delacour

At 11:42 +0000 29/9/11, Hamann, T.D. (Thomas) wrote:

>I need to write a regex that matches any single Greek letter
>followed by a hyphen in a UTF-8 text file that is otherwise in
>English.
>
>How can I match the Greek alphabet (lower and upper case)?

#!/usr/local/bin/perl
use strict;
use utf8;
use encoding 'utf-8';
$_ = "Ναυσικάα, τί νύ σ᾽ ὧ-δε μεθήμονα γείνατο μήτηρ;";
print $1 if /(\p{Greek}-)/;

This will match polytonic Greek as well, as you will see if you put a
hyphen after any letter in $_.

JD
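
A note on the pattern above: \p{Greek} matches any character in the Greek script, which also covers a few non-letters such as the koronis (U+1FBD) in the sample line. The following is only a sketch of one way to narrow the match to a single Greek letter before a hyphen. The helper name greek_letter_hyphens is hypothetical, the input is assumed to be already decoded (and precomposed) text, and the lookbehind reflects one reading of "single letter", namely a letter that is not the tail of a longer Greek word.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;    # this source file contains Greek literals

# Hypothetical helper: returns every "Greek letter + hyphen" pair
# found in an already-decoded character string.
sub greek_letter_hyphens {
    my ($text) = @_;
    # (?<!\p{Greek})  the character is not preceded by another Greek character
    # (?=\p{L})       the character is a letter, not a Greek-script symbol
    return $text =~ /(?<!\p{Greek})((?=\p{L})\p{Greek}-)/g;
}

my @hits = greek_letter_hyphens("The α-helix and Ω-3, but not co-factor or δε-.");

binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @hits;    # prints "α-" and "Ω-", one per line
```

If a hyphen after the last letter of a longer Greek word should count too, dropping the lookbehind gives that behaviour.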


Re: Matching Greek letters in UTF-8 file

on 29.09.2011 16:59:10 by Brian Fraser


On Thu, Sep 29, 2011 at 10:58 AM, John Delacour wrote:

> use encoding 'utf-8';
>
>

Nitpick: Please don't use this, as the encoding pragma is broken. use utf8; together
with use open qw< :std :encoding(UTF-8) >; should do as a replacement.
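
As a rough illustration of that suggestion, here is the example from earlier in the thread with the two pragmas swapped in for use encoding. It is a sketch only, and it assumes the script itself is saved as UTF-8.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                # the source code is UTF-8
use open qw< :std :encoding(UTF-8) >;    # default layer for STDIN/STDOUT/STDERR and newly opened handles

# Sample line from earlier in the thread, with a hyphen placed after one letter.
my $line = "Ναυσικάα, τί νύ σ᾽ ὧ-δε μεθήμονα γείνατο μήτηρ;";

# Prints the Greek letter and the hyphen that follows it.
print "$1\n" if $line =~ /(\p{Greek}-)/;
```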

To the original poster, please note that there's a bit of a difference in
case-insensitive matching (i.e. using /i) -- newer versions of Perl do full
casefolding (so \N{GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI}
matches \N{GREEK SMALL LETTER ALPHA WITH PSILI}\N{GREEK SMALL LETTER IOTA}),
whereas older versions don't. So if you need to do that, I'd recommend
giving the docs a thorough read. Also this:
http://98.245.80.27/tcpc/OSCON2011/upr.html
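
To see which behaviour a given perl has, a small check along these lines may help. It is a sketch that uses only the two character sequences named above; the result depends on the Perl version, so no particular output is promised here.

```perl
use strict;
use warnings;
use charnames ':full';    # makes \N{...} names available on older perls too

my $upper = "\N{GREEK CAPITAL LETTER ALPHA WITH PSILI AND PROSGEGRAMMENI}";
my $lower = "\N{GREEK SMALL LETTER ALPHA WITH PSILI}\N{GREEK SMALL LETTER IOTA}";

# With full case folding, the single capital folds to the two-character
# lowercase sequence, so a case-insensitive match succeeds.
my $full_fold = $upper =~ /\A$lower\z/i ? 1 : 0;

printf "perl %vd: the /i match %s\n", $^V,
    $full_fold ? "succeeds (full case folding)" : "fails (simple case folding only)";
```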


Re: Matching Greek letters in UTF-8 file

on 29.09.2011 21:03:16 by John Delacour

At 11:59 -0300 29/9/11, Brian Fraser wrote:


>On Thu, Sep 29, 2011 at 10:58 AM, John Delacour wrote:
>
>use encoding 'utf-8';
>
>Nitpick: Please don't use this, as encoding is broken. use utf8; and
>use open qw< :std :encoding(UTF-8) >; should make do for a
>replacement.

Nitpick: Why the upper-case charset name?

Interesting to hear that encoding is broken. I came across a problem
the other day which I couldn't work out at all. But if you include
'qw< :std :encoding(utf-8) >', is that not also using encoding?

Another thing I realized the other day is that a problem that
existed in Perl 5.10 and 5.12, even when I updated Encode for these
versions, seems to have been solved in 5.14.

JD


Re: Matching Greek letters in UTF-8 file

on 29.09.2011 22:29:35 by Brian Fraser


On Thu, Sep 29, 2011 at 4:03 PM, John Delacour wrote:

> Nitpick: Why the upper-case charset name?
>

Uppercase is UTF-8-strict, while lowercase is the lax version that perl uses
internally. Unless you are passing data from one perl program to another,
and you are using illegal-UTF8-but-legal-UTFX (like if you define your own
new characters beyond 10FFFF, which Perl allows but strict UTF-8 shouldn't),
there's basically no good reason to use the lax version.
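
The difference can be observed with the Encode module directly. This is a minimal sketch; can_encode is a hypothetical helper name, and the code point 0x110000 is simply the first one past the Unicode range.

```perl
use strict;
use warnings;
no warnings 'utf8';    # keep perl quiet about the non-Unicode code point used below
use Encode ();

# Hypothetical helper: returns 1 if $encoding will encode $string, 0 if it croaks.
sub can_encode {
    my ($encoding, $string) = @_;
    my $ok = eval {
        Encode::encode($encoding, $string, Encode::FB_CROAK() | Encode::LEAVE_SRC());
        1;
    };
    return $ok ? 1 : 0;
}

my $beyond = chr(0x110000);    # one past U+10FFFF, legal in perl but not in Unicode

for my $encoding (qw(utf8 UTF-8)) {
    printf "%-6s %s code point 0x110000\n", $encoding,
        can_encode($encoding, $beyond) ? "accepts" : "rejects";
}
```

The lax name utf8 lets the value through, while the strict name UTF-8 refuses it, which is why the strict spelling is the safer default for file I/O.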


>
> Interesting to hear that encoding is broken. I came across a problem the
> other day which I couldn't work out at all. But if you include 'qw< :std
> :encoding(utf-8) >', is that not also using encoding?
>
>
Sorta, but not quite. use encoding ...; does a couple of different things,
and most of them not that well. Foremost it sets the source encoding to some
arbitrary value, but also the default encodings for IO streams -- whereas
use open ...; only sets the default encodings, with :std setting them for
STD(IN|ERR|OUT).
The :encoding() part actually refers to a layer provided by a module, not
the encoding pragma.

Yeah, it's a mess. :)
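
To illustrate that the :encoding() layer works without either pragma, here is a small sketch that attaches it to a single handle. The layer itself is implemented by the PerlIO::encoding module. An in-memory file stands in for a real UTF-8 text file; that stand-in is an assumption made for the demo.

```perl
use strict;
use warnings;

# Raw UTF-8 bytes for "α-" followed by a newline (U+03B1 is 0xCE 0xB1).
my $bytes = "\xCE\xB1-\n";

# The layer is given per handle here; neither "use encoding" nor "use open" is loaded.
open my $fh, '<:encoding(UTF-8)', \$bytes or die "open failed: $!";
my $line = <$fh>;
close $fh;
chomp $line;

# The two bytes of alpha arrive as one character.
printf "%d characters, first is U+%04X\n", length $line, ord $line;
# prints "2 characters, first is U+03B1"
```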


Re: Matching Greek letters in UTF-8 file

on 30.09.2011 00:26:01 by John Delacour

At 17:29 -0300 29/9/11, Brian Fraser wrote:

>On Thu, Sep 29, 2011 at 4:03 PM, John Delacour wrote:
>
>>Nitpick: Why the upper-case charset name?
>
>Uppercase is UTF-8-strict, while lowercase is the lax version that
>perl uses internally. Unless you are passing data from one perl
>program to another, and you are using illegal-UTF8-but-legal-UTFX
>(like if you define your own new characters beyond 10FFFF, which
>Perl allows but strict UTF-8 shouldn't), there's basically no good
>reason to use the lax version.

Right. Thank you for the explanation. Perl has so affected my
thinking that I'd forgotten that RFC 3629 names it upper case 'UTF-8'.

>>But if you include 'qw< :std :encoding(utf-8) >', is that not also
>>using encoding?
>
>...The :encoding() part actually refers to a layer provided by a
>module, not the encoding pragma.

Gotcha.

>Yeah, it's a mess. :)

Yes, but thank goodness for Dan Kogai et al. I'd sooner have them
sort it out than do it myself!

JD




RE: Matching Greek letters in UTF-8 file

on 10.10.2011 13:51:07 by hamann

Many thanks for the replies. Reading the documentation, it looks like it's a bit more complicated than I had hoped.

On the other hand, I realized that for my purpose (removing unwanted hyphens from an OCR'ed document), I don't actually need to match the Greek letters, because they occur in two unique formats throughout the whole document (which should match \w- and -\w- ).

Thomas
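
One caveat for the \w approach: \w only treats a Greek letter as a word character once the text has been decoded from bytes to characters. The sketch below shows the difference. The helper name lone_char_hyphens and the sample text are hypothetical, and the lookbehind is just one way to express "a single character standing alone before a hyphen".

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: returns each "single word character + hyphen"
# where the character is not part of a longer word.
sub lone_char_hyphens {
    my ($text) = @_;
    return $text =~ /(?<!\w)(\w-)/g;
}

my $bytes = "\xCE\xB1-tubulin and co-\nfactor";    # raw UTF-8 bytes, as read without a layer
my $chars = decode('UTF-8', $bytes);               # the same text as characters

# Count the matches in each form of the text.
printf "bytes: %d match(es), characters: %d match(es)\n",
    scalar(() = lone_char_hyphens($bytes)),
    scalar(() = lone_char_hyphens($chars));
```

On the decoded string the alpha and its hyphen are found, while the hyphen in "co-" is left out because the "o" belongs to a longer word.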

