utf-8

on 31.12.2007 20:33:00 by julia_2683

I run perl v5.8.7 and my regular expression is ($txt =~ m/(\w+|é\w+)/g),
which does not match every utf-8 word. How can I make this regular
expression match every utf-8 word?

Re: utf-8

on 31.12.2007 20:45:57 by Joost Diepenmaat

julia_2683@hotmail.com writes:

> I run perl v5.8.7 and my regular expression is ($txt =~ m/(\w+|é\w+)/g),
> which does not match every utf-8 word. How can I make this regular
> expression match every utf-8 word?

Just \w should work, provided you're handling your encodings correctly *and*
your $txt is actually utf-8 encoded internally. That \w depends on the
internal encoding at all is IMO a bug.

Note that if your script itself is utf-8 encoded you need to "use utf8"
somewhere at the top of your script.

For instance:

#!/usr/bin/perl -w
use strict;

# set output stream as utf-8 encoded (I have a utf-8 enabled terminal)
binmode STDOUT,":utf8";

my $str="\x{e9}"; # "é", not necessarily as utf-8 - very likely latin-1
utf8::upgrade($str); # force utf-8 encoding

print "$str was ",($str =~ /\w+/ ? "" : "not "),"matched\n";
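
When $txt arrives as raw utf-8 bytes (read from a file or socket without a decoding layer), the same idea can be sketched as follows. This is only a minimal illustration: the sample bytes and the variable names $bytes and @words are assumptions, not part of the original question. It decodes the bytes with the core Encode module first and then collects every \w+ run.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# set output stream as utf-8 encoded (assumes a utf-8 enabled terminal)
binmode STDOUT, ":utf8";

# hypothetical input: the raw utf-8 bytes for "café élan"
my $bytes = "caf\xc3\xa9 \xc3\xa9lan";

# turn the bytes into Perl characters; the result is utf-8 encoded internally
my $txt = decode("UTF-8", $bytes);

# collect every run of word characters
my @words = $txt =~ /(\w+)/g;

print scalar(@words), " words: ", join(" | ", @words), "\n";
```

Matching against $bytes directly would split the words at the bytes of "é", since the undecoded bytes \xc3 and \xa9 are not word characters under Perl's default (non-locale) rules.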

Joost.