polymorphic regex -- encoding issue

on 18.10.2007 10:28:13 by Dale

Consider the following:

my $html_string = get "http://stock.narod.ru/fibo.htm";
my $russian_page = decode("cp1251", $html_string);
while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
    print "$1 $2\n";
}

I get a CP1251-encoded page from a Russian site and search for words
that might follow the word Фибоначчи (Fibonacci). But isn't this bit
of code inefficient? I start right off by decoding the whole page,
where I really only need to have decoded those portions of the page
that match. So wouldn't it be better to encode the regex in CP1251 to
do the matching, and then convert any matched strings to the encoding
I want before printing them out? Something like the following:

$russian_page = get "http://stock.narod.ru/fibo.htm";
my $search_word = encode("cp1251", "Фибоначчи");
while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
    print decode("cp1251", "$1 $2\n");
}

This doesn't obviously fail, but it doesn't give the expected result
either. Presumably, the problem is that I've only encoded part of my
regex in CP1251. So the question is: Is there a way to change the
encoding of a regular expression?

A couple details:

Perl version:
5.8.8

Pragmas and modules used:
LWP::Simple
utf8;
Encode;
binmode(STDOUT, ":utf8");

Re: polymorphic regex -- encoding issue

on 18.10.2007 14:36:27 by Ben Morrow

Quoth Dale:
> Consider the following:
>
> my $html_string = get "http://stock.narod.ru/fibo.htm";
> my $russian_page = decode("cp1251", $html_string);
> while ($russian_page =~ m/(Фибоначчи)\s+\b(\w+)/g) {
> print "$1 $2\n";
> }
>
> I get a CP1251-encoded page from a Russian site and search for words
> that might follow the word Фибоначчи (Fibonacci). But isn't this bit
> of code inefficient? I start right off by decoding the whole page,
> where I really only need to have decoded those portions of the page
> that match. So wouldn't it be better to encode the regex in CP1251 to
> do the matching, and then convert any matched strings to the encoding
> I want before printing out. Something like the following:
>
> $russian_page = get "http://stock.narod.ru/fibo.htm";
> my $search_word = encode("cp1251", "Фибоначчи");
> while ($russian_page =~ m/($search_word)\s+(\w+)/g) {
> print decode("cp1251", "$1 $2\n");
> }
>
> This doesn't obviously fail, but it doesn't give the expected result
> either. Presumably, the problem is that I've only encoded part of my
> regex in CP1251. So the question is: Is there a way to change the
> encoding of a regular expression?

Nope, there isn't. All you can do is encode all the separate parts into
bytes, and then ask for a regex that matches by bytes.
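
A minimal sketch of what such a byte-level match could look like. It
assumes the page really arrives as undecoded CP1251 bytes; an in-memory
string stands in for the result of get(), and the sample sentence and
variable names are made up for illustration. It uses \S+ rather than \w+
for the second word, because what \w matches on raw bytes is a separate
problem.

```perl
use strict;
use warnings;
use utf8;                          # the Cyrillic literals below are UTF-8 source text
use Encode qw(encode decode);

# Stand-in for the bytes get() would return (assumed: undecoded CP1251).
my $page_bytes = encode('cp1251', "Числа Фибоначчи образуют ряд");

# Encode the literal part of the pattern into the same byte representation.
my $word_bytes = encode('cp1251', 'Фибоначчи');

# Both operands are byte strings, so the match runs byte by byte.
my @matches;
while ($page_bytes =~ /(\Q$word_bytes\E)\s+(\S+)/g) {
    my ($first, $second) = ($1, $2);     # copy the captures before calling decode
    push @matches, decode('cp1251', "$first $second");
}

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @matches;
```

Only the captured substrings are decoded, which is the saving the
original question was after. The sketch does not use the bytes pragma,
on the assumption that both operands are already plain byte strings.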

At the very least you want a 'use bytes' around that regex and match.
You also need to be aware that perl will be doing a byte-by-byte match,
so if it's possible for part of a character to match, you will get
false positives. Whether that can happen depends on the encoding: it is
possible with UTF-16, but not with UTF-8, for instance. I'm afraid I
don't know about cp1251. You also need to be sure that LWP is returning
you the page as bytes, and not trying to be clever and decoding it to
UTF-8 already. I presume you already know that.
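
The false-positive risk can be made concrete with a small sketch. The
sample characters are chosen purely for illustration. In UTF-16BE the
bytes of the letter "a" turn up across the boundary of two unrelated
characters; CP1251, being a single-byte encoding, has no such
boundaries inside a character.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Two CJK characters, U+4E00 and U+6100, as UTF-16BE bytes: 4E 00 61 00.
my $text_bytes = encode('UTF-16BE', "\x{4E00}\x{6100}");

# The letter "a" (U+0061) as UTF-16BE bytes: 00 61.
my $needle_bytes = encode('UTF-16BE', 'a');

# The text contains no "a", yet the needle's bytes occur at offset 1,
# straddling the boundary between the two characters.
my $utf16_offset = index($text_bytes, $needle_bytes);

# CP1251 uses exactly one byte per character, so a byte-level hit
# always lines up with whole characters.
my $cp1251_text  = encode('cp1251', "\x{424}\x{438}");   # two Cyrillic letters
my $cp1251_bytes = length $cp1251_text;                   # 2 bytes for 2 characters

print "UTF-16BE false positive at byte offset $utf16_offset\n";
print "CP1251 bytes for two characters: $cp1251_bytes\n";
```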

Unless you have an awful lot of these matches to do (and you know this
is what's slowing you down), it's not worth the bother.

Ben

Re: polymorphic regex -- encoding issue

on 19.10.2007 13:55:09 by Dale

Thanks Ben. The problem is, of course, consistency. I want to make
sure that I also convert '\w' and '\s' so that they match the same
things that they would have matched in the original regex. The perldoc
says one can influence what '\w' matches by using locales. But I
managed to find a consistent translation without using locales (now
I'm answering my own question):


# As before, I search for the word Fibonacci, in CP1251-encoded Cyrillic
my $search_word = encode("cp1251", "Фибоначчи");

# CP1251 is an extended ASCII charset in the range 00-FF. Here we
# get this set of characters and decode them into Unicode.
my @cp1251_charset =
    split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

# Find out which of these characters are matched by '\w' (in Unicode).
my @cp1251_wordchars =
    grep(/\w/, @cp1251_charset);

# The matched word characters are put back into CP1251
my $w = encode("CP1251", join("", @cp1251_wordchars));

# We follow the same idea as above for the space characters.
my @cp1251_spacechars =
    grep(/\s/, @cp1251_charset);
my $s = encode("CP1251", join("", @cp1251_spacechars));

# Now we just put the pieces together
my $russian_page = get "http://stock.narod.ru/fibo.htm";
while ($russian_page =~ m/($search_word)[$s]([$w]+)/g) {
    print decode("cp1251", "$1 $2\n");
}
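
The reason plain '\w' cannot simply be reused on the raw page can be
checked in isolation. A small sketch, assuming default regex semantics
(no "use locale" and no unicode_strings feature); the variable names
are only for illustration:

```perl
use strict;
use warnings;
use utf8;                               # the Cyrillic literal below is UTF-8 source text
use Encode qw(encode decode);

my $byte = encode('cp1251', 'и');       # a single CP1251 byte, 0xE8
my $char = decode('cp1251', $byte);     # the character U+0438

# On an undecoded byte string, \w only covers the ASCII word characters.
my $bytes_match = $byte =~ /\w/ ? 1 : 0;

# On a decoded string, \w follows Unicode rules and accepts the letter.
my $chars_match = $char =~ /\w/ ? 1 : 0;

printf "raw byte 0x%02X matches \\w: %s\n", ord $byte, $bytes_match ? 'yes' : 'no';
printf "decoded U+%04X matches \\w: %s\n",  ord $char, $chars_match ? 'yes' : 'no';
```

This is why the hand-built classes $w and $s are needed once the page
is left as bytes.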


Details (same as in previous version):

Perl version
5.8.8

modules used
Encode;
LWP::Simple qw(get);
utf8;
binmode(STDOUT, ":utf8");

Note: Why didn't I use setlocale, as the Perldoc suggests? First
reason: Our computers are somehow set up with a very limited range of
possible locales. Second reason: locales are confusing for me. I
prefer to avoid them. I set my environment to en_US.utf8 and I don't
want to think about locales any more after that.

Re: polymorphic regex -- encoding issue

on 19.10.2007 23:09:13 by Ilya Zakharevich

[A complimentary Cc of this posting was sent to Dale], who wrote in
article <1192794909.831588.269070@z24g2000prh.googlegroups.com>:
> # CP1251 is an extended ASCII charset in the range 00-FF. Here we
> # get this set of characters and decode them into Unicode.
> my @cp1251_charset =
> split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));
>
> # Find out which of these characters are matched by '\w' (in Unicode).
> my @cp1251_wordchars =
> grep(/\w/, @cp1251_charset);
>
> # The matched word characters are put back into CP1251
> my $w = encode("CP1251", join("", @cp1251_wordchars));

Too baroque, IMO. I would use something like

my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr, 0x00..0xFF;

Your approach has a chance to be quicker, though, but since this
should only run once... [I did not benchmark them.]
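
Whether the two constructions yield the same class can be checked
directly. The sketch below restates both in block form, with variable
names chosen only for this comparison, and compares the resulting byte
strings; it assumes CP1251 round-trips every character it defines.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Step-by-step version: decode all 256 bytes at once, filter, encode back.
my @charset   = split //, decode('cp1251', join '', map { chr($_) } 0x00 .. 0xFF);
my $w_stepped = encode('cp1251', join '', grep { /\w/ } @charset);

# Per-byte version: decode each byte on its own and keep the byte itself.
my $w_per_byte = join '',
    grep { decode('cp1251', $_) =~ /\w/ }
    map  { chr($_) } 0x00 .. 0xFF;

print $w_stepped eq $w_per_byte ? "same class\n" : "classes differ\n";
```

The per-byte version never encodes anything: it only asks, for each
byte, whether its decoded character is a word character.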

Ilya

Re: polymorphic regex -- encoding issue

on 20.10.2007 15:48:41 by rvtol+news

Ilya Zakharevich wrote:

> my $w = join '', grep +(decode 'cp1251', $_) =~ /\w/, map chr,
> 0x00..0xFF;

Alternative:

my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;

--
Affijn, Ruud

"Gewoon is een tijger."

Re: polymorphic regex -- encoding issue

on 21.10.2007 07:54:30 by Dale Gerdemann

Thanks Ilya and Affijn for your "improvements" but I still like my own
code better, because at least I break it down into commented steps. I
know my comments are minimal, but at least I tried. The reader of my
code is bound to find several things confusing:

> my @cp1251_charset =
> split(//, decode("CP1251", join("", map { chr } 0x00..0xFF)));

Here are some questions that are bound to arise:

Why "decode CP1251"? How can you see that the input was ever encoded
as CP1251 to begin with? We must be assuming that 'chr' returns
something that can at least be thought of as CP1251 encoded. But
consider the small test program:

print chr(0xFF);

This may print out ÿ (LATIN SMALL LETTER Y WITH DIAERESIS), a
character that doesn't even exist in CP1251. Of course, it only prints
out this character if you're using "binmode(STDOUT, ":utf8");" or "use
encoding 'utf8';", but you can see that there is plenty of room for
confusion.

Then there is the issue of what is stored in "@cp1251_charset". Since
it's the output of 'decode', then it must be decoded, right? Whatever
"decoded" means. You see my point. A comment would be helpful, and
this won't be possible if you pack everything into one line.

But what the "improvers" of my code also missed is that I had a second
reason for the intermediate step. I wanted the complete CP1251 charset
stored in a variable so that I could make several passes through it.
As you see in the small example I made two passes: once for '\w' and
once for '\s'.

I'm sure there are legitimate improvements that could be made to my
code, but it baffles me that people should see packing into a oneliner
as something virtuous.

Dale Gerdemann

Re: polymorphic regex -- encoding issue

on 21.10.2007 18:02:04 by rvtol+news

Dale Gerdemann wrote:

> Thanks Ilya and Affijn for your "improvements" but I still like my own
> code better, because at least I break it down into commented steps.

Ahem, you are replying to the wrong message. I reply to the part that I
quote. So the relation to your code was broken by me on purpose.


> But what the "improvers" of my code also missed is that I had a second
> reason for the intermediate step. I wanted the complete CP1251 charset
> stored in a variable so that I could make several passes through it.
> As you see in the small example I made two passes. Once for '\w' and
> once for '\s'.

What you are missing is that the $w in

my $w = pack "C*", grep decode('cp1251', chr) =~ /\w/, 0..255;

contains exactly what is in your $w.

So for $s you can just do:

my $s = pack "C*", grep decode('cp1251', chr) =~ /\s/, 0..255;


Perhaps you like it more like this:

my $cp1251_word_chars =
    pack("C*", grep decode('cp1251', chr) =~ /\w/, 0..255);
my $cp1251_whitespace_chars =
    pack("C*", grep decode('cp1251', chr) =~ /\s/, 0..255);

so that your

m/($search_word)[$s]([$w]+)/g

becomes

m/($search_word)[$cp1251_whitespace_chars]([$cp1251_word_chars]+)/g


And maybe you should allow more than 1 whitespace character there:

m/($search_word)[$cp1251_whitespace_chars]+([$cp1251_word_chars]+)/g


And if your $search_word can ever contain regex metacharacters, look
into quotemeta.
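
Pulling these suggestions together, a sketch might look like the
following. The helper byte_class, the sample sentence and the variable
names are hypothetical and not from the thread; the page is again
replaced by an in-memory CP1251 byte string. quotemeta is applied both
to the search word and to the generated classes, so nothing in them can
be read as regex syntax.

```perl
use strict;
use warnings;
use utf8;                               # the Cyrillic literals below are UTF-8 source text
use Encode qw(encode decode);

# Hypothetical helper: collect the bytes of a single-byte encoding whose
# decoded character matches $re, escaped with quotemeta so the result is
# safe inside a bracketed character class.
sub byte_class {
    my ($encoding, $re) = @_;
    my $bytes = pack 'C*', grep { decode($encoding, chr($_)) =~ $re } 0 .. 255;
    return quotemeta $bytes;
}

my $word_chars       = byte_class('cp1251', qr/\w/);
my $whitespace_chars = byte_class('cp1251', qr/\s/);
my $search_word      = quotemeta encode('cp1251', 'Фибоначчи');

# One or more whitespace bytes, then one or more word bytes.
my $pattern = qr/($search_word)[$whitespace_chars]+([$word_chars]+)/;

# Stand-in for the fetched page (assumed: undecoded CP1251 bytes).
my $page_bytes = encode('cp1251', "ряд Фибоначчи  начинается с нуля");

my @found;
while ($page_bytes =~ /$pattern/g) {
    my ($first, $second) = ($1, $2);    # copy the captures before calling decode
    push @found, decode('cp1251', "$first $second");
}

binmode(STDOUT, ':encoding(UTF-8)');
print "$_\n" for @found;
```

Because the helper takes the encoding and the test pattern as
arguments, further classes (digits, for instance) need only one more
call each.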

--
Affijn, Ruud

"Gewoon is een tijger."

Re: polymorphic regex -- encoding issue

on 24.10.2007 08:23:44 by Ilya Zakharevich

[A complimentary Cc of this posting was sent to Dale Gerdemann], who
wrote in article <1192946070.082466.54940@t8g2000prg.googlegroups.com>:
> But what the "improvers" of my code also missed is that I had a second
> reason for the intermediate step. I wanted the complete CP1251 charset
> stored in a variable so that I could make several passes through it.
> As you see in the small example I made two passes. Once for '\w' and
> once for '\s'.

What makes you think that "improvers of your code" missed this? At
least, I explicitly said that your solution might be quicker.

> I'm sure there are legitimate improvements that could be made to my
> code, but it baffles me that people should see packing into a oneliner
> as something virtuous.

It was not "your code packed into a oneliner". It was absolutely
different code; and if you do not like oneliners, just unpack it using
dummy variables.

What your code had was using encode/decode cycle, while your intent
was, obviously, to do only a decode. I corrected your code to match
your intent.

Hope this helps,
Ilya