Checking the presence of multibyte char in a string

on 05.01.2006 06:23:39 by klam

Hi,

Can anybody point me to a CPAN module that provides a
routine to check for the presence of multibyte characters within a
string? Thanks in advance.

Rgds,
K.Lam

Re: Checking the presence of multibyte char in a string

on 05.01.2006 13:19:52 by jurgenex

klam wrote:
> Can anybody point me to a CPAN module that provides a
> routine to check for the presence of multibyte characters within a
> string?

That is impossible. Depending on which encoding is being used, the same byte
sequence can and typically will mean different things, e.g. a sequence of
several characters or maybe one single character. If you don't know the
encoding, then you are pretty much out of luck. Heuristics for determining
the encoding are not very successful.

However, if you do know the encoding, then you can check which characters are
represented in one byte for your encoding (e.g. any character in
Windows-1252 or ISO-Latin-1 or English characters in UTF-8 or ...) and which
characters are represented in multiple bytes for your encoding (e.g. Chinese
characters in Windows-936 or German umlauts or French accented characters in
UTF-8 or any character in UTF-16 or UTF-32).
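
To make the point concrete, here is a minimal sketch using the core Encode
module. The sample bytes and the choice of encodings are only illustrative
assumptions; the same two bytes decode to a different number of characters
depending on the encoding you claim for them, and one character occupies a
different number of bytes in each encoding.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Two raw bytes; what they "mean" depends entirely on the assumed encoding.
my $bytes = "\xC3\xA4";

my $as_latin1 = decode('ISO-8859-1', $bytes);   # two characters: U+00C3, U+00A4
my $as_utf8   = decode('UTF-8',      $bytes);   # one character:  U+00E4

printf "ISO-8859-1: %d character(s)\n", length($as_latin1);
printf "UTF-8:      %d character(s)\n", length($as_utf8);

# The reverse direction: one character, different byte counts per encoding.
my $char = "\x{E4}";
for my $enc ('ISO-8859-1', 'UTF-8', 'UTF-16BE', 'UTF-32BE') {
    printf "%-10s %d byte(s)\n", $enc, length(encode($enc, $char));
}
```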

jue

Re: Checking the presence of multibyte char in a string

on 05.01.2006 13:42:48 by flavell

On Thu, 5 Jan 2006, Jürgen Exner wrote:

> Heuristics for determining the encoding are not very successful.

OT here: but Mozilla, given sufficient material to work from, seems to
do a well-above-average job at guessing encodings.

On the other hand, if the O.P has properly handled their external
coding, and got their data into proper Perl Unicode representation
(which was far from clear in the original posting) then, obviously,
the answer is that any character whose ord() is greater than 127 is a
multibyte character in Perl's utf8-based internal representation.

If all the characters in question have ord() values less than 256,
then one would need to work out whether the string has been promoted
to utf8 representation, or is still being held in 8-bit compatibility
representation only.
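
A small sketch of that distinction, assuming a sample string whose only
non-ASCII character is U+00E9. It uses utf8::is_utf8 and utf8::upgrade from
the core, plus bytes::length to peek at the internal storage (which is meant
for debugging only, not for everyday code):

```perl
use strict;
use warnings;
use bytes ();   # loads bytes::length without enabling the "use bytes" pragma

my $narrow = "caf\xE9";   # U+00E9: ord() is 233, i.e. > 127 but < 256
my $wide   = $narrow;
utf8::upgrade($wide);     # same characters, now held internally as UTF-8

for my $s ($narrow, $wide) {
    printf "utf8 flag: %-3s  characters: %d  internal bytes: %d\n",
        (utf8::is_utf8($s) ? 'on' : 'off'), length($s), bytes::length($s);
}

# Both strings hold the same characters, so they still compare equal.
print "equal\n" if $narrow eq $wide;
```

The character count is 4 in both cases; only the internal byte count differs
(4 before the upgrade, 5 after it).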

As usual, I think we need to know what the O.P is *really* trying to
achieve, rather than this only-partially-specified component of what
they suppose the solution to be.

Re: Checking the presence of multibyte char in a string

on 07.01.2006 15:00:29 by klam

To clarify, what I mean is that the string is already in Perl's
internal coding, i.e. utf-8. If so, is there any Perl module that can
determine the presence of wild chars. in the string? Or, do I need to
check the string byte by byte to see if there is any byte that is greater
than 127?

Thank you for any advice.

Re: Checking the presence of multibyte char in a string

on 07.01.2006 16:09:38 by jurgenex

klam wrote:
> To clarify,

To clarify what? Please quote appropriate context -as has been customary for
over two decades- when replying to a previous posting such that your readers
have a chance to know what you are talking about.

> what I mean is that the string is already in Perl's
> internal coding, i.e. utf-8. If so, is there any Perl module that can
> determine the presence of wild chars. in the string?

What is a wild char? Do you mean the '*' character, like in a wild card?

> Or, do I need to check the string byte by byte

That's a contradiction in terms. Either you have a string, which is composed
of a sequence of characters, or you have a sequence of bytes. Only in
single-byte character sets do characters and bytes happen to be the same.
That's why even today many people confuse them.

> to see if there is any byte that is greater
> than 127?

I'm not sure what your intention is. If the second or third byte of a UTF-8
encoded character is less than or equal to 127, then what do you want to do
with that byte?

jue

Re: Checking the presence of multibyte char in a string

on 08.01.2006 04:13:35 by unknown

klam wrote:
> To clarify, what I mean is that the string is already in Perl's
> internal coding, i.e. utf-8. If so, is there any Perl module that can
> determine the presence of wild chars. in the string? Or, do I need to
> check the string byte by byte to see if there is any byte that is greater
> than 127?
>
> Thank you for any advice.
>

Assuming "wild" is a typo for "wide", how about the following "brute
force" method?

m/[^\x00-\x7f]/

Here's the logic:

We can't make use of the fact that the string has been promoted to
UTF-8, because it may not in fact contain any "wide" characters at the
moment. But, since we know that "m" matches CHARACTERS, and that the
first 128 are the same whether or not promotion has taken place, all we
need to do is to figure out whether there are any characters in the
string whose codepoints are not in the range 0 to 7f.
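
As a quick check of that reasoning, here is a minimal sketch. The helper name
has_non_ascii and the sample strings are made up for illustration; the point
is that the match gives the same answer whether or not the string has been
upgraded internally.

```perl
use strict;
use warnings;

# Hypothetical helper: true if any character lies outside U+0000..U+007F.
sub has_non_ascii {
    my ($str) = @_;
    return $str =~ /[^\x00-\x7f]/ ? 1 : 0;
}

my $latin    = "na\xEFve";          # not upgraded; U+00EF is stored as one byte
my $upgraded = $latin;
utf8::upgrade($upgraded);           # same characters, UTF-8 internally
my $euro     = "price: \x{20AC}5";  # contains a character above U+00FF

# Prints 0, 1, 1, 1 (one value per line).
printf "%d\n", has_non_ascii($_) for "plain ASCII", $latin, $upgraded, $euro;
```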

Caveat: I have no idea what will happen on an EBCDIC machine.

If it's really the UTF-8 flag you're interested in, see the Encode
documentation. This also has the address of the perl-unicode mailing list.

Like a number of other posters though, I can't quite see why you need to
know this. Would you enlighten us?

Tom Wyant

Re: Checking the presence of multibyte char in a string

on 08.01.2006 13:57:11 by klam

Hi Tom,

>Assuming "wild" is a typo for "wide", how about the following "brute
> force" method?

> m/[^\x00-\x7f]/

Thank you for the suggestion. The above check works fine for
me. Sorry for the typo of "wild" character. What I meant is "wide"
characters (i.e. MB, multibyte characters).

>Like a number of other posters though, I can't quite see why you need to
> know this. Would you enlighten us?

Why I need to check whether a string contains MB characters: I'm working
on a mobile SMS (short message service) program. If a string contains
purely English (or technically, single-byte) characters, the SMS can
contain 160 characters in normal utf8 format. But if it contains MB
characters, then only 70 chars can be sent at a time, and the string needs
to be encoded in UCS2 format.
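
A rough sketch of that decision, using only the limits quoted above (160 and
70) and the regex suggested earlier in the thread. The helper name sms_plan
and its return structure are hypothetical, and treating "no character above
U+007F" as the single-byte case is an assumption taken from this thread, not
from any SMS specification.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper following the limits quoted in the post:
# 160 characters for single-byte text, 70 characters once UCS-2 is needed.
sub sms_plan {
    my ($text) = @_;
    my $needs_ucs2 = $text =~ /[^\x00-\x7f]/ ? 1 : 0;
    my $limit      = $needs_ucs2 ? 70 : 160;
    return {
        encoding => $needs_ucs2 ? 'UCS-2BE' : 'single-byte',
        limit    => $limit,
        fits     => length($text) <= $limit ? 1 : 0,
        payload  => $needs_ucs2 ? encode('UCS-2BE', $text) : $text,
    };
}

my $plan = sms_plan("Gr\x{FC}\x{DF}e");   # five characters, two of them non-ASCII
printf "%s, limit %d, payload %d bytes\n",
    $plan->{encoding}, $plan->{limit}, length($plan->{payload});
```

Note that length() counts characters, not bytes, which is what the 160/70
limits are expressed in.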

Rgds,
K.Lam

Re: Checking the presence of multibyte char in a string

on 08.01.2006 18:19:50 by nobull67

Jürgen Exner wrote:

> If the second or third byte of a UTF-8
> encoded character is less than or equal to 127, then what do you want to do
> with that byte?

Slay a unicorn? :-)

(UTF-8 uses only bytes in the range 0x80-0xFF to encode code-points
beyond the range U+0000 - U+007F).
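
This is easy to verify with the core Encode module. The sketch below encodes
a few arbitrarily chosen code points above U+007F and reports whether any of
the resulting bytes falls below 0x80.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Encode a few non-ASCII code points and inspect every resulting byte.
for my $cp (0x80, 0xE9, 0x20AC, 0x1F600) {
    my @bytes = unpack 'C*', encode('UTF-8', chr($cp));
    my $low   = grep { $_ < 0x80 } @bytes;   # number of bytes below 0x80
    printf "U+%04X -> %s (%s)\n", $cp,
        join(' ', map { sprintf '%02X', $_ } @bytes),
        $low ? 'contains a byte < 0x80' : 'all bytes >= 0x80';
}
```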