Regex to remove non printable characters

on 22.12.2007 03:54:33 by Larry

Hi peeps,

I'd like to remove all characters with ascii values > 127 from a
string...that's to say i'd like to remove non printable chars...

is the following fine?

my $input =~ s/[^ -~]+//g;

thanks ever so much!

Re: Regex to remove non printable characters

on 22.12.2007 04:19:19 by Glenn Jackman

At 2007-12-21 09:54PM, "Larry" wrote:
> Hi peeps,
>
> I'd like to remove all characters with ascii values > 127 from a
> string...that's to say i'd like to remove non printable chars...

You might want:
$string =~ s/\P{IsPrint}//g;

See perldoc perlre

--
Glenn Jackman
"You can only be young once. But you can always be immature." -- Dave Barry

Re: Regex to remove non printable characters

on 22.12.2007 05:04:08 by jurgenex

On Sat, 22 Dec 2007 03:54:33 +0100, Larry wrote:
> I'd like to remove all characters with ascii values > 127 from a

ASCII is a 7-bit encoding system where the eighth bit is sometimes used as
a parity bit. There are no ASCII characters > 127, therefore your request
doesn't make sense.

>string...that's to say i'd like to remove non printable chars...

In case you are not talking about ASCII but about e.g. Windows-1252 or
ISO-Latin-x or any of the dozen other code pages that share the lower 128
characters with ASCII then please be advised that the vast majority of
those characters > 127 _ARE_ printable, at least in your typical commonly
used code pages.

The non-printable characters can be found in the lower part from 0x00 to
0x1F, no matter if ASCII or Windows-1252 or ISO-Latin-x or many, many
others.

Therefore your request makes even less sense. Maybe you want to clarify
first what you are talking about?

>is the following fine?
>
>my $input =~ s/[^ -~]+//g;

That removes every character outside the range from space (0x20) to tilde
(0x7E): the control characters, DEL, and everything above 0x7E. So it strips
non-printable and non-ASCII characters alike, which are two different things.

jue

Re: Regex to remove non printable characters

on 22.12.2007 05:18:19 by Dummy

Larry wrote:
>
> I'd like to remove all characters with ascii values > 127 from a
> string

$input =~ s/[^[:ascii:]]+//g;


>...that's to say i'd like to remove non printable chars...

$input =~ s/[^[:print:]]+//g;


> is the following fine?
>
> my $input =~ s/[^ -~]+//g;

my() creates a new variable with no contents so there is nothing for the
substitution operator to remove.

$ perl -wle'my $input =~ s/[^ -~]+//g;'
Use of uninitialized value in substitution (s///) at -e line 1.
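
A minimal sketch of the difference between the two classes, assuming a byte string in an already-initialized variable (the sample data and variable names are made up for illustration):

```perl
use strict;
use warnings;

# Hypothetical sample: "caf", 0xE9 (e-acute in Latin-1), TAB, "ok", BEL
my $sample = "caf\xE9\tok\x07";

# Copy first, then substitute on the copy; 'my $x =~ s///' would act on undef.
(my $ascii_only = $sample) =~ s/[^[:ascii:]]+//g;   # drops 0xE9, keeps TAB and BEL

# Drops TAB and BEL. Whether 0xE9 survives depends on the character-set
# rules in effect for the regex (Unicode vs. native 8-bit semantics).
(my $print_only = $sample) =~ s/[^[:print:]]+//g;
```

The first substitution only cares about code points above 127; the second only cares about printability, so the control characters go and the ASCII letters stay.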



John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Re: Regex to remove non printable characters

on 22.12.2007 05:53:18 by Larry

In article ,
"John W. Krahn" wrote:

> $input =~ s/[^[:ascii:]]+//g;
>
>
> >...that's to say i'd like to remove non printable chars...
>
> $input =~ s/[^[:print:]]+//g;

is this fine?

$input =~ tr/\x80-\xFF//d;

Re: Regex to remove non printable characters

on 22.12.2007 11:00:16 by rvtol+news

Larry wrote:
> John W. Krahn:

> [remove non printable chars]
> is this fine?
> $input =~ tr/\x80-\xFF//d;

No. How about chr(0x00)..chr(0x1F)?
And characters > "\x{FF}"?

--
Affijn, Ruud

"Gewoon is een tijger."

Re: Regex to remove non printable characters

on 22.12.2007 13:47:01 by jurgenex

On Sat, 22 Dec 2007 05:53:18 +0100, Larry wrote:

>In article ,
> "John W. Krahn" wrote:
>
>> $input =~ s/[^[:ascii:]]+//g;
>>
>>
>> >...that's to say i'd like to remove non printable chars...
>>
>> $input =~ s/[^[:print:]]+//g;
>
>is this fine?
>
>$input =~ tr/\x80-\xFF//d;

Depends what you are looking for (you still didn't clarify).
It will remove non-ASCII characters in the typical 8-bit encodings.
It will _NOT_ remove non-printable characters.
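
To illustrate, here is a small sketch with a made-up input string; it assumes a single-byte encoding such as ISO-Latin-1:

```perl
use strict;
use warnings;

# Hypothetical input: NUL, "A", BEL, 0xE9 (Latin-1 e-acute), "B"
my $input = "\x00A\x07\xE9B";

# Delete only the characters in the range 0x80..0xFF from a copy.
(my $stripped = $input) =~ tr/\x80-\xFF//d;

# Show the remaining code points; the controls 0x00 and 0x07 are still there.
printf "%vd\n", $stripped;   # prints 0.65.7.66
```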

Maybe you should make up your mind and let us know _which_ of these two
you are actually trying to do.

jue

Re: Regex to remove non printable characters

on 22.12.2007 15:26:56 by Petr Vileta

Larry wrote:
> Hi peeps,
>
> I'd like to remove all characters with ascii values > 127 from
> a string...that's to say i'd like to remove non printable chars...
>
> is the following fine?
>
> my $input =~ s/[^ -~]+//g;
>
> thanks ever so much!
Maybe this will do it:

$input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]//g;

--
Petr Vileta, Czech republic
(My server rejects all messages from Yahoo and Hotmail. Send me your
mail from another non-spammer site please.)

Re: Regex to remove non printable characters

on 23.12.2007 09:23:07 by krahnj

On Sat, 22 Dec 2007 05:53:18 +0100
Larry wrote:

> In article ,
> "John W. Krahn" wrote:
>
> > $input =~ s/[^[:ascii:]]+//g;
> >
> >
> > >...that's to say i'd like to remove non printable chars...
> >
> > $input =~ s/[^[:print:]]+//g;
>
> is this fine?
>
> $input =~ tr/\x80-\xFF//d;

Your subject line says you want a regex. The tr/// operator doesn't use regular expressions.


John
--
Perl isn't a toolbox, but a small machine shop where you
can special-order certain sorts of tools at low cost and
in short order. -- Larry Wall

Re: Regex to remove non printable characters

on 23.12.2007 18:45:07 by jurgenex

"John W. Krahn" wrote:
>Larry wrote:
>> is this fine?
>>
>> $input =~ tr/\x80-\xFF//d;
>
>Your subject line says you want a regex. The tr/// operator doesn't use regular expressions.

Good point. However, if you are splitting hairs, then let's be accurate:
Regular expressions match a string but they never remove anything as
requested by the OP. Therefore taking the OP's question literally is
nonsensical in the first place.

And he still didn't tell us if he wanted to remove non-ASCII or
non-printable, two very different categories which have no relationship with
each other whatsoever.

jue

Re: Regex to remove non printable characters

on 24.12.2007 02:29:45 by Larry

In article ,
Jürgen Exner wrote:

> And he still didn't tell us if he wanted to remove non-ASCII or
> non-printable, two very different categories which have no relationship with
> each other whatsoever.

I have yet to understand the differences...in the meanwhile I think I'll
settle for the following:

tr/\x80-\xFF//d;

thanks

Re: Regex to remove non printable characters

on 24.12.2007 04:52:30 by jurgenex

Larry wrote:

>In article ,
> Jürgen Exner wrote:
>
>> And he still didn't tell us if he wanted to remove non-ASCII or
>> non-printable, two very different categories which have no relationship with
>> each other whatsoever.
>
>I have yet to understand the differences..

Well, there is no commonality at all. They are two totally different things,
like colour and texture. A specific object can be green and smooth or green
and rough or blue and rough or blue and smooth or whatever combination you
can imagine.

Non-printable characters are characters that don't have a glyph assigned to
them and therefore cannot be printed. Another word for them is control
characters and they include e.g. line feed, carriage return, delete,
backspace, end-of-transmission, header start, etc., etc.
In ASCII and most other modern code pages the non-printable characters are
in the range 0x00 to 0x1F and 0x7F.


Non-ASCII characters on the other hand are characters that are not included
in the 7-bit ASCII encoding at all like e.g. symbols, graphics, and what
some people refer to as 'extended' characters like German umlauts, French
and Spanish accented characters, Scandinavian extended characters, but also
Greek, Cyrillic, Arabic, Chinese, ... characters. Basically anything you can
imagine that is not typically used in the English language or that's not on
a US typewriter.
That's not surprising because as the name suggests ASCII is an _AMERICAN_
Standard Code for Information Interchange and Lyndon B. Johnson surely
didn't care about the rest of the world when he mandated its use back in
1968.

For e.g. ISO-Latin-1 those non-ASCII characters would be
Ax  NBSP ¡ ¢ £ ¤ ¥ ¦ § ¨ © ª « ¬ SHY ® ¯
Bx  ° ± ² ³ ´ µ ¶ · ¸ ¹ º » ¼ ½ ¾ ¿
Cx  À Á Â Ã Ä Å Æ Ç È É Ê Ë Ì Í Î Ï
Dx  Ð Ñ Ò Ó Ô Õ Ö × Ø Ù Ú Û Ü Ý Þ ß
Ex  à á â ã ä å æ ç è é ê ë ì í î ï
Fx  ð ñ ò ó ô õ ö ÷ ø ù ú û ü ý þ ÿ

However almost all non-ASCII characters do have a glyph(*2) and obviously
they can be printed very well(*1), just see the list above.
Or do you really think I would just omit the second letter of my first name
'Jürgen' when printing it?

*1: You could argue whether NBSP and in particular SHY are printable or
not because they carry additional semantics on top of their glyphs (blank
and dash, respectively).
*2: There are exceptions in the code pages for more exotic languages
(Arabic, Thai, Tamil, ...), where some characters may not have a glyph
assigned but instead alter the appearance and/or the meaning of
preceding or following characters.

jue

Re: Regex to remove non printable characters

on 24.12.2007 07:55:39 by Larry

In article ,
Jürgen Exner wrote:

> Well, there is no commonality at all. They are two totally different things,
> like colour and texture. A specific object can be green and smooth or green
> and rough or blue and rough or blue and smooth or whatever combination you
> can imagine.

ok...to me those are ascii printable chars:

#!/usr/bin/perl

use strict;
use warnings;

for my $k (33 .. 126)
{
    print "$k => " . chr($k) . "\n";
}

plus chr(10) and chr(13)

Re: Regex to remove non printable characters

on 24.12.2007 08:19:17 by jurgenex

Larry wrote:
>ok...to me those are ascii printable chars:
>
>#!/usr/bin/perl
>
>use strict;
>use warnings;
>
>for my $k (33 .. 126)
>{
> print "$k => " . chr($k) . "\n";
>}

Agreed, those characters are the intersection of the set of printable
characters and the set of ASCII characters, except that commonly the space
character 0x20 is considered a printable character, too. It just has a blank
glyph.

>plus chr(10) and chr(13)

This however conflicts with customary understanding. From "perldoc perlre"
on POSIX character classes:

print
Any alphanumeric or punctuation (special) character or space.

While on the other hand

cntrl
Any control character. Usually characters that don't produce output
as such but instead control the terminal somehow: for example
newline and backspace are control characters. All characters with
ord() less than 32 are most often classified as control characters
(assuming ASCII, the ISO Latin character sets, and Unicode).

It appears LF and CR are control characters, not printable characters. After
all why should LF be a printable character but its cousin FF not?
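
You can check this yourself. A rough sketch, using the POSIX classes quoted above on a handful of ASCII code points (the hash name %class_of is just an illustration):

```perl
use strict;
use warnings;

# Classify a few ASCII characters: LF, FF, CR, space, "A", DEL.
my %class_of;
for my $code (0x0A, 0x0C, 0x0D, 0x20, 0x41, 0x7F) {
    my $char = chr $code;
    $class_of{$code} = $char =~ /\A[[:cntrl:]]\z/ ? 'cntrl'
                     : $char =~ /\A[[:print:]]\z/ ? 'print'
                     :                              'other';
    printf "0x%02X => %s\n", $code, $class_of{$code};
}
```

LF, FF, CR and DEL come out as cntrl; the space and "A" come out as print.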

jue

Re: Regex to remove non printable characters

on 24.12.2007 11:56:23 by Larry

In article ,
Jürgen Exner wrote:

> Agreed, those characters are the intersection of the set of printable
> characters and the set of ASCII characters, except that commonly the space
> character 0x20 is considered a printable character, too. It just has a blank
> glyph.

by the way, I'd like to get rid of 0x00 also! The thing is that I'm
coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
0x10 and those ranging from 0x21 to 0x7E

is that doable?

thanks

Re: Regex to remove non printable characters

on 24.12.2007 12:00:33 by Larry

In article ,
Larry wrote:

> 0x20 0x13
> 0x10 and those ranging from 0x21 to 0x7E

I'm hopeless at hex values...let's say:

chr(10)
chr(13)
chr(32) to chr(126)
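
One way that keep-list could be written, as a sketch only: the sub name keep_text_chars and the sample string are hypothetical, and tr/// is used here instead of a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: keep LF (10), CR (13) and 32..126, delete the rest.
sub keep_text_chars {
    my ($text) = @_;
    # /c complements the listed characters, /d deletes whatever matches
    # that complement, so only the listed characters survive.
    $text =~ tr/\x0A\x0D\x20-\x7E//cd;
    return $text;
}

# 0xEF, NUL, DEL and TAB are removed; CR and LF are kept.
my $cleaned = keep_text_chars("line 1\r\nna\xEFve\x00\x7F\tend");
```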

thanks

Re: Regex to remove non printable characters

on 24.12.2007 12:11:53 by Larry

In article ,
Larry wrote:

> I'm hopeless at hex values...let's say:
>
> chr(10)
> chr(13)
> chr(32) to chr(126)
>
> thanks

well, for the moment I'll go along with keeping those ranging from 0x20
to 0x7E ... so that I don't have to chomp and all...

Re: Regex to remove non printable characters

on 24.12.2007 20:04:35 by jurgenex

Larry wrote:
> Jürgen Exner wrote:
> The thing is that I'm
>coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
>0x10 and those ranging from 0x21 to 0x7E

Thank you for calling me a person with a bad char.

*PLONK*

jue

Re: Regex to remove non printable characters

on 24.12.2007 20:16:36 by jurgenex

Larry wrote:

>In article ,
> Larry wrote:
>
>> I'm hopeless at hex values...let's say:
>>
>> chr(10)
>> chr(13)
>> chr(32) to chr(126)
>>
>> thanks
>
>well, for the moment I'll go along with keeping those ranging from 0x20
>to 0x7E ... so that I don't have to chomp and all...

What a concept!
I am giving up.

jue

Re: Regex to remove non printable characters

on 24.12.2007 20:33:22 by Larry

In article ,
Jürgen Exner wrote:

> What a concept!
> I am giving up.

please don't! it's xmas time after all...

i need this to get values (commands) from CGI->param and need to get rid
of those chars

Re: Regex to remove non printable characters

on 25.12.2007 01:39:26 by Larry

In article ,
"Petr Vileta" wrote:

> $input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x80-\xFF]//g;

thank you so much ... btw, what is chr (127) ??

I think I'll make it this way:

$input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]//g;
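
A sketch of how that could sit in a sub, next to a whitelist variant for comparison. The sub names and the sample string are made up; the point is that a blacklist ending at \xFF lets characters above 0xFF through, while a negated keep-list does not.

```perl
use strict;
use warnings;

# Hypothetical sub name; wraps the blacklist regex shown above.
sub strip_bad_chars {
    my ($input) = @_;
    $input =~ s/[\x00-\x09\x0B\x0C\x0E-\x1F\x7F-\xFF]//g;
    return $input;
}

# Whitelist variant: delete everything that is not LF, CR or 0x20..0x7E.
sub keep_good_chars {
    my ($input) = @_;
    $input =~ s/[^\x0A\x0D\x20-\x7E]//g;
    return $input;
}

# Sample with a NUL and a character above 0xFF (U+263A).
my $sample = "ls\x00 -l\x{263A}\n";
my $black  = strip_bad_chars($sample);   # NUL removed, U+263A survives
my $white  = keep_good_chars($sample);   # only "ls -l\n" is left
```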

thanks

Re: Regex to remove non printable characters

on 25.12.2007 06:47:19 by Charles DeRykus

On Dec 24, 11:04 am, Jürgen Exner wrote:
> Larry wrote:
> > Jürgen Exner wrote:
> > The thing is that I'm
> >coding a _strip bad chars_ sub and I would like to keep only 0x20 0x13
> >0x10 and those ranging from 0x21 to 0x7E
>
> Thank you for calling me a person with a bad char.
>
> *PLONK*
>
Wow, I thought for sure you'd finish with a
smiley after that wonderful flash of wit....
Of course, maybe you were sitting in a bad
"char" :)

--
Charles DeRykus