Removing octal characters from a file

on 07.09.2007 18:29:39 by paintedjazz

Is there a way to remove octal characters e.g. \302\271 or
\342\204\242 using perl or sed or awk. What I would prefer to do is
remove all globally with one command. I'm not sure how I would enter
them as a range or even if that's even possible.

If it's not asking too much, is there also a way to incorporate one
exception into this to replace (rather than remove) octal chars used
by Microsoft instead of a simple apostrophe. Thanks a bunch for any
help.

Re: Removing octal characters from a file

on 07.09.2007 19:43:45 by Icarus Sparry

On Fri, 07 Sep 2007 16:29:39 +0000, paintedjazz wrote:

> Is there a way to remove octal characters e.g. \302\271 or \342\204\242
> using perl or sed or awk. What I would prefer to do is remove all
> globally with one command. I'm not sure how I would enter them as a
> range or even if that's even possible.
>
> If it's not asking too much, is there also a way to incorporate one
> exception into this to replace (rather than remove) octal chars used by
> Microsoft instead of a simple apostrophe. Thanks a bunch for any help.

With perl it is easy; the only question is which characters to keep.
Here I keep \010 (backspace), \011 (tab) and \012 (linefeed, used as
newline by Unix) from the control characters, plus \040 (space) through
\176 (~). Everything else is deleted.

perl -pi.bak -e 's/[\000-\007\013-\037\177-\377]//g;' filename

If you know which characters Microsoft uses, then you can certainly use

perl -pi.bak -e 's/MQC/'\''/g; s/[\000-\007\013-\037\177-\377]//g' filename

where MQC is the Microsoft code for a quote.

Re: Removing octal characters from a file

on 07.09.2007 19:49:15 by Janis Papanagnou

paintedjazz@gmail.com wrote:
> Is there a way to remove octal characters

What do you think "octal characters" are?

> e.g. \302\271 or
> \342\204\242 using perl or sed or awk.

The above are just character strings: escaped representations of 8-bit
values, written with octal digits.

Do you want to remove the character that may be displayed as "\302" or
do you want to remove the four-character-sequence '\', '3', '0', '2'
from your data?
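
One way to tell the two cases apart, as a rough sketch: count the bytes, or dump them with od -c. The sample strings and the variable names raw_len and literal_len are invented for the example.

```shell
# Case 1: raw bytes. Four bytes in total: x, \302, \271, newline.
raw_len=$(printf 'x\302\271\n' | wc -c)
# Case 2: literal text. The doubled backslashes make printf emit a real "\",
# so the data holds the characters \ 3 0 2 \ 2 7 1 (ten bytes with x and newline).
literal_len=$(printf 'x\\302\\271\n' | wc -c)
echo "raw: $raw_len bytes, literal: $literal_len bytes"

# od -c makes the difference visible: the raw case shows the cells 302 and 271,
# the literal case shows \, 3, 0, 2 as separate characters.
printf 'x\302\271\n' | od -c
printf 'x\\302\\271\n' | od -c
```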

> What I would prefer to do is
> remove all globally with one command. I'm not sure how I would enter
> them as a range or even if that's even possible.

If you just have character ranges to remove, then use tr -d.
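
For example, a sketch that deletes every byte above the ASCII range. The range \200-\377 is my choice for illustration, not something the original poster specified, and strip_high_bytes is a made-up name. LC_ALL=C is set so that tr works on single bytes regardless of the locale.

```shell
# Hypothetical helper: delete all bytes from \200 to \377 (everything non-ASCII).
strip_high_bytes() {
  LC_ALL=C tr -d '\200-\377'
}

printf 'x\302\271 TM\342\204\242\n' | strip_high_bytes
# prints: x TM
```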

But if you have the character string representation as given above, it
might be easier to do it in two steps: first transform valid sequences
of "\[0-3][0-7][0-7]" into the respective character, then use tr -d
with a range of characters (possibly also specified in octal) to delete.

If you explain your task more clearly, we can help you in more detail.

> If it's not asking too much, is there also a way to incorporate one
> exception into this to replace (rather than remove) octal chars used
> by Microsoft instead of a simple apostrophe. Thanks a bunch for any
> help.

What do you mean by "octal chars used by Microsoft"?

Again, if it's just the characters then use tr (in this case without
option -d) as in, for example, tr \' \" or tr A-Z a-z
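
A sketch of the replacing variant. Since tr maps single bytes, this only fits the apostrophe if the file is Windows-1252, where the curly apostrophe is the one byte \222; that encoding is an assumption on my part. A three-byte UTF-8 apostrophe cannot be mapped this way. The name fix_apostrophe is made up.

```shell
# Hypothetical helper: map the Windows-1252 curly apostrophe (\222) to a plain one.
fix_apostrophe() {
  LC_ALL=C tr '\222' "'"
}

printf 'It\222s\n' | fix_apostrophe
# prints: It's

# The case-folding example from the text:
printf 'MiXeD\n' | tr A-Z a-z
# prints: mixed
```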

If you choose to use awk you may use the same function, gsub(), for both
replacing and removing.
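
Something along these lines, as a sketch only. It assumes UTF-8 data (so the curly apostrophe is \342\200\231), an awk that accepts octal escapes inside regular expressions (gawk does), and that everything else above ASCII should go. The name clean_awk is made up.

```shell
# Hypothetical helper: one gsub() replaces, the other removes.
clean_awk() {
  LC_ALL=C awk '{
    gsub(/\342\200\231/, "\047")   # replace: UTF-8 curly apostrophe becomes \047
    gsub(/[\200-\377]/, "")        # remove: every remaining byte above ASCII
    print
  }'
}

printf 'It\342\200\231s TM\342\204\242\n' | clean_awk
# prints: It's TM
```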

Janis

Re: Removing octal characters from a file

on 07.09.2007 21:39:48 by Cyrus Kriticos

paintedjazz@gmail.com wrote:
> Is there a way to remove octal characters e.g. \302\271 or
> \342\204\242 using perl or sed or awk. What I would prefer to do is
> remove all globally with one command.

$ echo 'ab\cdd23444\342333\204\242' | sed 's/\\[0-3][0-7][0-7]//g'
ab\cdd23444333
