regexp to "clean" a text file

on 01.10.2008 18:12:52 by Alejandro Santillan

I am trying to clean up a file (file.txt) which was part of a database and comes with a lot of strange characters mixed in with the text. The output file is file2.txt.

I would like to read the file into a variable called $text, create a regular expression that cleans it, and write the output file without the non-readable chars.

$text is the variable that contains the text to be cleaned.

This is the content of the input file called file.txt:

^E ^E ^E ÿÝ.Anodizado
^@ Ultima actualización: 06-Mar-2004 ^@http://www.kr2-egb.com.ar/anodizado.htm^@
^@¿Que es el anodizado?
^@Cuando escuchamos este termino, lo primero que se nos cruza por la cabeza es el coloreado del aluminio, pues algo de eso tiene, pero en si el proceso de anodizado es una forma de proteger el aluminio contra de los agentes atmosféricos. Luego del extruído o decapado, este material entra en contacto con el aire y forma por si solo una delgada película de oxido con un espesor mas o menos regular de 0,01 micrones denominada oxido de aluminio, esta tiene algunas mínimas propiedades protectoras.^@Bien, el proceso de anodizado consiste en obtener de manera artificial películas de oxido de mucho mas espesor y con mejores características de protección que las capas naturales, estas se obtienen mediante procesos químicos y electrolíticos. Artificialmente se pueden obtener películas en las que el espesor es de 25/30 micrones en el tratamiento de protección o decoración y de casi 100 micrones con el procedimiento de endurecimiento superficial (Anodizado Duro).

The script I use to read the file is:

#!/usr/bin/perl

open(FILE,"<file.txt") or die;
@text=<FILE>;
close(FILE);

$text= join "",@text;
#$text=~s/![a-zA-Z][0-9]//g; #not used so far
print $text;

open(FILE,">file2.txt") or die;
print FILE $text;
close(FILE);


When I print the $text variable to STDOUT, the output replaces the type (1) chars, for example ñ, á, é, í, ó, ú, with a question mark, as follows:

??.Anodizado
Ultima actualizaci?n: 06-Mar-2004 http://www.kr2-egb.com.ar/anodizado.htm
?Que es el anodizado?
Cuando escuchamos este termino, lo primero que se nos cruza por la cabeza es el coloreado del aluminio, pues algo de eso tiene, pero en si el proceso de anodizado es una forma de proteger el aluminio contra de los agentes atmosf?ricos. Luego del extru?do o decapado, este material entra en contacto con el aire y forma por si solo una delgada pel?cula de oxido con un espesor mas o menos regular de 0,01 micrones denominada oxido de aluminio, esta tiene algunas m?nimas propiedades protectoras.Bien, el proceso de anodizado consiste en obtener de manera artificial pel?culas de oxido de mucho mas espesor y con mejores caracter?sticas de protecci?n que las capas naturales, estas se obtienen mediante procesos qu?micos y electrol?ticos. Artificialmente se pueden obtener pel?culas en las que el espesor es de 25/30 micrones en el tratamiento de protecci?n o decoraci?n y de casi 100 micrones con el procedimiento de endurecimiento superficial (Anodizado Duro).

And it also eliminates the type (2) chars, such as ^E ^E ^E ÿÝ ^@.

That is what I want: to eliminate the type (2) chars but keep the type (1) chars (ñ, á, é, í, ó, ú). However, when I read the contents of file2.txt, nothing has changed with respect to file.txt: the two files are identical.

What's the way to do it?



_______________________________________________
ActivePerl mailing list
ActivePerl@listserv.ActiveState.com
To unsubscribe: http://listserv.ActiveState.com/mailman/mysubs

Re: regexp to "clean" a text file

on 01.10.2008 22:44:44 by Williamawalters


hi anonymous --

the first thing to do is to be very clear about exactly what the characters
are that you are trying to eliminate -- and those you are trying to keep!

you do not say what character set you are dealing with -- ascii, utf8, utf16,
etc. it would be nice to know this also.

it would also be nice to know the operating system and perl version you are
working with.

one way to find out about actual characters is to use a hex dump utility
of some kind. is what displays in my e-mail as ``^E'' (caret-E) really a
caret character followed by an upper-case E character, or is it a
control-E (ascii 0x05 ``ENQ'')? likewise, is ^@ (caret-@) a control-@
(ascii 0x00 ``NUL'') character? what about all the whitespace that
surrounds these characters in my e-mail: is that really there?

another important step is to familiarize yourself with regex format --
perlre, perlretut and perlrequick are important here.

one quick point is that the regex s/![a-zA-Z][0-9]//g does
not negate the character classes that follow it: the ``!'' character is not
special in a regex, it is literally a ``!'', an exclamation mark. you might
want something like s/[^a-zA-Z0-9]//g instead -- however, this will also
delete the accented characters you say you want to keep.

if you just want to eliminate ascii control characters, the regex
s/[\x00-\x1f]//g would, i think, do the trick. try something like

perl -i.bak -lpe "s/[\x00-\x1f]//g" input.file

on a COPY (and in a separate directory) of the file you are trying to
fix. (i am assuming you are running windows.)

hth -- bill walters
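
A minimal sketch of the same idea in script form, assuming the text is handled as a plain byte string. One thing to watch: the one-liner works line by line and -l puts the newline back, but when a whole file is slurped into one variable, [\x00-\x1f] also matches tab (0x09) and newline (0x0a). The variant below leaves those two alone. strip_controls is a hypothetical name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: delete ASCII control characters (0x00-0x1f and 0x7f)
# but keep tab (0x09) and newline (0x0a) so the line structure survives.
# Carriage return (0x0d) is deleted along with the rest.
sub strip_controls {
    my ($text) = @_;
    $text =~ tr/\x00-\x08\x0b-\x1f\x7f//d;
    return $text;
}

# Assumed sample: a few of the control bytes from the thread plus one
# accented Latin-1 letter (0xf3), which must survive.
my $sample = "\x05 \x05\x00Ultima actualizaci\xf3n\x00\n";
print strip_controls($sample);
```

Bytes above 0x7f, which is where the Latin-1 accented letters live, are not touched by this range.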






Re: regexp to "clean" a text file

on 02.10.2008 20:45:03 by Alejandro Santillan


Thank you William, Bill and Tim. Finally s/[\x00-\x1f]//g did the trick, almost perfectly.
The original file is the Palm database of memo pads. The text is there, in plain form, with several control characters mixed in.
The system I am working on is a Fedora Linux box. I have no hex utility installed to make the dump, so I don't know if the ^E is really a ^E.
Anyway, it went away after executing:

open(FILE,"<046.txt") or die;
@text=<FILE>;
close(FILE);

$text= join "",@text;

$text=~s/[\x00-\x1f]//g;
print $text;

open(FILE,">file2.txt") or die;
print FILE $text;
close(FILE);

file2.txt now reads almost perfectly (I added the -------------- for clarity):
-------------------------------------------
ÿÝ.Anodizado
Ultima actualización: 06-Mar-2004 http://www.kr2-egb.com.ar/anodizado.htm
¿Que es el anodizado?
Cuando escuchamos este termino, lo primero que se nos cruza por la cabeza es el coloreado del aluminio, pues algo de eso tiene, pero en si el proceso de anodizado es una forma de proteger el aluminio contra de los agentes atmosféricos. Luego del extruído o decapado, este material entra en contacto con el aire y forma por si solo una delg..
Some more plain spanish text here ...Volver al inicio
:ð, . <84> .
-------------------------------------------

Except for those chars ÿÝ at the beginning and the :ð, . <84> at the end, there are no bad chars in between.
I added this:
$text=~s/ÿÝ//g;
$text=~s/Ý//g;
$text=~s/ÿ//g;
$text=~s/<84>//g;
But nothing happened. In particular, in my vi editor the char <84> appears in blue, whereas the rest is black.

Any idea what to do with them?

Thank you again. Very helpful so far.

Alejandro
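
A likely reason the last substitution did nothing: vi usually draws an unprintable byte as its hex value in angle brackets, so the blue <84> is most likely a single byte 0x84, while s/<84>//g looks for the four characters '<', '8', '4', '>'. The literal ÿ and Ý patterns also depend on the encoding the script itself was saved in. A minimal sketch that matches by byte value instead, assuming the stray bytes are 0xff, 0xdd, 0xf0 and 0x84 (only a hex dump can confirm that):

```perl
use strict;
use warnings;

# Assumed sample: the leftover bytes described above, written as escapes
# so the result does not depend on the encoding of this source file.
my $text = "\xff\xdd.Anodizado\n:\xf0, . \x84 .\n";

# Match by byte value rather than by how an editor happens to draw the byte.
(my $clean = $text) =~ s/[\xff\xdd\xf0\x84]//g;

print $clean;
```

The character class removes all four bytes in one pass, so no separate s/// per byte is needed.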




RE: regexp to "clean" a text file

on 03.10.2008 13:33:52 by Brian Raven

From: activeperl-bounces@listserv.ActiveState.com
[mailto:activeperl-bounces@listserv.ActiveState.com] On Behalf Of
Alejandro Santillan Iturres
Sent: 02 October 2008 19:45
To: activeperl@listserv.ActiveState.com
Cc: Williamawalters@aol.com
Subject: Re: regexp to "clean" a text file

> Thank you William, Bill and Tim. Finally s/[\x00-\x1f]//g did the
> trick, almost perfect.
> The original file is the palm database of memo pads. The text is
> there, plain. Several mixed control characters were present.
> The system I am working on is a Fedora linux box. I have no hex utility
> installed to make the dump, so I don't know if the ^E is really a ^E.

I find that a little hard to believe. Try 'hexdump', or if that isn't
present you should at least have 'od'. If neither of them is installed,
your Linux installation sounds a bit broken. Unless you can identify
which characters are to be kept or discarded, you will find it difficult
to 'clean' your data effectively.

HTH

--
Brian Raven
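
If reading a raw dump by eye is tedious, the identification step can also be sketched in Perl itself: count every byte that is not printable ASCII, newline or tab, and print each one in hex and octal. This is only an illustration under that assumption; tally_suspect_bytes is a hypothetical name and the sample string is made up from bytes mentioned in the thread.

```perl
use strict;
use warnings;

# Hypothetical helper: count every byte that is not printable ASCII
# (0x20-0x7e), newline or tab. Returns a hash ref: byte value => count.
sub tally_suspect_bytes {
    my ($bytes) = @_;
    my %count;
    $count{ ord($1) }++ while $bytes =~ /([^\x20-\x7e\n\t])/g;
    return \%count;
}

# Assumed sample data; a real run would read the file in raw mode instead.
my $sample = "\x05 \xff\xdd.Anodizado\n\x00actualizaci\xf3n\x00\n";
my $count  = tally_suspect_bytes($sample);

# One line per suspect byte, lowest value first.
printf "0x%02x (octal %03o): %d\n", $_, $_, $count->{$_}
    for sort { $a <=> $b } keys %$count;
```

The resulting list is a convenient place to decide, byte by byte, what is an accented letter worth keeping and what is debris.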


Re: regexp to "clean" a text file

on 05.10.2008 01:53:57 by Alejandro Santillan

Yes, od and hexdump are present on the system.

I ran hexdump on file2.txt, which has the following contents:

ÿÝ.Anodizado
Ultima actualización: 06-Mar-2004 http://www.kr2-egb.com.ar/anodizado.htm
¿Que es el anodizado?
Pagina creada en noviembre del 2003 « Volver al inicio
:ð, . <84> .


and the dump was:

0000000 2020 2020 2020 2020 2020 2020 2020 2020
0000010 2020 2020 2020 2020 2020 2020 2020 ddff
0000020 412e 6f6e 6964 617a 6f64 200a 2020 5520
0000030 746c 6d69 2061 6361 7574 6c61 7a69 6361
0000040 f369 3a6e 3020 2d36 614d 2d72 3032 3430
0000050 2020 2020 6820 7474 3a70 2f2f 7777 2e77
0000060 726b 2d32 6765 2e62 6f63 2e6d 7261 612f
0000070 6f6e 6964 617a 6f64 682e 6d74 51bf 6575
0000080 6520 2073 6c65 6120 6f6e 6964 617a 6f64
0000090 0a3f 6150 6967 616e 6320 6572 6461 2061
00000a0 6e65 6e20 766f 6569 626d 6572 6420 6c65
00000b0 3220 3030 2033 2020 ab20 5620 6c6f 6576
00000c0 2072 6c61 6920 696e 6963 206f 693c 646e
00000d0 7865 682e 6d74 203e 2020 2020 2020 3a20
00000e0 2cf0 2020 2020 202e 2020 2020 2020 2020
00000f0 2020 8420 2020 2020 2e20 2020 2020 2020
0000100 2020 2020 2020 2020 2020 000a
000010b

Is this helpful?

I reduced the file a bit more, so that it contains just the chars I want to erase. Now it is:

ÿÝ.Anodizado
:ð, . <84> .


and the corresponding hex dump is:

0000000 2020 2020 2020 2020 2020 2020 2020 2020
0000010 2020 2020 2020 2020 2020 2020 2020 ddff
0000020 412e 6f6e 6964 617a 6f64 200a 2020 2020
0000030 2020 3a20 2cf0 2020 2020 202e 2020 2020
0000040 2020 2020 2020 8420 2020 2020 2e20 2020
0000050 2020 2020 2020 2020 2020 2020 2020 000a
000005f

And this is what remains if I erase the word "Anodizado":

0000000 2020 2020 2020 2020 2020 2020 2020 2020
0000010 2020 2020 2020 2020 2020 2020 ff20 2edd
0000020 200a 2020 2020 2020 3a20 2cf0 2020 2020
0000030 202e 2020 2020 2020 2020 2020 3c20 3438
0000040 203e 2020 2020 0a2e
0000048

I've tried also this:
hexdump -c file2.txt

And the result was:
0000000
0000010 377 335 . \n
0000020 : 360 , .
0000030 < 8 4 > .
0000040 \n
0000041

So I did:
$text=~s/\377//g;
$text=~s/\335//g;
$text=~s/\360//g;
$text=~s/\204//g;

And this cleaned a bit more. Any suggestions?







Re: regexp to "clean" a text file

on 05.10.2008 02:22:56 by Alejandro Santillan

I've tried this wonderful command:
hexdump -c file.txt
and I found that I have to include more and more chars to erase, as
follows:

$text=~s/\177//g;
$text=~s/\377//g;
$text=~s/\335//g;
$text=~s/\360//g;
$text=~s/\204//g;
$text=~s/\222/\n/g;
$text=~s/\214//g;
$text=~s/\216//g;
$text=~s/\224//g;
$text=~s/\240//g;
$text=~s/\237//g;
$text=~s/\234//g;
$text=~s/\325//g;
$text=~s/\351//g;
$text=~s/\352//g;
$text=~s/\355//g;
$text=~s/\361//g;
$text=~s/\362//g;
$text=~s/\366//g;

Is there a way to erase all the chars that are higher than, say, octal 300?
Does this make sense?

Thank you again!


Re: regexp to "clean" a text file

on 05.10.2008 05:16:30 by Bill Luebkert

Alejandro Santillan Iturres wrote:
> Yesss, od and hexdump are present on the system.
>
> I did hexdump file2.txt, where file2.txt has the following contents:
>
> ÿÝ.Anodizado
> Ultima actualización: 06-Mar-2004 http://www.kr2-egb.com.ar/anodizado.htm
> ¿Que es el anodizado?
> Pagina creada en noviembre del 2003 « Volver al inicio
> :ð, . <84> .
>
> and the dump was:
>
> 0000000 2020 2020 2020 2020 2020 2020 2020 2020
> 0000010 2020 2020 2020 2020 2020 2020 2020 ddff
> 0000020 412e 6f6e 6964 617a 6f64 200a 2020 5520
> 0000030 746c 6d69 2061 6361 7574 6c61 7a69 6361
> 0000040 f369 3a6e 3020 2d36 614d 2d72 3032 3430
> 0000050 2020 2020 6820 7474 3a70 2f2f 7777 2e77
> ....
>
> Is this helpful?

I still think what you should do is make an array of 256 characters,
index the array for each incoming character, and replace the ones that
need replacing with a space or a binary 0, then delete the binary
0's afterwards with a tr/\x00//d - basically a translate table and a delete.

Otherwise you have a whole bunch of s/// to do on each line, which could
get expensive. You could probably construct a $line =~ tr/$from/$to/
to do the job also - where $from has the characters you want to replace
and $to has the replacement characters (tr/// does not interpolate
variables, so building it at run time needs an eval).

> So I did:
> $text=~s/\377//g;
> $text=~s/\335//g;
> $text=~s/\360//g;
> $text=~s/\204//g;
>
> And this cleaned a bit more. Any suggestions?

Here's some sample ideas - you may be able to speed it up by
benchmarking a few other ways:

use strict;
use warnings;

my @TT = ();   # translate table
for (my $ii = 0; $ii < 256; ++$ii) {
    $TT[$ii] = chr $ii;   # populate with default (identity mapping)
}

# sample line (control and stray bytes written as escapes)
my $line1 = "\x05 \x05 \x05 \xff\xdd.Anodizado\n\x01\x00 Ultima\n";

# characters to substitute/delete
my %h = (
    chr (5)    => ' ',     # space
    chr (1)    => ' ',     # space
    chr (0xdd) => chr 0,   # delete
    chr (0xff) => chr 0,   # delete
);

# modify translate table with the %h hash substitutes
foreach my $key (keys %h) {
    $TT[ord $key] = $h{$key};
#   printf "\$key=%c(0x%02x) => 0x%02x\n", ord ($key), ord ($key),
#     ord $h{$key};
}

my @lines = ($line1);

# do for each line

foreach my $line (@lines) {   # do a line at a time

    print "line before='$line'\n";

    my $len = length $line;
    for (my $ii = 0; $ii < $len; ++$ii) {
        # translate each character
        my $ch = substr $line, $ii, 1;
        my $ci = ord $ch;
        substr $line, $ii, 1, $TT[$ci];
    }

    $line =~ tr/\x00//d;   # drop \x00's

    print "line after='$line'\n";
}

__END__
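
When the set of bytes is fixed and known in advance, the same translate-and-delete can be written directly with tr///, whose table is built at compile time. A minimal sketch under that assumption, using the same mapping as the %h table above (0x05 and 0x01 become a space, 0xdd and 0xff are deleted, and NUL is dropped); translate_line is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical helper using the mapping from the %h table:
# 0x05 and 0x01 become a space; 0xdd, 0xff and NUL are deleted.
sub translate_line {
    my ($line) = @_;
    $line =~ tr/\x05\x01/ /;       # short replacement list: its last char repeats
    $line =~ tr/\x00\xdd\xff//d;   # /d deletes bytes that have no replacement
    return $line;
}

my $line = "\x05 \x05 \x05 \xff\xdd.Anodizado\n\x01\x00 Ultima\n";
print "line after='", translate_line($line), "'\n";
```

This trades the flexibility of a run-time table for two fixed tr/// passes per line.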

Re: regexp to "clean" a text file

on 07.10.2008 12:27:28 by fzarabozo

Hello Alejandro,

Does something like this work for you?

use encoding 'latin1';
$text =~ s/[^\w\s\n\.\¿\?\¡\!\:\/\\\<\>]//ig;
$text =~ s/[\ä\ë\ï\ö\ÿ\ý\ð]//ig;

Please let me know.

HTH

Paco Zarabozo



RE: regexp to "clean" a text file

on 07.10.2008 13:37:05 by Brian Raven

Alejandro Santillan Iturres <> wrote:
> I've tried this wonderful command:
> hexdump -c file.txt
> and I found that I have to include more and more chars to erase as
> follows:
>
> $text=~s/\177//g;
> $text=~s/\377//g;
> $text=~s/\335//g;
> $text=~s/\360//g;
> $text=~s/\204//g;
> $text=~s/\222/\n/g;
> $text=~s/\214//g;
> $text=~s/\216//g;
> $text=~s/\224//g;
> $text=~s/\240//g;
> $text=~s/\237//g;
> $text=~s/\234//g;
> $text=~s/\325//g;
> $text=~s/\351//g;
> $text=~s/\352//g;
> $text=~s/\355//g;
> $text=~s/\361//g;
> $text=~s/\362//g;
> $text=~s/\366//g;
>
> Is there a way to erase all the chars that are higher than, say, 300?
> Does this make sense?

Try:

$text =~ tr/\300-\377//d;

or with control characters (i.e. less than space) as well:

$text =~ tr/\000-\037\300-\377//d;

HTH

--
Brian Raven
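
Three caveats with these ranges. Octal \300-\377 is 0xc0-0xff, which also covers the accented letters the original question wanted to keep (á, é, í, ñ, ó, ú are \341, \351, \355, \361, \363, \372). \000-\037 includes newline (\012). And bytes such as \204 fall below \300 and are not touched. A minimal sketch of a keep-list approach instead, assuming the text is ISO-8859-1 and that the characters listed in $keep are the only non-ASCII ones worth keeping; clean_spanish and $keep are hypothetical names.

```perl
use strict;
use warnings;

# Assumption: ISO-8859-1 text. These escapes are expanded by the regex
# engine when $keep is interpolated into the character class below.
my $keep = '\xa1\xab\xbb\xbf'               # inverted !, guillemets, inverted ?
         . '\xc1\xc9\xcd\xd1\xd3\xda\xdc'   # upper-case accented letters, N tilde, U diaeresis
         . '\xe1\xe9\xed\xf1\xf3\xfa\xfc';  # the same letters in lower case

# Hypothetical helper: delete every byte that is not printable ASCII,
# newline, tab, or one of the kept Latin-1 characters.
sub clean_spanish {
    my ($text) = @_;
    $text =~ s/[^\x20-\x7e\n\t$keep]//g;
    return $text;
}

my $sample = "\xff\xdd.Anodizado\n\x00actualizaci\xf3n \x84\xbfQue?\n";
print clean_spanish($sample);
```

Listing what to keep means a newly discovered stray byte is removed without having to extend the pattern.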
