File handling and regex

on 05.11.2007 17:15:40 by lucavilla

Hi all!

I need help with Perl under Windows command-line to solve the
following task:

I have many disordered txt files and subdirectories under the root
directory "c:\dir", like this:
c:\dir\foobar.txt
c:\dir\popo.txt
c:\dir\sub1\agsds.txt
c:\dir\sub1\popo.txt
c:\dir\sub2\hghghg.txt
c:\dir\sub2\subbb\abc.txt

These txt files are of three types:
type1: those that contain a string definable by the regular expression
"abc[0-9]+def"
type2: those that contain a string definable by the regular expression
"lmn[0-9]+opq"
type3: those that contain a string definable by the regular expression
"rst[0-9]+uvw"

I would like a Perl script, run from the Windows command line, to copy
all these txt files into a single directory "c:\output". Each file name
should consist of the number found in the regex match (the "[0-9]+"
part of the regex) plus a "-type1.txt", "-type2.txt" or "-type3.txt"
suffix, depending on which of the three regexes above matches the file,
giving a result like this:
c:\output\15-type2.txt
c:\output\102-type1.txt
c:\output\33-type1.txt
c:\output\49-type3.txt
c:\output\4-type1.txt
c:\output\335-type2.txt
c:\output\32-type3.txt

How can I do it?

Re: File handling and regex

on 05.11.2007 18:52:28 by krahnj

Luca Villa wrote:
> How can I do it?

*UNTESTED* YMMV :-)

#!/usr/bin/perl
use warnings;
use strict;
use File::Find;
use File::Copy;

my $from = 'c:/dir';
my $to = 'c:/output';

my %trans = qw(
abc(\d+)def type1
lmn(\d+)opq type2
rst(\d+)uvw type3
);

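find sub {
    # Skip anything that cannot be opened or is not a plain file.
    return unless open my $fh, '<', $_;
    return unless -f $fh;
    # Read the whole file in one go.
    read $fh, my $data, -s _;
    close $fh;
    for my $pat ( keys %trans ) {
        next unless $data =~ $pat;
        # $1 holds the digits captured by the pattern that matched.
        copy $File::Find::name, "$to/$1-$trans{$pat}.txt";
        last;
    }
}, $from;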

__END__
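
A small variation on the pattern table, in case a file could match more than one pattern: hash keys come back in no particular order, so an ordered list makes the winner predictable. This is only a sketch; the helper name classify() and the array @rules are made up for illustration and are not part of the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Ordered list of [compiled pattern, type label]; the order decides which
# type wins when the data matches more than one pattern.
my @rules = (
    [ qr/abc(\d+)def/, 'type1' ],
    [ qr/lmn(\d+)opq/, 'type2' ],
    [ qr/rst(\d+)uvw/, 'type3' ],
);

# classify() is a hypothetical helper: it returns the output file name
# (for example "102-type1.txt"), or nothing if no pattern matches.
sub classify {
    my ($data) = @_;
    for my $rule (@rules) {
        my ( $re, $type ) = @$rule;
        return "$1-$type.txt" if $data =~ $re;
    }
    return;
}

print classify("xx abc102def yy"), "\n";    # prints 102-type1.txt
```

Inside the find callback, the returned name could then be appended to "$to/" before calling copy.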



John
--
use Perl;
program
fulfillment

Re: File handling and regex

on 06.11.2007 12:27:40 by jordilin

On Nov 5, 5:52 pm, "John W. Krahn" wrote:
> find sub {
> return unless open my $fh, '<', $_;
> return unless -f $fh;
> read $fh, my $data, -s _;
> close $fh;
> for my $pat ( keys %trans ) {
> next unless $data =~ $pat;
> copy $File::Find::name, "$to/$1-$trans{$pat}.txt";
> last;
> }
> }, $from;

One question: where you write
read $fh, my $data, -s _;
shouldn't it be
read $fh, my $data, -s $_;

I have searched the web without success. I don't know whether _
equals $_ in this particular case.
Best regards,
jordi

Re: File handling and regex

on 06.11.2007 12:48:39 by Joe Smith

jordilin wrote:

> where you write
> read $fh, my $data, -s _;
> shouldn't it be
> read $fh, my $data, -s $_;
>
> I have searched the web without success. I don't know whether _
> equals $_ in this particular case.

Not exactly. The bare underscore is a special filehandle that reuses the
stat information from the most recent file test, so here it gives the
same size that -s $_ would.

The "unless -f $fh;" looks at the file behind $fh and gets info about
it: file-or-directory, size, modification time, etc.
The "-s _" uses that info, without doing another stat() on the file.

The special use of "_" is mentioned in 'perldoc -f stat' and 'perldoc -f -X'.
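
A minimal, self-contained sketch of that sequence, assuming a scratch file made with File::Temp so it can be run anywhere (the scratch file and the variable names are illustration only):

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Create a small scratch file so the example does not depend on real data.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "abc102def\n";
close $tmp or die "close failed: $!";

my ( $size, $data );
open my $fh, '<', $path or die "Cannot open $path: $!";
if ( -f $fh ) {       # one stat call, made on the open handle
    $size = -s _;     # reuses that stat result; no second stat call
    read $fh, $data, $size;
    print "read $size bytes\n";
}
close $fh;
```

Writing -s $fh or -s $path instead would report the same size but would ask the operating system again.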
-Joe

Re: File handling and regex

on 06.11.2007 13:02:48 by Josef Moellers

jordilin wrote:
> One question: where you write
> read $fh, my $data, -s _;
> shouldn't it be
> read $fh, my $data, -s $_;
>
> I have searched the web without success. I don't know whether _
> equals $_ in this particular case.

No, it doesn't, at least not "literally" or conceptually.
"_" is the special filehandle which refers to the filehandle used in the
most recently used stat operation:

"If any of the file tests (or either the "stat" or "lstat" operators)
are given the special filehandle consisting of a solitary underline,
then the stat structure of the previous file test (or stat operator) is
used, saving a system call."
(perldoc -f -s)
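
To illustrate the quoted paragraph, here is a rough sketch in which one real file test is followed by several tests on "_". The helper name describe() is invented for this example.

```perl
use strict;
use warnings;

# describe() is a hypothetical helper: it performs a single stat call and
# answers the remaining questions from the cached stat structure.
sub describe {
    my ($path) = @_;
    return "missing"   unless -e $path;    # the only real stat call
    return "directory" if -d _;            # answered from the cached data
    return "file of " . ( -s _ ) . " bytes" if -f _;
    return "other";
}

print describe('.'), "\n";    # prints "directory"
print describe($0),  "\n";
```

The cached structure is overwritten by the next file test or stat on a real name or handle, so the "_" tests should follow the real test directly.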


--
These are my personal views and not those of Fujitsu Siemens Computers!
Josef Möllers (Pinguinpfleger bei FSC)
If failure had no penalty success would not be a prize (T. Pratchett)
Company Details: http://www.fujitsu-siemens.com/imprint.html

Re: File handling and regex

on 10.11.2007 00:31:47 by lucavilla

Thanks to all and to John in particular.

John's solution probably works, but I had difficulty adapting it to
my needs, so I ended up using this alternative solution:


use File::Find;

find(\&found, 'c:/dir');


sub found {
    unless(open(IN,"<$File::Find::name")) {
        warn "Could not open $File::Find::name: $! (SKIPPING)\n";
        return;
    }
    local $/;
    my $data=<IN>;
    close(IN);

    my($type, $number);
    if($data =~ /abc([0-9]+)def/) {
        $number=$1;
        $type=1;
    }
    elsif($data =~ /lmn([0-9]+)opq/) {
        $number=$1;
        $type=2;
    }
    elsif($data =~ /rst([0-9]+)uvw/) {
        $number=$1;
        $type=3;
    }
    else {
        warn "File $File::Find::name is unknown type\n";
        return;
    }

    my $outfn="c:/output/$number-type$type.txt";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}

Re: File handling and regex

on 10.11.2007 01:39:18 by Tad McClellan

Luca Villa wrote:


> unless(open(IN,"<$File::Find::name")) {
> warn "Could not open $File::Find::name: $! (SKIPPING)\n";
> return;
> }
> local $/;
> my $data=<IN>;
> close(IN);


If you are going to mess with the special variables anyway,
then you could replace all of that with:

local @ARGV = $_;
local $/;
my $data = <>;
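
Wrapped in a sub, the idiom might look like the sketch below. The name slurp() and the scratch file are assumptions made for the example, and it assumes in-place editing (-i or $^I) is not switched on, since <> honors that setting.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# slurp() is a hypothetical helper name. It reads a whole file through the
# magic <> handle by temporarily pointing @ARGV at a single file name.
sub slurp {
    my ($file) = @_;
    local @ARGV = ($file);    # <> reads the files named in @ARGV
    local $/;                 # undefined record separator: read it all at once
    return scalar <>;
}

# Scratch file so the example is self-contained.
my ( $tmp, $path ) = tempfile( UNLINK => 1 );
print {$tmp} "line 1\nline 2\n";
close $tmp or die "close failed: $!";

my $data = slurp($path);
print length($data), " characters read\n";
```

Because both variables are localized, @ARGV and $/ return to their previous values as soon as the sub returns.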


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Re: File handling and regex

on 10.11.2007 02:13:22 by lucavilla

> If you are going to mess with the special variables anyway,
> then you could replace all of that with:
>
> local @ARGV = $_;
> local $/;
> my $data = <>;

I received this error:
"Can't do inplace edit: . is not a regular file at c:\script.src line
12."

inplace edit? What does it want to do?

Re: File handling and regex

on 10.11.2007 14:27:36 by Tad McClellan

Luca Villa wrote:
>> If you are going to mess with the special variables anyway,
>> then you could replace all of that with:
>>
>> local @ARGV = $_;
>> local $/;
>> my $data = <>;
>
> I received this error:
> "Can't do inplace edit: . is not a regular file at c:\script.src line
> 12."


The error message has nothing to do with the code you quoted above.


> inplace edit? What does it want to do?


It wants to edit the file "inplace", that is, with the same name.

You have turned on inplace editing either with the -i command line
switch, or by setting the $^I variable somewhere...

Also, what it is trying to edit is not a file, it is a directory. You
may want to test what find() is operating on with the -d or -f filetest.
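
A rough sketch of such a guard, using a throwaway directory tree built with File::Temp (the tree and the @files array are made up for the example):

```perl
use strict;
use warnings;
use File::Find;
use File::Temp qw(tempdir);

# Build a tiny scratch tree: one file at the top, one in a subdirectory.
my $root = tempdir( CLEANUP => 1 );
mkdir "$root/sub1" or die "mkdir failed: $!";
for my $name ( "$root/foobar.txt", "$root/sub1/popo.txt" ) {
    open my $out, '>', $name or die "Cannot create $name: $!";
    print {$out} "abc4def\n";
    close $out or die "close failed: $!";
}

my @files;
find(
    sub {
        return unless -f $_;    # skip ".", subdirectories and anything else
        push @files, $File::Find::name;
    },
    $root
);

print "$_\n" for sort @files;
```

find() calls the sub for directories too, starting with "." for the root, so without the -f test the callback would try to read a directory.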


--
Tad McClellan
email: perl -le "print scalar reverse qq/moc.noitatibaher\100cmdat/"

Re: File handling and regex

on 10.11.2007 18:11:19 by lucavilla

Hi Tad,

I'm not passing any argument apart from the "script.src" file that
contains the script.

I started getting the error when I switched to the replacement block
you suggested.

This is the exact content of script.src, which gives the error
mentioned above:

use File::Find;

find(\&found, 'c:/tempebay/1');

sub found {
    local @ARGV = $_;
    local $/;
    my $data = <>;

    my($type, $number);
    if($data =~ /\s+Item number:\s+(\d+)<\/td>/) {
        $number=$1;
        $type="item_description_html";
    }
    elsif($data =~ /Item number:\s*(\d+)<\/div>/) {
        $number=$1;
        $type="buyers_history_html";
    }
    else {
        warn "File $File::Find::name is of not interesting type, for example an eBay page of item\n";
        return;
    }

    my $outfn="c:/tempebay/2/$number-$type.htm";
    if(-e $outfn) {
        warn "File $outfn already exists.\n";
        return;
    }
    unless(open(OUT,">$outfn")) {
        warn "Could not open $outfn: $!\n";
        return;
    }
    print OUT $data;
    close(OUT);
}


___

I launch it with: perl script.src
and, despite that initial error message, it actually works!

Can you see why it wants to do that in-place edit?