Optimized count of files in tree
on 12.08.2007 15:46:36 by patrick
Hello,
In an application I am writing in Perl, I need to count the total number of
files (not directories) in a complete directory tree.
What is the most efficient way to do this?
My current code is:
use File::Recurse;
my $nb = 0;
recurse { -f && $nb++ } $dir;
With this code, I can scan 10,000 files in 15 seconds.
I would like to know whether there is a better (i.e. quicker) way to do it.
Thanks for your help.
Patrick
Re: Optimized count of files in tree
on 12.08.2007 15:57:19 by Paul Lalli
On Aug 12, 9:46 am, Patrick wrote:
> In an application I am writing in Perl, I need to count the total number of
> files (not directories) in a complete directory tree.
>
> What is the most efficient way to do this?
>
> My current code is:
>
> use File::Recurse;
> my $nb = 0;
> recurse { -f && $nb++ } $dir;
>
> With this code, I can scan 10,000 files in 15 seconds.
>
> I would like to know whether there is a better (i.e. quicker) way to do it.
I don't know the answer for sure, but looking at File::Recurse, it
seems to be a bit bloated for what you want to do. In addition to
recursing through the directory structure, it also checks a hash of
options for each file found, and stores information about each entry
to be later returned from the recurse() function.
I would try just using the standard File::Find module and see if it's
any faster.
use File::Find;
my $nb = 0;
find(sub { -f and $nb++ }, $dir);
You may also wish to use the standard Benchmark module to actually
compare the two techniques.
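For example, a minimal Benchmark sketch of such a comparison might look like this (untested here; it assumes $dir is the top of the tree to scan and that File::Recurse is installed from CPAN):

use strict;
use warnings;
use Benchmark qw(timethese);
use File::Find;
use File::Recurse;

my $dir = shift @ARGV or die "usage: $0 <directory>\n";

# Run each counting technique a few times and compare the timings.
timethese(5, {
    'File::Find'    => sub {
        my $nb = 0;
        find(sub { -f and $nb++ }, $dir);
    },
    'File::Recurse' => sub {
        my $nb = 0;
        recurse { -f && $nb++ } $dir;
    },
});

Keep in mind that for I/O-bound code like this, the second and later runs will mostly be served from the filesystem cache, so the numbers end up comparing CPU overhead rather than cold-cache disk time.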
Paul Lalli
Re: Optimized count of files in tree
on 12.08.2007 16:44:14 by patrick
Paul Lalli wrote:
> On Aug 12, 9:46 am, Patrick wrote:
>> In an application I am writing in Perl, I need to count the total number of
>> files (not directories) in a complete directory tree.
>>
>> What is the most efficient way to do this?
>>
>> My current code is:
>>
>> use File::Recurse;
>> my $nb = 0;
>> recurse { -f && $nb++ } $dir;
>>
>> With this code, I can scan 10,000 files in 15 seconds.
>>
>> I would like to know whether there is a better (i.e. quicker) way to do it.
>
> I don't know the answer for sure, but looking at File::Recurse, it
> seems to be a bit bloated for what you want to do. In addition to
> recursing through the directory structure, it also checks a hash of
> options for each file found, and stores information about each entry
> to be later returned from the recurse() function.
>
> I would try just using the standard File::Find module and see if it's
> any faster.
>
> use File::Find;
> my $nb = 0;
> find(sub { -f and $nb++ }, $dir);
>
> You may also wish to use the standard Benchmark module to actually
> compare the two techniques.
>
> Paul Lalli
>
Thanks for your answer, but I have tried File::Find: I got exactly
the same time for the same tree: 16 seconds for 10,000 files.
I have also tried a "manual" solution:

sub getFileNb {
    my $dir = shift;
    my $nb = 0;

    return 0 if ! opendir DIR, $dir;
    my @list = readdir DIR;
    closedir DIR;

    $nb += grep { -f "$dir/$_" } @list;
    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;

    foreach ( @subdirs ) {
        $nb += &getFileNb("$dir/$_");
    }

    return $nb;
}
The result is about the same: 16 seconds for my 10,000 files.
I am looking for significantly better performance...
Patrick
Re: Optimized count of files in tree
on 12.08.2007 17:40:57 by Martijn Lievaart
On Sun, 12 Aug 2007 16:44:14 +0200, Patrick wrote:
> Thanks for your answer, but I have tried File::Find: I got exactly
> the same time for the same tree: 16 seconds for 10,000 files.
>
> I have also tried a "manual" solution:
>
> sub getFileNb {
>     my $dir = shift;
>     my $nb = 0;
>
>     return 0 if ! opendir DIR, $dir;
>     my @list = readdir DIR;
>     closedir DIR;
>
>     $nb += grep { -f "$dir/$_" } @list;
>     my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;
>
>     foreach ( @subdirs ) {
>         $nb += &getFileNb("$dir/$_");
>     }
>
>     return $nb;
> }
>
> The result is about the same: 16 seconds for my 10,000 files.
>
> I am looking for significantly better performance...
Looks like I/O is the bottleneck. If you are on Unix, do a
$ time find . -type f | wc -l
and see if that is significantly faster. If not, get a faster hard disk,
put the files on another (faster) disk, switch to RAID, add more
memory so you have more buffers, tune OS parameters, or use another type of
filesystem. Mix and match to taste.
Note that
$ time yourscript.pl
will give you insight into how much time is spent waiting on I/O.
real - (user + sys) is the time spent waiting on other tasks
and on I/O. On a lightly loaded system, this will be mainly I/O.
Look at this:
$ time find . -type f | wc -l
49956
real 0m14.594s
user 0m0.216s
sys 0m1.609s
The find command itself took only a fraction of the total time: about a
second and a half was spent in the kernel, and roughly 13 seconds were
spent waiting on I/O.
HTH,
M4
Re: Optimized count of files in tree
on 12.08.2007 19:07:33 by Michele Dondi
On Sun, 12 Aug 2007 16:44:14 +0200, Patrick wrote:
>I have also tried a "manual" solution:
>
>sub getFileNb {
>    my $dir = shift;
>    my $nb = 0;
>
>    return 0 if ! opendir DIR, $dir;
>    my @list = readdir DIR;
>    closedir DIR;
>
>    $nb += grep { -f "$dir/$_" } @list;
>    my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;
You're stat()ing each entry twice (once for -f and once for -d). That is time-consuming; you'd better do it only once (see the sketch at the end of this post).
>
>    foreach ( @subdirs ) {
>        $nb += &getFileNb("$dir/$_");
>    }
The &-form of sub call is obsolete and not likely to do what you mean.
Just avoid it.
>The result is about the same: 16 seconds for my 10,000 files.
>
>I am looking for significantly better performance...
I wouldn't go for a recursive solution then, but for an iterative one.
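Something along these lines, perhaps. This is only a rough sketch (the count_files name and the overall layout are mine, untested on your tree): it walks the tree with an explicit worklist instead of recursion, and calls lstat only once per entry, reusing the cached result through the special _ filehandle:

use strict;
use warnings;

sub count_files {
    my ($start) = @_;
    my $nb   = 0;
    my @todo = ($start);                   # directories still to be scanned

    while (@todo) {
        my $dir = shift @todo;
        opendir my $dh, $dir or next;
        for my $entry (readdir $dh) {
            next if $entry eq '.' || $entry eq '..';
            my $path = "$dir/$entry";
            lstat $path or next;           # one stat() call per entry
            if    (-f _) { $nb++ }                 # plain file: count it
            elsif (-d _) { push @todo, $path }     # directory: scan it later
        }
        closedir $dh;
    }
    return $nb;
}

print count_files($ARGV[0] || '.'), "\n";

Note that lstat deliberately does not follow symlinks, so a symlink pointing at a file is not counted as a file here; whether that is what you want is a separate question.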
Michele
Re: Optimized count of files in tree
on 13.08.2007 05:35:23 by merlyn
>>>>> "Patrick" == Patrick writes:
Patrick> In an application I am writing in Perl, I need to count the total number of
Patrick> files (not directories) in a complete directory tree.
You haven't clarified whether a symlink pointing at a file should be
counted as a separate file or not. Your code counts it, even if it
points at a file outside your starting tree.
Patrick> recurse { -f && $nb++ } $dir;
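For what it's worth, if symlinks should not be counted, one possible variant (my sketch, assuming $dir as in the earlier posts) is to examine the entry itself with lstat instead of letting -f follow the link:

use File::Find;

my $nb = 0;
find(sub {
    lstat $_;          # look at the directory entry itself, not the link target
    $nb++ if -f _;     # true only for plain files; symlinks fail this test
}, $dir);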
--
Randal L. Schwartz - Stonehenge Consulting Services, Inc. - +1 503 777 0095
Perl/Unix/security consulting, Technical writing, Comedy, etc. etc.
See PerlTraining.Stonehenge.com for onsite and open-enrollment Perl training!
Re: Optimized count of files in tree
on 13.08.2007 10:58:42 by Ingo Menger
On Aug 12, 17:40, Martijn Lievaart wrote:
> On Sun, 12 Aug 2007 16:44:14 +0200, Patrick wrote:
> > Thanks for your answer, but I have tried File::Find: I got exactly
> > the same time for the same tree: 16 seconds for 10,000 files.
>
> > I have also tried a "manual" solution:
>
> > sub getFileNb {
> >     my $dir = shift;
> >     my $nb = 0;
> >
> >     return 0 if ! opendir DIR, $dir;
> >     my @list = readdir DIR;
> >     closedir DIR;
> >
> >     $nb += grep { -f "$dir/$_" } @list;
> >     my @subdirs = grep { /^[^\.]/ && -d "$dir/$_" } @list;
> >
> >     foreach ( @subdirs ) {
> >         $nb += &getFileNb("$dir/$_");
> >     }
> >
> >     return $nb;
> > }
>
> > The result is about the same: 16 seconds for my 10,000 files.
>
> > I am looking for significantly better performance...
>
> Looks like I/O is the bottleneck. If you are on Unix, do a
>
> $ time find . -type f | wc -l
>
> and see if that is significantly faster. If not, get a faster hard disk,
> put the files on another (faster) disk, switch to RAID, add more
> memory so you have more buffers, tune OS parameters, or use another type of
> filesystem. Mix and match to taste.
And, in addition, repeat the command to see whether it gets faster
the second time. Most likely, the directory blocks to be read will
already be in memory, so the second attempt may be much faster.
Re: Optimized count of files in tree
on 13.08.2007 22:29:44 by Martijn Lievaart
On Mon, 13 Aug 2007 01:58:42 -0700, Ingo Menger wrote:
> And, in addition, repeat the command to see whether it gets faster the
> second time. Most likely, the directory blocks to be read will already
> be in memory, so the second attempt may be much faster.
Good point! Very important to keep in mind.
M4