reading a directory, first files the newest ones

on 28.10.2007 02:10:04 by jordilin

When I read a huge directory with opendir,
opendir(DIR,"dirname");
my $file;
while($file=readdir(DIR))
whatever...
it loads the oldest ones first. I would like the newest files first,
instead of the oldest. Taking into account that I am only interested
in the newest files, this takes a lot of time, as the directory is
really huge. I am talking about thousands and thousands of files. I
need to process the files that are two hours old from now. I am not
interested in those older than two hours ago. I know that because I
check the modification time with stat.
any idea?
Thanks in advance

Re: reading a directory, first files the newest ones

on 28.10.2007 02:34:24 by xhoster

jordilin wrote:
> When I read a huge directory with opendir,
> opendir(DIR,"dirname");
> my $file;
> while($file=readdir(DIR))
> whatever...
> it loads the oldest ones first. I would like the newest files first,
> instead of the oldest.

That is completely up to your OS and your file system. Perl just provides
a fairly simple conduit for their behavior to reach you.

> Taking into account that I am only interested
> in the newest files, this takes a lot of time, as the directory is
> really huge. I am talking about thousands and thousands of files. I
> need to process the files that are two hours old from now. I am not
> interested in those older than two hours ago. I know that because I
> check the modification time with stat.
> any idea?

Come up with a better directory structure; one that doesn't involve keeping
thousands and thousands of files in one directory that has to be scanned
over and over again. Or have whatever puts the files into that directory
also write a log, or create a symbolic link in another directory
pointing to the new file; that link can be deleted after 2 hours or so.

It's possible that your OS and your file system provide other tools for
inspecting very large directories more efficiently, but I rather doubt it.
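
A minimal sketch of the symbolic-link idea, not a tested production setup. The helper names link_recent and expire_links and the side directory are hypothetical; the sketch assumes a Unix-like system where symlink() works and that the writing process can be changed to call the first helper.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;

# Hypothetical helper: called by whatever writes a new file into the big
# directory. It adds a symlink to that file in a small side directory.
sub link_recent {
    my ($path, $recent_dir) = @_;
    my $link = File::Spec->catfile($recent_dir, basename($path));
    symlink(File::Spec->rel2abs($path), $link)
        or die "Cannot link '$link' to '$path': $!";
    return $link;
}

# Hypothetical cleanup: remove links created more than $max_age seconds ago.
# lstat() inspects the link itself, not the file it points to.
sub expire_links {
    my ($recent_dir, $max_age) = @_;
    opendir(my $dh, $recent_dir) or die "Cannot open '$recent_dir': $!";
    my $removed = 0;
    while (defined(my $name = readdir $dh)) {
        next if $name eq '.' or $name eq '..';
        my $link = File::Spec->catfile($recent_dir, $name);
        my @st = lstat $link or next;
        next unless time() - $st[9] > $max_age;
        $removed++ if unlink $link;
    }
    closedir $dh;
    return $removed;
}
```

The reader then only has to scan the small side directory, for example after calling expire_links($recent_dir, 2 * 3600) from a periodic job.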

Xho

--
-------------------- http://NewsReader.Com/ --------------------
The costs of publication of this article were defrayed in part by the
payment of page charges. This article must therefore be hereby marked
advertisement in accordance with 18 U.S.C. Section 1734 solely to indicate
this fact.

Re: reading a directory, first files the newest ones

on 28.10.2007 02:36:15 by Gunnar Hjalmarsson

jordilin wrote:
> When I read a huge directory with opendir,
> opendir(DIR,"dirname");
> my $file;
> while($file=readdir(DIR))
> whatever...
> it loads the oldest ones first. I would like the newest files first,
> instead of the oldest. Taking into account that I am only interested
> in the newest files, this takes a lot of time,

How much time is that?

> as the directory is
> really huge. I am talking about thousands and thousands of files. I
> need to process the files that are two hours old from now. I am not
> interested in those older than two hours ago.

You may want to use grep() to assign to an array the files you are
interested in.

my @files = grep -M $_ <= 2/24, readdir DIR;
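
One caveat with the one-liner above: readdir returns bare names, so -M only finds the files if the current working directory is the directory being read. A hedged variant that prefixes the directory is sketched below; the sub name recent_names is made up for illustration. Note that -M measures age in days relative to the script start time ($^T), not the current time.

```perl
use strict;
use warnings;

# Return the names of plain files in $dir modified within the last
# $max_age_days days (2/24 means two hours).
sub recent_names {
    my ($dir, $max_age_days) = @_;
    opendir(my $dh, $dir) or die "Cannot open '$dir': $!";
    my @names = grep {
        my $age = -M "$dir/$_";    # undef if the stat fails
        defined($age) && $age <= $max_age_days && -f _
    } readdir $dh;
    closedir $dh;
    return @names;
}
```

This still reads the whole directory and stats every entry; it only tidies up the filtering.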

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Re: reading a directory, first files the newest ones

on 28.10.2007 02:48:51 by krahnj

jordilin wrote:
>
> When I read a huge directory with opendir,
> opendir(DIR,"dirname");

You should *always* verify that the directory opened successfully:

opendir DIR, 'dirname' or die "Cannot open 'dirname': $!";

> my $file;
> while($file=readdir(DIR))
> whatever...
> it loads the oldest ones first.

No, it reads the file names in the order that they are stored in the
directory. It is just a coincidence that the older ones appear before
the newer ones. :-)

> I would like the newest files first, instead of the oldest.

Then you will have to sort them yourself.

perldoc -f sort

> Taking into account that I am only interested
> in the newest files, this takes a lot of time, as the directory is
> really huge. I am talking about thousands and thousands of files. I
> need to process the files that are two hours old from now. I am not
> interested in those older than two hours ago. I know that because I
> check the modification time with stat.
> any idea?

The only thing you can do is read all the file names in the directory
and stat() each one.
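
A rough sketch of that read-everything-and-stat approach, written so that only one name is held in memory at a time. The sub name each_recent and its callback interface are illustrative assumptions, not anything from the original posts.

```perl
use strict;
use warnings;

# Call $callback->($name) for every plain file in $dir whose modification
# time is within the last $max_age seconds. Returns the number of matches.
sub each_recent {
    my ($dir, $max_age, $callback) = @_;
    my $cutoff = time() - $max_age;
    opendir(my $dh, $dir) or die "Cannot open '$dir': $!";
    my $count = 0;
    while (defined(my $name = readdir $dh)) {
        my @st = stat "$dir/$name" or next;   # entry vanished or unreadable
        next unless -f _;                     # reuse the stat result
        next if $st[9] < $cutoff;             # $st[9] is the mtime
        $callback->($name);
        $count++;
    }
    closedir $dh;
    return $count;
}
```

Usage might look like each_recent('dirname', 2 * 3600, sub { print "$_[0]\n" }). Every entry is still stat'ed once, so this does not reduce the disk work.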



John
--
use Perl;
program
fulfillment

Re: reading a directory, first files the newest ones

on 28.10.2007 02:57:12 by jordilin

On Oct 28, 1:36 am, Gunnar Hjalmarsson wrote:
> jordilin wrote:
> > When I read a huge directory with opendir,
> > opendir(DIR,"dirname");
> > my $file;
> > while($file=readdir(DIR))
> > whatever...
> > it loads the oldest ones first. I would like the newest files first,
> > instead of the oldest. Taking into account that I am only interested
> > in the newest files, this takes a lot of time,
>
> How much time is that?
>
> > as the directory is
> > really huge. I am talking about thousands and thousands of files. I
> > need to process the files that are two hours old from now. I am not
> > interested in those older than two hours ago.
>
> You may want to use grep() to assign to an array the files you are
> interested in.
>
> my @files = grep -M $_ <= 2/24, readdir DIR;
>
> --
> Gunnar Hjalmarsson
> Email:http://www.gunnar.cc/cgi-bin/contact.pl

To grab the files modified between two hours ago and now, I have to
process each file to check the modification time. Obviously, if the
while loop checks the oldest files first, it can take more than 10 minutes
to reach the files I am interested in. This directory has a
huge number of files.

Re: reading a directory, first files the newest ones

on 28.10.2007 03:02:34 by Gunnar Hjalmarsson

jordilin wrote:
> When I read a huge directory with opendir,
> opendir(DIR,"dirname");
> my $file;
> while($file=readdir(DIR))
> whatever...
> it loads the oldest ones first. I would like the newest files first,
> instead of the oldest. Taking into account that I am only interested
> in the newest files, this takes a lot of time, as the directory is
> really huge. I am talking about thousands and thousands of files. I
> need to process the files that are two hours old from now. I am not
> interested in those older than two hours ago.

Maybe you should let the system do the desired sorting. On *nix that
might be:

chomp( my @files = qx(ls -t $dir) );
foreach my $file (@files) {
    last if -M "$dir/$file" > 2/24;
    print "$file\n";
}

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Re: reading a directory, first files the newest ones

on 28.10.2007 03:04:46 by jurgenex

jordilin wrote:
> On Oct 28, 1:36 am, Gunnar Hjalmarsson wrote:
>> You may want to use grep() to assign to an array the files you are
>> interested in.
>>
>> my @files = grep -M $_ <= 2/24, readdir DIR;
>>
> To grab the files that are from two hours ago till now, I have to
> process each file to check the modification time.

Yes. That is what the -M does.

> Obviously, if the
> while checks the oldest files first, it can take more than 10 minutes
> to arrive for those files I am interested in.

That is exactly why Gunnar suggested not using a while() loop but grep() in
the first place.

jue

Re: reading a directory, first files the newest ones

on 28.10.2007 03:18:41 by jordilin

On Oct 28, 2:02 am, Gunnar Hjalmarsson wrote:
> jordilin wrote:
> > When I read a huge directory with opendir,
> > opendir(DIR,"dirname");
> > my $file;
> > while($file=readdir(DIR))
> > whatever...
> > it loads the oldest ones first. I would like the newest files first,
> > instead of the oldest. Taking into account that I am only interested
> > in the newest files, this takes a lot of time, as the directory is
> > really huge. I am talking about thousands and thousands of files. I
> > need to process the files that are two hours old from now. I am not
> > interested in those older than two hours ago.
>
> Maybe you should let the system do the desired sorting. On *nix that
> might be:
>
> chomp( my @files = qx(ls -t $dir) );
> foreach my $file (@files) {
> last if -M "$dir/$file" > 2/24;
> print "$file\n";
> }
>
> --
> Gunnar Hjalmarsson
> Email:http://www.gunnar.cc/cgi-bin/contact.pl

With this code, and taking into account that the directory is huge,
memory usage would be a problem, as we are going to use a huge array
@files, and the Unix server is a very important one. I don't know if
that could be achieved by means of a while loop. The real problem is having
to process many files before arriving at the interesting ones. The
solution would be reading the newest ones first. I think there is no
solution. We either have to slurp all the files into an array (which
is going to take time and memory), or process the whole directory
through a while loop (one file at a time) until we get to the proper files,
which in this case is going to take a lot of time as well.

Re: reading a directory, first files the newest ones

on 28.10.2007 03:31:35 by jordilin

On Oct 28, 2:04 am, "Jürgen Exner" wrote:
> jordilin wrote:
> > On Oct 28, 1:36 am, Gunnar Hjalmarsson wrote:
> >> You may want to use grep() to assign to an array the files you are
> >> interested in.
>
> >> my @files = grep -M $_ <= 2/24, readdir DIR;
>
> > To grab the files that are from two hours ago till now, I have to
> > process each file to check the modification time.
>
> Yes. That is what the -M does.
>
> > Obviously, if the
> > while checks the oldest files first, it can take more than 10 minutes
> > to arrive for those files I am interested in.
>
> That is exactly why Gunnar suggest not to use a while() loop but grep() in
> the first place.
>
> jue

Yeah, it seems that this would be a solution.

Re: reading a directory, first files the newest ones

on 28.10.2007 03:36:43 by Gunnar Hjalmarsson

jordilin wrote:
> On Oct 28, 2:02 am, Gunnar Hjalmarsson wrote:
>> Maybe you should let the system do the desired sorting. On *nix that
>> might be:
>>
>> chomp( my @files = qx(ls -t $dir) );
>> foreach my $file (@files) {
>> last if -M "$dir/$file" > 2/24;
>> print "$file\n";
>> }
>
> With this code, and taking into account that the directory is huge,

How big is "huge"?

> memory usage would be a problem as we are going to use a huge array
> @files, and the Unix server is a very important one. Don't know if
> that could be achieved by means of a while. The real problem is having
> to process many files before arriving to the interesting ones.

With the above suggestion you wouldn't _process_ any files but the
interesting ones; you'd just store their names in an array.

> The solution would be reading the newest ones first.

And that's what the -t option achieves...

> I think there is no solution.

??

> We have, either to slurp all the files into an array (which
> is going to take time and memory), or process the whole directory
> through a while (one file at a time) till we get the proper files,
> which in this case is going to take a lot of time as well.

Have you measured the time for various options? You may want to study
the Benchmark module.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Re: reading a directory, first files the newest ones

on 28.10.2007 03:47:38 by xhoster

Gunnar Hjalmarsson wrote:
> jordilin wrote:
> > On Oct 28, 2:02 am, Gunnar Hjalmarsson wrote:
> >> Maybe you should let the system do the desired sorting. On *nix that
> >> might be:
> >>
> >> chomp( my @files = qx(ls -t $dir) );
> >> foreach my $file (@files) {
> >> last if -M "$dir/$file" > 2/24;
> >> print "$file\n";
> >> }
> >
....
>
> > The solution would be reading the newest ones first.
>
> And that's what the -t option achieves...

No, the -t option tells ls to *present* the newest ones first, not to
read them first. To present them in that order, it first needs to read all
of the directory entries in whatever order the file system deigns to
deliver them, stat them all, and sort the results based on time. There is
no reason to think that ls is going to be meaningfully faster about this
than perl will.

Xho


Re: reading a directory, first files the newest ones

on 28.10.2007 05:29:36 by Gunnar Hjalmarsson

xhoster@gmail.com wrote:
> Gunnar Hjalmarsson wrote:
>> jordilin wrote:
>>> On Oct 28, 2:02 am, Gunnar Hjalmarsson wrote:
>>>> Maybe you should let the system do the desired sorting. On *nix that
>>>> might be:
>>>>
>>>> chomp( my @files = qx(ls -t $dir) );
>>>> foreach my $file (@files) {
>>>> last if -M "$dir/$file" > 2/24;
>>>> print "$file\n";
>>>> }
> ...
>>> The solution would be reading the newest ones first.
>> And that's what the -t option achieves...
>
> No, the -t option tells ls to *present* the newest ones first, not to
> read them first. To present them in that order, it first needs to read all
> of the directory entries in whatever order the file system deigns to
> deliver them, stat them all, and sort the results based on time.

So far I agree, but ...

> There is no reason to think that ls is going to be meaningfully
> faster about this than perl will.

... my benchmark (see below) indicates otherwise. The difference seems
to increase when the directory size increases.

$ cat sortdir.pl
use Benchmark 'cmpthese';
my $dir = '/usr/lib';
cmpthese -5, {
    Linux => sub {
        chomp( my @files = qx(ls -t $dir) );
    },
    Perl => sub {
        chdir $dir;
        opendir( my $DH, '.' );
        my @files = map { $_->[0] }
            sort { $a->[1] <=> $b->[1] } map { [ $_, -M ] }
            grep substr($_, 0, 1) ne '.', readdir $DH;
    },
};

$ perl sortdir.pl
        Rate  Perl Linux
Perl   174/s    --  -75%
Linux  693/s  297%    --

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Re: reading a directory, first files the newest ones

on 28.10.2007 10:17:03 by Juha Laiho

jordilin said:
>On Oct 28, 2:04 am, "Jürgen Exner" wrote:
>> jordilin wrote:
>> > On Oct 28, 1:36 am, Gunnar Hjalmarsson wrote:
>> >> You may want to use grep() to assign to an array the files you are
>> >> interested in.
>>
>> >> my @files = grep -M $_ <= 2/24, readdir DIR;
>
>Yeah, it seems that this would be a solution.

If you're concerned about memory use, you wouldn't use this: it will
implicitly _first_ load all the directory entries into memory, and
will start filtering on the modification dates only after the directory
scan has been completed.

A small benchmark, to run stat() on all files in a directory (a directory
with one million files, names 000000 .. 999999):

$ time perl -e 'opendir($DH,"."); while ($f=readdir($DH)) { stat($f); }; closedir($DH);'

real 8m0.728s
user 0m2.367s
sys 0m21.523s

While running this, the memory usage (according to "top") stayed fairly
constant at about 7MB.


Another way to do the same - which would behave like the "grep" example:

$ time perl -e 'opendir($DH,"."); foreach $f (readdir($DH)) { stat($f); }; closedir($DH);'

real 11m9.247s
user 0m1.957s
sys 0m21.953s

With this approach, the memory usage initially climbed to reach approx. 50MB,
and remained there until the completion.
--
Wolf a.k.a. Juha Laiho Espoo, Finland
(GC 3.0) GIT d- s+: a C++ ULSH++++$ P++@ L+++ E- W+$@ N++ !K w !O !M V
PS(+) PE Y+ PGP(+) t- 5 !X R !tv b+ !DI D G e+ h---- r+++ y++++
"...cancel my subscription to the resurrection!" (Jim Morrison)

Re: reading a directory, first files the newest ones

on 28.10.2007 12:03:30 by hjp-usenet2

On 2007-10-28 02:18, jordilin wrote:
> On Oct 28, 2:02 am, Gunnar Hjalmarsson wrote:
>> jordilin wrote:
>> > When I read a huge directory with opendir,
>> > opendir(DIR,"dirname");
>> > my $file;
>> > while($file=readdir(DIR))
>> > whatever...
>> > it loads the oldest ones first. I would like the newest files first,
>> > instead of the oldest. Taking into account that I am only interested
>> > in the newest files, this takes a lot of time, as the directory is
>> > really huge. I am talking about thousands and thousands of files. I
>> > need to process the files that are two hours old from now. I am not
>> > interested in those older than two hours ago.
>>
>> Maybe you should let the system do the desired sorting. On *nix that
>> might be:
>>
>> chomp( my @files = qx(ls -t $dir) );
>> foreach my $file (@files) {
>> last if -M "$dir/$file" > 2/24;
>> print "$file\n";
>> }
>
> With this code, and taking into account that the directory is huge,
> memory usage would be a problem as we are going to use a huge array
> @files, and the Unix server is a very important one.

That would be easily remedied by reading from a pipe. But I don't think
Gunnar's suggestion is really faster. It needs to read the
directory and stat all the files (which takes the same time as your
code), *then* it needs to sort them (which takes additional time),
*then* your code needs to read the sorted list.
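
For illustration, reading the ls output through a pipe might look like the sketch below. The sub name newest_first is made up, and the sketch assumes a Unix-like ls plus file names without embedded newlines. It keeps Perl's memory use flat, but ls itself still has to read, stat and sort the whole directory before it prints the first line.

```perl
use strict;
use warnings;

# Walk $dir newest-first via "ls -t" and stop at the first file older
# than $max_age_days days. Returns the number of files handed to $callback.
sub newest_first {
    my ($dir, $max_age_days, $callback) = @_;
    # The list form of open bypasses the shell, so $dir needs no quoting.
    open(my $ls, '-|', 'ls', '-t', '--', $dir)
        or die "Cannot run ls: $!";
    my $count = 0;
    while (defined(my $name = <$ls>)) {
        chomp $name;
        my $age = -M "$dir/$name";
        next unless defined $age;        # vanished between ls and stat
        last if $age > $max_age_days;    # everything after this is older
        $callback->($name);
        $count++;
    }
    close $ls;    # ls may exit on SIGPIPE after an early "last"; ignored here
    return $count;
}
```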

> Don't know if that could be achieved by means of a while. The real
> problem is having to process many files before arriving to the
> interesting ones. The solution would be reading the newest ones first.
> I think there is no solution.

The solution, as Xho suggested, is to come up with a better directory
structure.

If you can't do that:

Is there any way you can deduce the age of the files from the file name?
If you can avoid stat'ing all these files it will be a lot faster. You
don't need the exact age - if you can determine, from the filename
alone, that a file is surely older than two hours you don't have to stat
it.
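
As a purely hypothetical illustration of the filename idea: suppose the names start with a UTC timestamp such as 20071028021004.dat. Nothing in this thread says the real files are named that way; the pattern and the sub names name_epoch and needs_stat are assumptions for the sketch.

```perl
use strict;
use warnings;
use Time::Local qw(timegm);

# Extract an epoch time from a name that starts with YYYYMMDDHHMMSS (UTC).
# Returns undef when the name does not match or the date is invalid.
sub name_epoch {
    my ($name) = @_;
    my ($y, $mo, $d, $h, $mi, $s) =
        $name =~ /^(\d{4})(\d{2})(\d{2})(\d{2})(\d{2})(\d{2})/
        or return;
    return eval { timegm($s, $mi, $h, $d, $mo - 1, $y) };
}

# True when the file still has to be stat'ed: either the name tells us
# nothing, or the name says it may be newer than $cutoff (an epoch time).
sub needs_stat {
    my ($name, $cutoff) = @_;
    my $epoch = name_epoch($name);
    return 1 unless defined $epoch;
    return $epoch >= $cutoff ? 1 : 0;
}
```

In a readdir loop, something like next unless needs_stat($name, time() - 2 * 3600) would skip the stat call for names that are clearly too old.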

Do these files get written once, or are they constantly updated? If it's
the former, you can cache their last-modified-dates. Reading them from
a file or memcached is likely to be a lot faster than a stat.
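
A small sketch of such a cache, assuming the files really are written once and never touched again. It uses the core Storable module to keep a hash of path to mtime on disk; the sub names and the idea of a single cache file are illustrative choices, and memcached would be an alternative store.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

# Load the cache hash from $cache_file, or start empty if there is none.
sub load_cache {
    my ($cache_file) = @_;
    return -e $cache_file ? retrieve($cache_file) : {};
}

# Return the mtime of $path, calling stat only on a cache miss.
sub cached_mtime {
    my ($cache, $path) = @_;
    return $cache->{$path} if exists $cache->{$path};
    my @st = stat $path or return;    # missing file: nothing to cache
    return $cache->{$path} = $st[9];
}

# Write the cache hash back to disk for the next run.
sub save_cache {
    my ($cache, $cache_file) = @_;
    store($cache, $cache_file);
    return;
}
```

The directory still has to be read to discover new names, and entries for deleted files would have to be pruned now and then.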

hp


--
_ | Peter J. Holzer | It took a genius to create [TeX],
|_|_) | Sysadmin WSR | and it takes a genius to maintain it.
| | | hjp@hjp.at | That's not engineering, that's art.
__/ | http://www.hjp.at/ | -- David Kastrup in comp.text.tex

Re: reading a directory, first files the newest ones

on 28.10.2007 12:10:50 by hjp-usenet2

On 2007-10-28 04:29, Gunnar Hjalmarsson wrote:
> xhoster@gmail.com wrote:
>> There is no reason to think that ls is going to be meaningfully
>> faster about this than perl will.
>
> ... my benchmark (see below) indicates otherwise. The difference seems
> to increase when the directory size increases.
>
> $ cat sortdir.pl
> use Benchmark 'cmpthese';
> my $dir = '/usr/lib';
> cmpthese -5, {
> Linux => sub {
> chomp( my @files = qx(ls -t $dir) );
> },
> Perl => sub {
> chdir $dir;
> opendir( my $DH, '.' );
> my @files = map { $_->[0] }
> sort { $a->[1] <=> $b->[1] } map { [ $_, -M ] }
> grep substr($_, 0, 1) ne '.', readdir $DH;
> },
> };
>
> $ perl sortdir.pl
> Rate Perl Linux
> Perl 174/s -- -75%
> Linux 693/s 297% --
>

Your benchmark isn't valid: You are processing the complete directory
several hundred times per second, which indicates that it fits
completely into the buffer cache. After the first time you are measuring
mostly the processing time of ls and perl, not disk accesses.
Jordilin wrote that it takes about 10 minutes to process the directory
just once, which indicates that it either doesn't fit into the cache, or
that it is evicted from the cache between runs (which is quite likely on
a busy system), so he does have to access the disk for every file.

hp

--
_ | Peter J. Holzer | It took a genius to create [TeX],
|_|_) | Sysadmin WSR | and it takes a genius to maintain it.
| | | hjp@hjp.at | That's not engineering, that's art.
__/ | http://www.hjp.at/ | -- David Kastrup in comp.text.tex

Re: reading a directory, first files the newest ones

on 28.10.2007 17:58:46 by Mark Clements

jordilin wrote:
> When I read a huge directory with opendir,
> opendir(DIR,"dirname");
> my $file;
> while($file=readdir(DIR))
> whatever...
> it loads the oldest ones first. I would like the newest files first,
> instead of the oldest. Taking into account that I am only interested
> in the newest files, this takes a lot of time, as the directory is
> really huge. I am talking about thousands and thousands of files. I
> need to process the files that are two hours old from now. I am not
> interested in those older than two hours ago. I know that because I
> check the modification time with stat.
> any idea?
> Thanks in advance
>
Maybe you could do something with File::Monitor, although I know nothing
of its efficiency with large directories. FAM may also be worth a look.

Mark

Re: reading a directory, first files the newest ones

on 29.10.2007 06:31:51 by Gunnar Hjalmarsson

Peter J. Holzer wrote:
> On 2007-10-28 04:29, Gunnar Hjalmarsson wrote:
>> xhoster@gmail.com wrote:
>>> There is no reason to think that ls is going to be meaningfully
>>> faster about this than perl will.
>> ... my benchmark (see below) indicates otherwise. The difference seems
>> to increase when the directory size increases.
>>
>> $ cat sortdir.pl
>> use Benchmark 'cmpthese';
>> my $dir = '/usr/lib';
>> cmpthese -5, {
>> Linux => sub {
>> chomp( my @files = qx(ls -t $dir) );
>> },
>> Perl => sub {
>> chdir $dir;
>> opendir( my $DH, '.' );
>> my @files = map { $_->[0] }
>> sort { $a->[1] <=> $b->[1] } map { [ $_, -M ] }
>> grep substr($_, 0, 1) ne '.', readdir $DH;
>> },
>> };
>>
>> $ perl sortdir.pl
>> Rate Perl Linux
>> Perl 174/s -- -75%
>> Linux 693/s 297% --
>
> Your benchmark isn't valid: You are processing the complete directory
> several hundred times per second, which indicates that it fits
> completely into the buffer cache. After the first time you are measuring
> mostly the processing time of ls and perl, not disk accesses.

And that's what we were discussing, so I can't see that the benchmark
wouldn't be valid.

--
Gunnar Hjalmarsson
Email: http://www.gunnar.cc/cgi-bin/contact.pl

Re: reading a directory, first files the newest ones

on 30.10.2007 00:07:09 by hjp-usenet2

On 2007-10-29 05:31, Gunnar Hjalmarsson wrote:
> Peter J. Holzer wrote:
>> On 2007-10-28 04:29, Gunnar Hjalmarsson wrote:
>>> xhoster@gmail.com wrote:
>>>> There is no reason to think that ls is going to be meaningfully
>>>> faster about this than perl will.
>>> ... my benchmark (see below) indicates otherwise. The difference seems
>>> to increase when the directory size increases.
[...]
>>> $ perl sortdir.pl
>>> Rate Perl Linux
>>> Perl 174/s -- -75%
>>> Linux 693/s 297% --
>>
>> Your benchmark isn't valid: You are processing the complete directory
>> several hundred times per second, which indicates that it fits
>> completely into the buffer cache. After the first time you are measuring
>> mostly the processing time of ls and perl, not disk accesses.
>
> And that's what we were discussing,

If you have been discussing this, you totally missed the OP's problem.

The OP has to read a directory which - and I repeat this - takes more
than ten *minutes* to read. Your benchmark reads the directory in 5.7
(perl) or 1.4 (ls) *milliseconds*. That's a difference of more than four
orders of magnitude!

That tells us that the OP reads from a cold cache, while you read from a
hot cache: In your case CPU time is the dominant factor, so ls (being
written in C) will be faster. In the OP's case disk access time will be
the dominant factor, and any CPU usage advantage from ls will be
completely insignificant (and indeed ls may take more time because it
has to sort the files *after* having stat'ed them).

The only way to speed up this program is to reduce the number of disk
accesses. Xho already suggested the way with the most potential: Change the
directory structure - having directories with hundreds of thousands or
millions of files is not a good idea, even on filesystems with tree- or
hash-structured directories. I asked if another one is feasible:
Estimating the age from the filename so you don't have to stat each one.

There is a third one, but I don't think this works in pure perl, because
you need access to information readdir doesn't deliver: Read the
directory, sort by inode number, then stat the files in order. This
doesn't reduce the number of stat calls, but it reduces the number of
disk seeks which may provide a major speedup (there's a patch for mutt
which uses this technique for maildirs, and it's really a lot faster for
large mailboxes).
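
Only the ordering step of that third idea can be sketched here. The sub below assumes that some external source, for example an ls that supports the -f and -i options, supplies lines of the form "inode name"; whether producing those lines avoids a stat per file depends on the implementation and is not verified here. The sub name names_by_inode is hypothetical.

```perl
use strict;
use warnings;

# Take lines of the form "<inode> <name>" and return the names sorted by
# inode number, so that later stat calls walk the disk in a steadier order.
sub names_by_inode {
    my @pairs;
    for my $line (@_) {
        chomp(my $copy = $line);
        my ($ino, $name) = $copy =~ /^\s*(\d+)\s+(.+)$/ or next;
        push @pairs, [ $ino, $name ];
    }
    return map { $_->[1] } sort { $a->[0] <=> $b->[0] } @pairs;
}
```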

hp


--
_ | Peter J. Holzer | It took a genius to create [TeX],
|_|_) | Sysadmin WSR | and it takes a genius to maintain it.
| | | hjp@hjp.at | That's not engineering, that's art.
__/ | http://www.hjp.at/ | -- David Kastrup in comp.text.tex