Storing file information in memory

on 15.11.2007 22:35:11 by Peter Paul Jansen

I'm writing a command line utility to move some files. I'm dealing with
thousands of files and I was wondering if anyone had any suggestions.

This is what I have currently:

$arrayVirtualFile = array(
    'filename'    => 'filename',
    'basename'    => 'filename.ext',
    'extension'   => 'ext',
    'size'        => 0,
    'dirname'     => '',
    'uxtimestamp' => ''
);

I then loop through a directory and for each file I populate the $arrayVirtualFile
and add it to $arrayOfVirtualFiles.
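
The loop itself is basically this (a sketch with an example path; the exact
calls in my script may differ):

<?php
$arrayOfVirtualFiles = array();
foreach (glob('/some/dir/*') as $f) {
    if (!is_file($f)) { continue; }
    $info = pathinfo($f);
    $arrayOfVirtualFiles[] = array(
        'filename'    => $info['filename'],
        'basename'    => $info['basename'],
        'extension'   => isset($info['extension']) ? $info['extension'] : '',
        'size'        => filesize($f),
        'dirname'     => $info['dirname'],
        'uxtimestamp' => filemtime($f),
    );
}
?>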
A directory of ~2500 files takes up about ~1.7 MB of memory when I run
the script.
Anyone have any suggestions as to how to take up less space?

Thanks!!



Re: Storing file information in memory

on 15.11.2007 22:49:42 by Steve

"deciacco" wrote in message
news:c6318fb5a59c46bf8a1e094964de9d6e@ghytred.com...
> I'm writing a command line utility to move some files. I'm dealing with
> thousands of files and I was wondering if anyone had any suggestions.
>
> This is what I have currently:
>
> $arrayVirtualFile =
> array( 'filename'=>'filename',
> 'basename'=>'filename.ext',
> 'extension'=>'ext',
> 'size'=>0,
> 'dirname'=>'',
> 'uxtimestamp'=>'');
>
> I then loop through a directory and for each file I populate the
> $arrayVirtualFile
> and add it to $arrayOfVirtualFiles.
> A directory of ~2500 files takes up about ~1.7 MB of memory when I run
> the script.
> Anyone have any suggestions as to how to take up less space?

well, that all depends on what you're doing with that information. plus, your
array structure is a moot point. why not just store the file names in an
array? when you need all that info, just use the pathinfo() function. with
just that, you have the file name, basename, extension, path...all
you need now is to call fstat() to get the size and the touch time. that
should knock down your memory consumption monumentally. plus, using pathinfo
and fstat will give you a bunch more information than your current
structure.

so, store minimally what you need. then use functions to get the info when
you need it. but again, you should really define what you're doing this all
for...as in, once you have that info, what are you doing?
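
to illustrate (a sketch only...the path is an example, and i'm using stat()
here instead of fopen()/fstat() to keep it short):

<?php
// keep just the names in memory
$names = glob('/some/dir/*');

// pull the details only when a file is actually being looked at
foreach ($names as $name) {
    if (!is_file($name)) { continue; }
    $info = pathinfo($name);   // dirname, basename, extension
    $stat = stat($name);       // size, mtime, atime, ctime, ...
    // compare $stat['size'] / $stat['mtime'] here
}
?>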

Re: Storing file information in memory

on 15.11.2007 23:07:13 by Peter Paul Jansen

thanks for the reply steve...

basically, i want to collect the file information into memory so that I can
then do analysis, like compare file times and sizes. it's much faster to do
this in memory than to do it from disk. should have mentioned this earlier
as you said...

"Steve" wrote in message
news:W93%i.227$dY3.203@newsfe02.lga...
>
> "deciacco" wrote in message
> news:c6318fb5a59c46bf8a1e094964de9d6e@ghytred.com...
>> I'm writing a command line utility to move some files. I'm dealing with
>> thousands of files and I was wondering if anyone had any suggestions.
>>
>> This is what I have currently:
>>
>> $arrayVirtualFile =
>> array( 'filename'=>'filename',
>> 'basename'=>'filename.ext',
>> 'extension'=>'ext',
>> 'size'=>0,
>> 'dirname'=>'',
>> 'uxtimestamp'=>'');
>>
>> I then loop through a directory and for each file I populate the
>> $arrayVirtualFile
>> and add it to $arrayOfVirtualFiles.
>> A directory of ~2500 files takes up about ~1.7 MB of memory when I run
>> the script.
>> Anyone have any suggestions as to how to take up less space?
>
> well, that all depends on what you're doing with that information. plus, your
> array structure is a moot point. why not just store the file names in an
> array? when you need all that info, just use the pathinfo() function. with
> just that, you have the file name, basename, extension, path...all
> you need now is to call fstat() to get the size and the touch time. that
> should knock down your memory consumption monumentally. plus, using
> pathinfo and fstat will give you a bunch more information than your
> current structure.
>
> so, store minimally what you need. then use functions to get the info when
> you need it. but again, you should really define what you're doing this
> all for...as in, once you have that info, what are you doing?
>

Re: Storing file information in memory

on 16.11.2007 11:40:24 by Courtney

deciacco wrote:
> thanks for the reply steve...
>
> basically, i want to collect the file information into memory so that I can
> then do analysis, like compare file times and sizes. it's much faster to do
> this in memory than to do it from disk. should have mentioned this earlier
> as you said...
>

Why do you care how much memory it takes?

1.7MB is not very much.

Re: Storing file information in memory

on 16.11.2007 15:00:37 by Steve

"The Natural Philosopher" wrote in message
news:1195209624.8024.5@proxy00.news.clara.net...
> deciacco wrote:
>> thanks for the reply steve...
>>
>> basically, i want to collect the file information into memory so that I
>> can then do analysis, like compare file times and sizes. it's much faster
>> to do this in memory than to do it from disk. should have mentioned this
>> earlier as you said...
>>
>
> Why do you care how much memory it takes?
>
> 1.7MB is not very much.

why do you care if he cares?

solve the problem!

Re: Storing file information in memory

on 16.11.2007 15:43:29 by Peter Paul Jansen

These days memory is not an issue, but that does not mean we shouldn't write
good, efficient code that utilizes memory well.

While 1.7MB is not much, that is what is generated when I look at ~2500
files. I have approximately 175000 files to look at and my script uses up
about 130MB. I was simply wondering if someone out there with more
experience, had a better way of doing this that would utilize less memory.

"The Natural Philosopher" wrote in message
news:1195209624.8024.5@proxy00.news.clara.net...
> deciacco wrote:
>> thanks for the reply steve...
>>
>> basically, i want to collect the file information into memory so that I
>> can then do analysis, like compare file times and sizes. it's much faster
>> to do this in memory than to do it from disk. should have mentioned this
>> earlier as you said...
>>
>
> Why do you care how much memory it takes?
>
> 1.7MB is not very much.

Re: Storing file information in memory

on 16.11.2007 16:12:16 by Jerry Stuckle

deciacco wrote:
> "The Natural Philosopher" wrote in message
> news:1195209624.8024.5@proxy00.news.clara.net...
>> deciacco wrote:
>>> thanks for the reply steve...
>>>
>>> basically, i want to collect the file information into memory so that I
>>> can then do analysis, like compare file times and sizes. it's much faster
>>> to do this in memory than to do it from disk. should have mentioned this
>>> earlier as you said...
>>>
>> Why do you care how much memory it takes?
>>
>> 1.7MB is not very much.
>
> These days memory is not an issue, but that does not mean we shouldn't
> write good, efficient code that utilizes memory well.
>

There is also something known as "premature optimization".

> While 1.7MB is not much, that is what is generated when I look at
> ~2500 files. I have approximately 175000 files to look at and my
> script uses up about 130MB. I was simply wondering if someone out
> there with more experience, had a better way of doing this that would
> utilize less memory.
>

(Top posting fixed)

How are you figuring your 1.7MB? If you're just looking at how much
memory is being used by the process, for instance, there will be a lot
of other things in there, also - like your code.

1.7MB for 2500 files comes out to just under 700 bytes per entry, which
seems rather large to me. But it also depends on just how much
you're storing in the array (i.e. how long your path names are).

I also wonder why you feel a need to store so much info in memory, but
I'm sure you have a good reason.

P.S. Please don't top post. Thanks.


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: Storing file information in memory

on 16.11.2007 16:50:18 by Peter Paul Jansen

"Jerry Stuckle" wrote in message
news:OaSdnU7vl-_ELqDanZ2dnUVZ_hadnZ2d@comcast.com...
> deciacco wrote:
>> "The Natural Philosopher" wrote in message
>> news:1195209624.8024.5@proxy00.news.clara.net...
>>> deciacco wrote:
>>>> thanks for the reply steve...
>>>> basically, i want to collect the file information into memory so
>>>> that I can then do analysis, like compare file times and sizes.
>>>> it's much faster to do this in memory than to do it from disk.
>>>> should have mentioned this earlier as you said...
>>> Why do you care how much memory it takes?
>>> 1.7MB is not very much.
>> These days memory is not an issue, but that does not mean we shouldn't
>> write good, efficient code that utilizes memory well.
> There is also something known as "premature optimization".
>> While 1.7MB is not much, that is what is generated when I look at
>> ~2500 files. I have approximately 175000 files to look at and my
>> script uses up about 130MB. I was simply wondering if someone out
>> there with more experience, had a better way of doing this that would
>> utilize less memory.
> (Top posting fixed)
> How are you figuring your 1.7Mb? If you're just looking at how much
> memory is being used by the process, for instance, there will be a lot of
> other things in there, also - like your code.
> 1.7Mb for 2500 files comes out to just under 700 bytes per entry, which
> seems rather a bit large to me. But it also depends on just how much
> you're storing in the array (i.e. how long are your path names).
> I also wonder why you feel a need to store so much info in memory, but I'm
> sure you have a good reason.
> P.S. Please don't top post. Thanks.

Jerry...

I use Outlook Express and it does top-posting by default. Didn't realize
top-posting was bad.

To answer your questions:

"Premature Optimization"
I first noticed this problem in my first version of the program. It ran much
slower and took up five times as much memory. I realized I needed to rethink
my code.

"Figuring Memory Use"
To get the amount of memory used, I take a reading with memory_get_usage()
at the start of the code in question and then take another reading at the
end of the snippet. I then take the difference and that should give me a
good idea of the amount of memory my code is utilizing.
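
For illustration, a stripped-down version of that measurement (the path and
the per-file data here are just examples, not my real code):

<?php
$before = memory_get_usage();

$arrayOfVirtualFiles = array();
foreach (glob('/some/dir/*') as $f) {      // example path
    if (!is_file($f)) { continue; }
    $arrayOfVirtualFiles[] = pathinfo($f); // stand-in for the fuller per-file array
}

$after = memory_get_usage();
echo 'bytes used: ' . ($after - $before) . "\n";
?>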

"Feel the Need"
The first post shows you an array of the type of data I store. This array
gets created for each file and added as an item to another array. In other
words, an array of arrays. As I mentioned in a follow-up posting, the reason
I'm doing this is because I want to do some analysis of file information,
like comparing file times and sizes from two separate directories. This is
much faster in memory than on disk.

Re: Storing file information in memory

on 16.11.2007 17:36:02 by Steve

"deciacco" wrote in message
news:XrmdnVeoe9faJqDanZ2dnUVZ_vqpnZ2d@giganews.com...
> "Jerry Stuckle" wrote in message
> news:OaSdnU7vl-_ELqDanZ2dnUVZ_hadnZ2d@comcast.com...
>> deciacco wrote:
>>> "The Natural Philosopher" wrote in message
>>> news:1195209624.8024.5@proxy00.news.clara.net...
>>>> deciacco wrote:
>>>>> thanks for the reply steve...
>>>>> basically, i want to collect the file information into memory so
>>>>> that I can then do analysis, like compare file times and sizes.
>>>>> it's much faster to do this in memory than to do it from disk.
>>>>> should have mentioned this earlier as you said...
>>>> Why do you care how much memory it takes?
>>>> 1.7MB is not very much.
>>> These days memory is not an issue, but that does not mean we shouldn't
>>> write good, efficient code that utilizes memory well.
>> There is also something known as "premature optimization".
>>> While 1.7MB is not much, that is what is generated when I look at
>>> ~2500 files. I have approximately 175000 files to look at and my
>>> script uses up about 130MB. I was simply wondering if someone out
>>> there with more experience, had a better way of doing this that would
>>> utilize less memory.
>> (Top posting fixed)
>> How are you figuring your 1.7Mb? If you're just looking at how much
>> memory is being used by the process, for instance, there will be a lot of
>> other things in there, also - like your code.
>> 1.7Mb for 2500 files comes out to just under 700 bytes per entry, which
>> seems rather a bit large to me. But it also depends on just how much
>> you're storing in the array (i.e. how long are your path names).
>> I also wonder why you feel a need to store so much info in memory, but
>> I'm sure you have a good reason.
>> P.S. Please don't top post. Thanks.
>
> Jerry...
>
> I use Outlook Express and it does top-posting by default. Didn't realize
> top-posting was bad.

i use oe too. just hit ctrl+end immediately after hitting 'reply group'. a
usenet thread isn't like an email conversation where both parties already
know what was said in the previous correspondence. top posting in usenet
forces *everyone* to start reading a post from the bottom up. this is
particularly painful when in-line responses are made...you have to not only
read from the bottom up, but find the start of a response, read down to see
the in-line response(s), then scroll back up past the start of that post
again.

tons of other reasons. we just ask that you know and try to follow as best
you can what usenet considers uniform/standard netiquette.

> To answer your questions:



> "Feel the Need"
> The first post shows you an array of the type of data I store. This array
> gets created for each file and added as an item to another array. In other
> words, an array of arrays. As I mentioned in a follow-up posting, the
> reason I'm doing this is because I want to do some analysis of file
> information, like comparing file times and sizes from two separate
> directories. This is much faster in memory than on disk.

ok, for the comparisons...consider speed and memory consumption. if you were
to get a list of file names, your memory consumption would be at its bare
minimum (almost). when doing the comparison, you can vastly improve your
performance *and* maintainability by iterating through the files, getting
the file info, putting that info into a db, and then running queries against
the table. the db will beat your php comparison algorithms any day of the
week. plus, sql is formalized...so everyone will understand how you are
making your comparisons.

the only way to get lower memory consumption would be, during the process of
listing files, to not store the file name at all but to put all the
information into the db immediately at that point. that is the theoretical
best combination of performance and memory utilization there can be.
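
a rough sketch of that, using sqlite via pdo (the paths, table and column
names here are all made up):

<?php
$db = new PDO('sqlite:files.db');
$db->exec('CREATE TABLE IF NOT EXISTS files (
               dir   TEXT,
               name  TEXT,
               size  INTEGER,
               mtime INTEGER)');

$ins = $db->prepare('INSERT INTO files (dir, name, size, mtime) VALUES (?, ?, ?, ?)');
foreach (array('a' => '/dir/a', 'b' => '/dir/b') as $label => $dir) {
    foreach (glob($dir . '/*') as $f) {
        if (!is_file($f)) { continue; }
        // straight into the db...nothing kept around in php
        $ins->execute(array($label, basename($f), filesize($f), filemtime($f)));
    }
}

// let sql do the comparing: same name, different size or time
$sql = "SELECT a.name, a.size AS size_a, b.size AS size_b
          FROM files a JOIN files b ON a.name = b.name
         WHERE a.dir = 'a' AND b.dir = 'b'
           AND (a.size <> b.size OR a.mtime <> b.mtime)";
foreach ($db->query($sql) as $row) {
    print_r($row);
}
?>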

btw, i posted this function in another group and someone asked today what
the hell it does. since it directly relates to what you're doing AND uses
pathinfo and fstat, which i mentioned to you briefly in this thread before,
i thought i'd post this example to help:

==============

<?php
function listFiles($path = '.', $extension = array(), $combine = false)
{
    $wd = getcwd();
    $path .= substr($path, -1) != '/' ? '/' : '';
    if (!chdir($path)){ return array(); }
    if (!$extension){ $extension = array('*'); }
    if (!is_array($extension)){ $extension = array($extension); }
    $extensions = '*.{' . implode(',', $extension) . '}';
    $files = glob($extensions, GLOB_BRACE);
    chdir($wd);
    if (!$files){ return array(); }
    $list = array();
    $path = $combine ? $path : '';
    foreach ($files as $file)
    {
        $list[] = $path . $file;
    }
    return $list;
}

$files = listFiles('c:/inetpub/wwwroot/images', 'jpg', true);
foreach ($files as $file)
{
    $fileInfo = pathinfo($file);
    $handle = fopen($file, 'r');
    $fileInfo = array_merge($fileInfo, fstat($handle));
    fclose($handle);
    // fstat() returns each value twice, under a numeric and a named key;
    // drop the numeric duplicates
    for ($i = 0; $i < 13; $i++){ unset($fileInfo[$i]); }
    echo '<pre>' . print_r($fileInfo, true) . '</pre>';
}
?>

Re: Storing file information in memory

on 16.11.2007 19:29:56 by Jerry Stuckle

deciacco wrote:
> "Jerry Stuckle" wrote in message
> news:OaSdnU7vl-_ELqDanZ2dnUVZ_hadnZ2d@comcast.com...
>> deciacco wrote:
>>> "The Natural Philosopher" wrote in message
>>> news:1195209624.8024.5@proxy00.news.clara.net...
>>>> deciacco wrote:
>>>>> thanks for the reply steve...
>>>>> basically, i want to collect the file information into memory so
>>>>> that I can then do analysis, like compare file times and sizes.
>>>>> it's much faster to do this in memory than to do it from disk.
>>>>> should have mentioned this earlier as you said...
>>>> Why do you care how much memory it takes?
>>>> 1.7MB is not very much.
>>> These days memory is not an issue, but that does not mean we shouldn't
>>> write good, efficient code that utilizes memory well.
>> There is also something known as "premature optimization".
>>> While 1.7MB is not much, that is what is generated when I look at
>>> ~2500 files. I have approximately 175000 files to look at and my
>>> script uses up about 130MB. I was simply wondering if someone out
>>> there with more experience, had a better way of doing this that would
>>> utilize less memory.
>> (Top posting fixed)
>> How are you figuring your 1.7Mb? If you're just looking at how much
>> memory is being used by the process, for instance, there will be a lot of
>> other things in there, also - like your code.
>> 1.7Mb for 2500 files comes out to just under 700 bytes per entry, which
>> seems rather a bit large to me. But it also depends on just how much
>> you're storing in the array (i.e. how long are your path names).
>> I also wonder why you feel a need to store so much info in memory, but I'm
>> sure you have a good reason.
>> P.S. Please don't top post. Thanks.
>
> Jerry...
>
> I use Outlook Express and it does top-posting by default. Didn't realize
> top-posting was bad.
>

No problem. Recommendation - get Thunderbird. Much superior, and free :-)

> To answer your questions:
>
> "Premature Optimization"
> I first noticed this problem in my first program. It was running much slower
> and taking up 5 times as much memory. I realized I needed to rethink my
> code.
>

OK, so you've identified a problem. Good.

> "Figuring Memory Use"
> To get the amount of memory used, I take a reading with memory_get_usage()
> at the start of the code in question and then take another reading at the
> end of the snippet. I then take the difference and that should give me a
> good idea of the amount of memory my code is utilizing.
>

At last - someone who knows how to figure memory usage correctly! :-)

But I'm still confused why it would take almost 700 bytes per entry on
average. The array overhead shouldn't be *that* bad.
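
A quick way to check the raw array overhead would be something like this (a
sketch; the field values are made up):

<?php
$before = memory_get_usage();
$entries = array();
for ($i = 0; $i < 2500; $i++) {
    $entries[] = array(
        'filename'    => "file$i",
        'basename'    => "file$i.ext",
        'extension'   => 'ext',
        'size'        => $i,
        'dirname'     => '/some/dir',
        'uxtimestamp' => 1195200000,
    );
}
$after = memory_get_usage();
echo 'bytes per entry: ' . round(($after - $before) / 2500) . "\n";
?>

If that number is already in the hundreds, most of the cost is PHP's
per-element bookkeeping rather than your data.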


> "Feel the Need"
> The first post shows you an array of the type of data I store. This array
> gets created for each file and added as an item to another array. In other
> words, an array of arrays. As I mentioned in a follow-up posting, the reason
> I'm doing this is because I want to do some analysis of file information,
> like comparing file times and sizes from two separate directories. This is
> much faster in memory than on disk.
>
>

Yes, it would be faster to do the comparisons in memory. However, you
also need to consider the amount of time it takes to create your arrays.
It isn't minor compared to some other operations.

When you're searching for files on the disk, as you get the file info,
the first one will take a while because the system has to (probably)
fetch the info from disk. But this caches several file entries, so the
next few will be relatively quick, until the system has to hit the disk
again (a big enough cache and that might never happen).

However, at the same time, if you just read one file from each directory
(assuming you're comparing the same file names) and compare them, then
go to the next file, the cache will still probably be valid, unless your
system is heavily loaded with high CPU and disk utilization. So in that
case your current algorithm probably will be slower than reading one at
a time and comparing.

Of course, if you're doing multiple compares, i.e. 'a' from the first
directory with 'x', 'y' and 'z' from the second directory, this wouldn't
be the case.
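
For example, a one-pair-at-a-time version might look like this (directory
names are examples, and it assumes the two directories use matching file
names):

<?php
$dirA = '/dir/a';
$dirB = '/dir/b';

foreach (scandir($dirA) as $name) {
    $a = $dirA . '/' . $name;
    $b = $dirB . '/' . $name;
    if (!is_file($a) || !is_file($b)) { continue; }  // also skips '.' and '..'
    // the stat info for both files is fetched back to back, so the cache stays warm
    if (filesize($a) != filesize($b) || filemtime($a) != filemtime($b)) {
        echo $name . " differs\n";
    }
}
?>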


--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: Storing file information in memory

on 16.11.2007 23:57:11 by Peter Paul Jansen

Jerry Stuckle wrote:
> deciacco wrote:
>> "Jerry Stuckle" wrote in message
>> news:OaSdnU7vl-_ELqDanZ2dnUVZ_hadnZ2d@comcast.com...
>>> deciacco wrote:
>>>> "The Natural Philosopher" wrote in message
>>>> news:1195209624.8024.5@proxy00.news.clara.net...
>>>>> deciacco wrote:
>>>>>> thanks for the reply steve...
>>>>>> basically, i want to collect the file information into memory so
>>>>>> that I can then do analysis, like compare file times and sizes.
>>>>>> it's much faster to do this in memory than to do it from disk.
>>>>>> should have mentioned this earlier as you said...
>>>>> Why do you care how much memory it takes?
>>>>> 1.7MB is not very much.
>>>> These days memory is not an issue, but that does not mean we shouldn't
>>>> write good, efficient code that utilizes memory well.
>>> There is also something known as "premature optimization".
>>>> While 1.7MB is not much, that is what is generated when I look at
>>>> ~2500 files. I have approximately 175000 files to look at and my
>>>> script uses up about 130MB. I was simply wondering if someone out
>>>> there with more experience, had a better way of doing this that would
>>>> utilize less memory.
>>> (Top posting fixed)
>>> How are you figuring your 1.7Mb? If you're just looking at how much
>>> memory is being used by the process, for instance, there will be a
>>> lot of other things in there, also - like your code.
>>> 1.7Mb for 2500 files comes out to just under 700 bytes per entry,
>>> which seems rather a bit large to me. But it also depends on just
>>> how much you're storing in the array (i.e. how long are your path
>>> names).
>>> I also wonder why you feel a need to store so much info in memory,
>>> but I'm sure you have a good reason.
>>> P.S. Please don't top post. Thanks.
>>
>> Jerry...
>>
>> I use Outlook Express and it does top-posting by default. Didn't
>> realize top-posting was bad.
>>
>
> No problem. Recommendation - get Thunderbird. Much superior, and free :-)
Coming to you from Thunderbird. I had given up on it since there was
some talk of discontinuing it/putting it on the back burner at Mozilla. I
got it installed and configured as a newsreader only. Pretty cool!

>
>> To answer your questions:
>>
>> "Premature Optimization"
>> I first noticed this problem in my first program. It was running much
>> slower and taking up 5 times as much memory. I realized I needed to
>> rethink my code.
>>
>
> OK, so you've identified a problem. Good.
Yeah, that was a real eye opener too. I figured I didn't need to worry. It's
PHP after all, right?

>
>> "Figuring Memory Use"
>> To get the amount of memory used, I take a reading with
>> memory_get_usage() at the start of the code in question and then take
>> another reading at the end of the snippet. I then take the difference
>> and that should give me a good idea of the amount of memory my code is
>> utilizing.
>>
>
> At last - someone who knows how to figure memory usage correctly! :-)
Thank you!

>
> But I'm still confused why it would take almost 700 bytes per entry on
> average. The array overhead shouldn't be *that* bad.
Hmm.. I will have to do some digging and try to pay closer attention.
Right now the focus was simply to get it down to a more reasonable
amount. The current solution is much faster, a few seconds instead of a
few minutes, and the memory use is much lower. If I stay in the
100,000 to 200,000 file range I will be more than fine.

>
>
>> "Feel the Need"
>> The first post shows you an array of the type of data I store. This
>> array gets created for each file and added as an item to another
>> array. In other words, an array of arrays. As I mentioned in a
>> follow-up posting, the reason I'm doing this is because I want to do
>> some analysis of file information, like comparing file times and sizes
>> from two separate directories. This is much faster in memory than on
>> disk.
>>
>>
>
> Yes, it would be faster to do the comparisons in memory. However, you
> also need to consider the amount of time it takes to create your arrays.
> It isn't minor compared to some other operations.
>
> When you're searching for files on the disk, as you get the file info,
> the first one will take a while because the system has to (probably)
> fetch the info from disk. But this caches several file entries, so the
> next few will be relatively quick, until the system has to hit the disk
> again (a big enough cache and that might never happen).
>
> However, at the same time, if you just read one file from each directory
> (assuming you're comparing the same file names) and compare them, then
> go to the next file, the cache will still probably be valid, unless your
> system is heavily loaded with high CPU and disk utilization. So in that
> case your current algorithm probably will be slower than reading one at
> a time and comparing.
>
> Of course, if you're doing multiple compares, i.a. 'a' from the first
> directory with 'x', 'y' and 'z' from the second directory, this wouldn't
> be the case.
>
>

Thanks to you and everyone else for the input on this post.