which is the better option for directory hashing to store large number of image files?

on 17.09.2007 09:09:14 by theCancerus

Hi All,

I am not sure if this is the right place to ask this question, but I am
sure some of you have faced this problem. I have already found some
posts related to it, but not the answer I am looking for.

My problem is that I have to upload images and store them. I am using
the filesystem for that.

The setup is something like this: there will be items/groups/users, and
each can have up to 6 images, which need to be scaled to 4 different
sizes, i.e. every item can have up to 24 images of varying sizes.

The standard way of storing these files would be to store them in
subdirectories based on some hash.

My partial solution is to split the four types of files into four
fixed base folders, one for each dimension.

Since the filename is in the format "YmdHis", I decided to use the
directory structure Y/m/d/, but I realize that even this could be
inefficient.

So now I am thinking about going further by creating a Y/m/d/H/i/
directory structure.

My question is how to go about creating subdirectories below the base
folders. Will my scheme hold, or should I use an md5 hash of the
filename, as suggested by others, then take 2-3 characters of it, create
one or two levels of directory structure, and store the files there?

Regards,
Amit

Re: which is the better option for directory hashing to store large number of image files?

on 17.09.2007 13:49:35 by Jerry Stuckle

I use databases for this.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: which is the better option for directory hashing to store large number of image files?

on 17.09.2007 15:26:05 by NoDude

I personally use something like /images/front/controller/row_id/ -
that way I only need to store the name of the image.

Re: which is the better option for directory hashing to store large number of image files?

on 17.09.2007 16:15:48 by Shelly

I didn't understand what you were asking at first, but I think I do now.
What I would do in your case is use a combination of file structure and
database entries. The first question you need to ask yourself is how you
will typically be accessing these files. For example, I store a list of
images for a given order. In that case, I create a folder under photos with
the name of the order number. I then place the images in that folder -- but
then I only plan to access them via order number. That is simple. What you
seem to need is multiple ways of finding the files. They might be
between certain dates, certain owners, certain names, etc. In that case you
would want to put all those as fields in a database table and have the full
file name (including path) in another field. You would search the database
however you wish and that would yield [near] immediate access to the file
location.
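
A minimal sketch of that hybrid idea, assuming Python with its built-in sqlite3 module rather than whatever database the site actually uses. The table name, column names, and sample paths are all hypothetical; the point is only that the searchable properties live in indexed columns while the file itself stays on disk.

```python
import sqlite3

# Hypothetical schema: none of these names come from the thread.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE images (
           id       INTEGER PRIMARY KEY,
           owner    TEXT NOT NULL,
           name     TEXT NOT NULL,
           uploaded TEXT NOT NULL,  -- ISO 8601 date, so string order matches date order
           path     TEXT NOT NULL   -- full file name, including path
       )"""
)
# Index the columns that will be searched on.
conn.execute("CREATE INDEX idx_images_owner ON images (owner)")
conn.execute("CREATE INDEX idx_images_uploaded ON images (uploaded)")

sample_rows = [
    ("amit", "beach.jpg", "2007-09-15", "/photos/a1/b2/beach.jpg"),
    ("amit", "hills.jpg", "2007-09-17", "/photos/c3/d4/hills.jpg"),
    ("shelly", "order.jpg", "2007-09-17", "/photos/e5/f6/order.jpg"),
]
conn.executemany(
    "INSERT INTO images (owner, name, uploaded, path) VALUES (?, ?, ?, ?)",
    sample_rows,
)


def find_paths(conn, owner, start, end):
    """Return the stored paths for one owner within an inclusive date range."""
    cur = conn.execute(
        "SELECT path FROM images "
        "WHERE owner = ? AND uploaded BETWEEN ? AND ? "
        "ORDER BY uploaded",
        (owner, start, end),
    )
    return [row[0] for row in cur]


paths = find_paths(conn, "amit", "2007-09-16", "2007-09-30")
print(paths)
```

With this layout the directory scheme on disk can be anything, since the lookup never has to scan the filesystem.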

Moral: Programming, as well as life, is not always an either-or. Sometimes
a compromise/hybrid is the best solution.

--
Shelly

Re: which is the better option for directory hashing to store large number of image files?

on 17.09.2007 16:46:36 by Steve

> Moral: Programming, as well as life, is not always an either-or.
> Sometimes a compromise/hybrid is the best solution.
>
> --
> Shelly

ahhh, but shelly, the thing i like most is that in programming, it is always
either/or: on/off. to say otherwise is to not know programming. the same
holds true for life. you either do or do not. any notions about the nobility
or superiority of human action in his contemplation of life are simply
false, save the fact that there is none of either. do or do not is all that
remains and that directly linked to his own survivability - as is the
impetus of all animals.

compromise. chuckle.

Re: which is the better option for directory hashing to store large number of image files?

on 17.09.2007 19:02:57 by Shelly

"Steve" wrote in message
news:3rwHi.805$3C.788@newsfe05.lga...
>> Moral: Programming, as well as life, is not always an either-or.
>> Sometimes a compromise/hybrid is the best solution.
>>
>> --
>> Shelly
>
> ahhh, but shelly, the thing i like most is that in programming, it is
> always either/or: on/off. to say otherwise is to not know programming. the
> same holds true for life. you either do or do not. any notions about the
> nobility or superiority of human action in his contemplation of life are
> simply false, save the fact that there is none of either. do or do not is
> all that remains and that directly linked to his own survivability - as is
> the impetous of all animals.
>
> compromise. chuckle.

So, I take it that if you are fed a meal which is a wonderfully prepared, 10
pound, filet mignon you either (a) eat all of it or (b) eat none of it?

or,

If you are faced with a court appearance for excessive speeding in your car
you should either be acquitted or should get the death sentence?

On one project about 25 years ago I needed to modify a very large
application that was written in Fortran. I needed dynamic allocation.
According to you, I should have been faced with two choices. One was to
emulate dynamic allocation by setting aside a large part of memory and doing
my own allocation from that memory heap. A second would have been to
totally rewrite that entire (largggggeeeee) application in C. I chose a
"compromise". I wrote a small module in C and used that in conjunction with
the rest of the Fortran code.

The point here is that there are two extremes in handling his situation.
Either avoid a database and just use the file system, or avoid the file
system and put all of the contents of the file into a blob field in the
database. Often, the better way is to use the database as a rapid search
engine for a file in the file system.

I guess you aren't married? I have been for over four decades. Believe me,
"all or nothing" just doesn't work. Even with a switch for the lights you
can always add a dimmer.

By the way, I have been programming for over forty years. We are not
talking ones and zeros, true or false, here. We are talking design
philosophy -- and that is usually a compromise among various alternatives to
achieve the most efficient results in the shortest time for the least cost.

Shelly

OT, but fun ;^)

on 17.09.2007 19:41:18 by Steve

"Shelly" wrote in message
news:13etcs11ug57rb6@corp.supernews.com...
>
> "Steve" wrote in message
> news:3rwHi.805$3C.788@newsfe05.lga...
>>> Moral: Programming, as well as life, is not always an either-or.
>>> Sometimes a compromise/hybrid is the best solution.
>>>
>>> --
>>> Shelly
>>
>> ahhh, but shelly, the thing i like most is that in programming, it is
>> always either/or: on/off. to say otherwise is to not know programming.
>> the same holds true for life. you either do or do not. any notions about
>> the nobility or superiority of human action in his contemplation of life
>> are simply false, save the fact that there is none of either. do or do
>> not is all that remains and that directly linked to his own
>> survivability - as is the impetous of all animals.
>>
>> compromise. chuckle.
>
> So, I take it that if you fed a meal which is a wonderfully prepared, 10
> pound, filet mignon you either (a) eat all of it or (b) eat none of it?

no, i'd eat enough so that i was sustained - not so much that i could not
defend myself if attacked, or so much that i could not drink, or so much
that i could not shelter myself. i would eat what was appropriate for my
survival. if it were rotted, yet wonderfully prepared, i probably wouldn't
eat it because i would become ill.

all of which affects my survivability.

> or,
>
> If you are faced with a court appearance for excessive speeding in your
> car you should either be acquitted or should get the death sentence?

i should not speed if i don't like the consequences.

however, your example is a complete non sequitur, as my appearance in court
is not tied to the judgement in the sentence. but in order to indulge, if
the court deems acquittal or death, it will do so based on the circumstances
and how my actions affected the survivability (well being) of the group
under which the judge(s) serve(s).

> On one project about 25 years ago I needed to modify a very large
> application that was written in Fortran. I needed dynamic allocation.
> According to you, I should have been faced with two choices. One was to
> emulate dynamic allocation by setting aside a large part of memory and
> doing my own allocation from that memory heap. A second would have been
> to totally rewrite that entire (largggggeeeee) application in C. I chose
> a "compromise". I wrote a small module in C and used that in conjunction
> with the rest of the Fortran code.

according to me? your options are your options. you made a choice. that
choice did not involve programming. it involved architecture. if you chose
to emulate dynamic allocation, you would have done so concretely and there
would be no compromise, no choice in how that code was interpreted by the
server. your instructions would have been "either or", not "when you feel
like it". even bugs or the omission of logic are concrete and predictable if
the inputs are known. 3/4 of the code i write (or don't write, specifically)
are from logical omissions; handling only what i must in order to get inputs
where they can either be thrown out or processed.

> The point here is that there are two extremes in handling his situation.
> Either avoid a database and just use the file system, or avoid the file
> system and put all of the contents of the file into a blob field in the
> database. Often, the better way is to use the database as a rapid search
> engine for a file in the file system.

choices, whether deemed extreme or simple, are still just options. when you
program, you do so concretely.

> I guess you aren't married? I have been for over four decades. Believe
> me, "all or nothing" just doesn't work. Even with a swich for the lights
> you can always add a dimmer.

my marital status has no bearing on my thought processes. if you've
"compromised" on who you are or in what you believe because you decided to
take a spouse, you ought to have demanded more from your spouse...and your
life.

again though, my choices (all of them) should be concrete regardless of how
many options there are. whether i account my spouse into the equation of
which i shall select, the one chosen will most definitely be from
selfishness born of survival - what is in my best interest. hell, "selfless"
acts are the most overtly selfish acts of all, endearing the actor to his
society and thus making his likelihood to survive all the more certain - and
if dead because of such an act, marked in that culture's history...extending
his 'life' much further than if he'd have led a 'normal' life.

> By the way, I have been programming four over forty years. We are not
> talking ones and zeros, true or false, here. We are talking design
> philosophy -- and that if usually a compromise among various alternatives
> to achieve the most efficient results in the shortest time for the least
> cost.

oh but we are talking about ones and zeros. that's programming. design is
about options, not the act of programming itself. but just like design and
life, there are always alternatives. whatever the context, seeing the
presence or blending of options as a compromise is a faulty
premise/perspective, one from which the best advantages thereof are often
overlooked.

my point: programming is vastly different than life. it is completely black
and white, sharing only with it an array of perspectives from which it will
be engaged...ultimately leaving a single mark in one of two states; a one or
a zero, do or do not.

Re: which is the better option for directory hashing to store large number of image files?

on 17.09.2007 20:29:09 by Andy Hassall

On Mon, 17 Sep 2007 00:09:14 -0700, theCancerus wrote:

>My problem is that i have to upload images and store them. I am using
>filesystem for that.
>
>setup is something like this, their will be items/groups/user each can
>have upto 6 images which needs to be scaled to 4 different sizes ie
>every item can have upto 24 images of varying sizes.
>
>now the standard way of storing these files would be to store them in
>subdirectories based on some hash.
>
>my partial solution is to split the four types of files into four
>fixed base folders for each dimension,
>
>since filename is in format "YmdHis" i decided to use directory
>structure as Y/m/d/.
>but i realize that even this could be inefficient.
>
>so now i am thinking about going one more level by creating Y/m/d/H/i/
> directory structure.
>
>now my question is how to go about creating subdirectories below base
>folders, will my scheme hold or should i use md5 hash as suggested by
>others, over the filename and then take 2-3 characters and create one
>or two level of directory structure and then store the files?

Splitting the files by date (down to whatever resolution) is potentially still
susceptible to a large number arriving at the same time, and ending up with a
large number of files in a single directory. If the goal is to spread the files
across a number of directories, then you probably want the value that
determines the directories to be approximately randomly distributed, and to
have a bounded and reasonable number of possible directory names.
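
To make the burst problem concrete, here is a rough sketch in Python (chosen only for illustration). It assumes a hypothetical burst of 6000 uploads inside a single minute and made-up file names, and compares how many files land in the fullest directory under a Y/m/d/H/i scheme versus a scheme keyed on the first two hex characters of an md5 of the name.

```python
import hashlib
from collections import Counter
from datetime import datetime, timedelta


def date_dir(timestamp):
    """Directory chosen by upload time, down to the minute (Y/m/d/H/i)."""
    return timestamp.strftime("%Y/%m/%d/%H/%M")


def hash_dir(filename):
    """Directory chosen by the first two hex characters of md5(filename)."""
    return hashlib.md5(filename.encode("utf-8")).hexdigest()[:2]


# Hypothetical burst: 100 uploads per second for 60 seconds, all in one minute.
start = datetime(2007, 9, 17, 9, 9, 0)
uploads = [
    (start + timedelta(seconds=i // 100), "img%05d.jpg" % i) for i in range(6000)
]

by_date = Counter(date_dir(ts) for ts, _ in uploads)
by_hash = Counter(hash_dir(name) for _, name in uploads)

largest_date_dir = max(by_date.values())
largest_hash_dir = max(by_hash.values())
print("fullest date-based directory:", largest_date_dir)
print("fullest hash-based directory:", largest_hash_dir)
```

Every file in the burst ends up in the same date-based directory, while the hash-based scheme has at most 256 directories to share them between, whatever the arrival pattern.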

md5 of some property (name? or even contents?) likely fits this reasonably
well. The number of bytes you use for subdirectories depends on however many
images you have. If you don't actually expose the
hash-used-for-storage-directory in the URL, then you're free to re-hash the
images' directories if you end up needing more levels to split the directories
(if it was in the URL, then it would change the URLs of all your images, which
is something to be avoided).
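
As a sketch of how such a mapping might look (in Python, purely for illustration), the helper below takes the md5 of the file name and uses leading hex characters as directory names. The function name, its parameters, and the example base folder are all hypothetical. Keeping the mapping in one function that the URL layer never sees is what leaves room to change the number of levels later.

```python
import hashlib
import os


def storage_path(base_dir, filename, levels=2, width=2):
    """Build base_dir/<hash part>/.../filename from the md5 of the file name.

    levels: how many nested hash directories to create.
    width:  how many hex characters each directory name uses.
    """
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    parts = [digest[i * width:(i + 1) * width] for i in range(levels)]
    return os.path.join(base_dir, *parts, filename)


# Example: one of the four per-size base folders from the original post.
path = storage_path("/data/thumb", "20070917090914.jpg")
print(path)
```

Before writing a file, the directory part can be created with os.makedirs(os.path.dirname(path), exist_ok=True). Re-hashing with a different levels or width value means moving the files, but no public URL has to change as long as the hash never appears in it.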

Substrings of just the name may work as well, although there could be a bias
to particular letters or numbers depending on where the names come from and
what language they're in.

There's more than one way to do it, as ever, and the way to go depends on what
exactly you're doing. Have you checked whether your initial assumption is true,
though? Whilst "large number of entries in a directory is slow" is true in many
filesystems, it's not a universal truth. What's the threshold for your
filesystem, and are you planning on getting anywhere close to it in the
foreseeable future? (after overestimating it a bit to be safely pessimistic)

--
Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool

Re: which is the better option for directory hashing to store large number of image files?

on 17.09.2007 22:38:34 by Jerry Stuckle

Shelly wrote:
> "Steve" wrote in message
> news:3rwHi.805$3C.788@newsfe05.lga...
>>> Moral: Programming, as well as life, is not always an either-or.
>>> Sometimes a compromise/hybrid is the best solution.
>>>
>>> --
>>> Shelly
>> ahhh, but shelly, the thing i like most is that in programming, it is
>> always either/or: on/off. to say otherwise is to not know programming. the
>> same holds true for life. you either do or do not. any notions about the
>> nobility or superiority of human action in his contemplation of life are
>> simply false, save the fact that there is none of either. do or do not is
>> all that remains and that directly linked to his own survivability - as is
>> the impetous of all animals.
>>
>> compromise. chuckle.
>
> So, I take it that if you fed a meal which is a wonderfully prepared, 10
> pound, filet mignon you either (a) eat all of it or (b) eat none of it?
>

(a). (b) is not even an option!

> or,
>
> If you are faced with a court appearance for excessive speeding in your car
> you should either be acquitted or should get the death sentence?
>

No, but I should either be acquitted or found guilty. And if found
guilty, I should receive the appropriate punishment. The death sentence
is not appropriate for all infractions.

> On one project about 25 years ago I needed to modify a very large
> application that was written in Fortran. I needed dynamic allocation.
> According to you, I should have been faced with two choices. One was to
> emulate dynamic allocation by setting aside a large part of memory and doing
> my own allocation from that memory heap. A second would have been to
> totally rewrite that entire (largggggeeeee) application in C. I chose a
> "compromise". I wrote a small module in C and used that in conjunction with
> the rest of the Fortran code.
>

What is your point?

> The point here is that there are two extremes in handling his situation.
> Either avoid a database and just use the file system, or avoid the file
> system and put all of the contents of the file into a blob field in the
> database. Often, the better way is to use the database as a rapid search
> engine for a file in the file system.
>

Sure, there are extremes. But have you actually tried storing the data
in a blob field and tuning your database for it? I thought not. Access
is quite fast - virtually always faster than a mix of the two, because
you don't have to make both a database and a file system call. Less
overhead - the database returns the blob just as effectively as it does
a file name.

> I guess you aren't married? I have been for over four decades. Believe me,
> "all or nothing" just doesn't work. Even with a swich for the lights you
> can always add a dimmer.
>

Sure it does. If I don't let my wife have her own way ALL the time, I
get "nothing". :-)

> By the way, I have been programming four over forty years. We are not
> talking ones and zeros, true or false, here. We are talking design
> philosophy -- and that if usually a compromise among various alternatives to
> achieve the most efficient results in the shortest time for the least cost.
>
> Shelly
>
>

Sure we are. Everything in programming comes down to ones and zeros.
It's just the approach to getting there that differs.

--
==================
Remove the "x" from my email address
Jerry Stuckle
JDS Computer Training Corp.
jstucklex@attglobal.net
==================

Re: which is the better option for directory hashing to store large number of image files?

on 18.09.2007 07:26:12 by theCancerus

Hi Andy,

Thanks for the sensible reply.
We need to upload around 2.5 million images as seed data for the
website. We are using a Linux system (CentOS), so any idea what would be
a reasonable number of files per directory?

Unless thousands of users want to upload images at the same time, I
am sure it will never happen that there is a large number of files in
one directory every minute.

Anyway, I have decided to go with MD5, as a 3/3 letter combination gives
me a good spread for a long time :)
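
For a rough sense of scale, the arithmetic can be sketched in Python. This assumes that "3/3" means two directory levels of three hex characters each, that the hash spreads files evenly, and that each base folder holds about 2.5 million files; none of these is confirmed in the thread.

```python
HEX_DIGITS = 16  # an md5 hex digest uses the characters 0-9 and a-f


def leaf_directories(levels, width):
    """Number of possible lowest-level directories for a hex-based scheme."""
    return (HEX_DIGITS ** width) ** levels


def average_files_per_leaf(total_files, levels, width):
    """Average files per lowest-level directory, assuming an even spread."""
    return total_files / leaf_directories(levels, width)


SEED_IMAGES = 2_500_000

for levels, width in [(1, 2), (2, 2), (1, 3), (2, 3)]:
    leaves = leaf_directories(levels, width)
    average = average_files_per_leaf(SEED_IMAGES, levels, width)
    print(
        "levels=%d width=%d: %10d directories, about %.2f files each"
        % (levels, width, leaves, average)
    )
```

Under those assumptions a 3/3 split allows 16,777,216 leaf directories, so 2.5 million files average well under one file per directory, whereas a 2/2 split allows 65,536 directories at roughly 38 files each. Which trade-off suits the site depends on how far the collection is expected to grow.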

Re: which is the better option for directory hashing to store large number of image files?

on 18.09.2007 21:42:36 by Andy Hassall

On Tue, 18 Sep 2007 05:26:12 -0000, theCancerus wrote:

>On Sep 17, 11:29 pm, Andy Hassall wrote:
>>
>> There's more than one way to do it, as ever, and the way to go depends on what
>> exactly you're doing. Have you checked whether your initial assumption is true,
>> though? Whilst "large number of entries in a directory is slow" is true in many
>> filesystems, it's not a universal truth. What's the threshold for your
>> filesystem, and are you planning on getting anywhere close to it in the
>> forseeable future? (after overestimating it a bit to be safely pessimistic)
>>
>thanks for sensible reply.
>we need to upload around 2.5 million images as seed data for the
>website. we are using linux system(centos ) so any ideas what would be
>the reasonable number of files per directory?

So, you're probably using the ext3 filesystem? This has an option for "hashed
b-tree" storage of directory entries, which helps with the
large-number-of-files issue (at least, the relevant part of it - obviously it
still takes a while to iterate through them all, but accessing one file that
you already know the filename of doesn't have the same problems as older
filesystems that do a linear scan every time).

On my CentOS system:

# tune2fs -l /dev/mapper/VolGroup00-LogVol00 | grep features
Filesystem features: has_journal ext_attr resize_inode dir_index filetype
needs_recovery sparse_super large_file

The "dir_index" option says it's turned on for me, and I didn't change it, so
it must be the default.

I don't know what the limits of this are, though.

--
Andy Hassall :: andy@andyh.co.uk :: http://www.andyh.co.uk
http://www.andyhsoftware.co.uk/space :: disk and FTP usage analysis tool