fastest searchable datastructure?
on 16.01.2008 14:20:36 by Pieter
Hi,
I need some type of array/list/... in which I can store objects together
with a unique key. The most important thing is performance: I will constantly
be doing searches in the list for given keys. Which data structure will be
best suited for this? A Hashtable? The list may contain up to 10^12 items,
but most of the time it will probably be around 10^9 or even 10^6...
Thanks a lot in advance,
Pieter
Re: fastest searchable datastructure?
on 16.01.2008 14:33:47 by LT_stop
10^12 items?
Maybe SQL Server would be the best choice for that quantity of data.
Otherwise you might consider using a red-black tree:
http://www.jot.fm/issues/issue_2005_03/column6/
"Pieter" ha scritto nel messaggio
news:OHd%23$LEWIHA.5448@TK2MSFTNGP04.phx.gbl...
> Hi,
>
> I need some type of array/list/... in which I can store objects together
> with a unique key. The most important thing is performance: I will constantly
> be doing searches in the list for given keys. Which data structure will be
> best suited for this? A Hashtable? The list may contain up to 10^12 items,
> but most of the time it will probably be around 10^9 or even 10^6...
>
> Thanks a lot in advance,
>
>
> Pieter
>
Re: fastest searchable datastructure?
on 16.01.2008 15:34:26 by Pieter
Thanks both for your answers!
Actually, it's not a typo: it could theoretically go up to 6*10^12 :-)
But as I said, in practice I expect around 10^6...
I will indeed use SQL Server when the amount gets too big... But in case it
stays under 10^6: will a Dictionary be better than a Hashtable? Or
something else?
"Marc Gravell" wrote in message
news:e0a26adb-39fe-4600-bdf4-261730e28d81@h11g2000prf.googlegroups.com...
>> I need some type of array/list/... in which I can store objects together
>> with a unique key.
> Sounds like Dictionary<TKey, TValue>...
>
>> The list may contain up to 10^12
> Seriously? You do realise that even at one byte per item, with no
> index overhead, padding, etc., that's a TB?
>
> I'm going to assume that is a typo - but even so, for large numbers
> you may be better off using a database approach, with a non-clustered
> unique index on this field (only; don't span - let it use a bookmark
> lookup to get the value) and ideally with the index on a different
> file-group to the main data.
>
> Marc
RE: fastest searchable datastructure?
on 16.01.2008 15:39:00 by pbromberg
Do you know how much memory 10^12 of these "items" is going to require?
Because if you run out of physical RAM, your whole process will either blow
up or slow down dramatically due to paging.
So the best alternative may be a database to hold your items. A search on a
table with a clustered index on a primary key in this case will be the
fastest.
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
MetaFinder: http://www.blogmetafinder.com
"Pieter" wrote:
> Hi,
>
> I need some type of array/list/... in which I can store objects together
> with a unique key. The most important thing is performance: I will constantly
> be doing searches in the list for given keys. Which data structure will be
> best suited for this? A Hashtable? The list may contain up to 10^12 items,
> but most of the time it will probably be around 10^9 or even 10^6...
>
> Thanks a lot in advance,
>
>
> Pieter
>
>
>
Re: fastest searchable datastructure?
on 16.01.2008 15:58:01 by Pieter
Hehe, I do know :-)
The thing is: it will be for an experiment, and not every possibility will
occur as often as the others. So I definitely want to put the most popular
ones in some kind of local cache... See it as keeping the first block of a
B-tree in RAM...
"Peter Bromberg [C# MVP]" wrote in message
news:E081A83D-31AB-4309-A50A-DCEEAD905AD2@microsoft.com...
> Do you know how much memory 10^12 of these "items" is going to require?
> Because if you run out of physical RAM, your whole process will either
> blow
> up or slow down dramatically due to paging.
> So the best alternative may be a database to hold your items. A search on
> a
> table with a clustered index on a primary key in this case will be the
> fastest.
> -- Peter
> Site: http://www.eggheadcafe.com
> UnBlog: http://petesbloggerama.blogspot.com
> MetaFinder: http://www.blogmetafinder.com
>
>
> "Pieter" wrote:
>
>> Hi,
>>
>> I need some type of array/list/... in which I can store objects together
>> with a unique key. The most important thing is performance: I will constantly
>> be doing searches in the list for given keys. Which data structure will be
>> best suited for this? A Hashtable? The list may contain up to 10^12 items,
>> but most of the time it will probably be around 10^9 or even 10^6...
>>
>> Thanks a lot in advance,
>>
>>
>> Pieter
>>
>>
>>
Re: fastest searchable datastructure?
on 16.01.2008 16:04:35 by lasse
Pieter wrote:
> Thanks both for your answers!
>
> Actually, it's not a typo: it could theoretically go up to 6*10^12 :-)
> But as I said, in practice I expect around 10^6...
>
> I will indeed use SQL Server when the amount gets too big... But in case it
> stays under 10^6: will a Dictionary be better than a Hashtable? Or
> something else?
>
> "Marc Gravell" wrote in message
> news:e0a26adb-39fe-4600-bdf4-261730e28d81@h11g2000prf.googlegroups.com...
>>> I need some type of array/list/... in which I can store objects together
>>> with a unique key.
>> Sounds like Dictionary<TKey, TValue>...
>>
>>> The list may contain up to 10^12
>> Seriously? You do realise that even at one byte per item, with no
>> index overhead, padding, etc., that's a TB?
>>
>> I'm going to assume that is a typo - but even so, for large numbers
>> you may be better off using a database approach, with a non-clustered
>> unique index on this field (only; don't span - let it use a bookmark
>> lookup to get the value) and ideally with the index on a different
>> file-group to the main data.
>>
>> Marc
>
>
I would carefully avoid solutions that require you to write the same
type of code multiple times, one type for < 10^6, one for <10^9 and one
for <10^12. If you need as many as 10^12, use a database.
If you absolutely want to draw the best performance out of everything,
abstract the storage of this list out into a new class, so that you can
inherit from it for an in-memory data structure and inherit from it for
a database structure.
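
A rough sketch of what that abstraction might look like (using an interface
rather than a base class; all the names here are invented for the example):

using System.Collections.Generic;

// Hypothetical abstraction over "where the items live".
public interface IStateStore<TKey, TValue>
{
    bool TryGet(TKey key, out TValue value);
    void Put(TKey key, TValue value);
}

// In-memory implementation backed by a Dictionary (hash table).
public class MemoryStateStore<TKey, TValue> : IStateStore<TKey, TValue>
{
    private readonly Dictionary<TKey, TValue> _items =
        new Dictionary<TKey, TValue>();

    public bool TryGet(TKey key, out TValue value)
    {
        return _items.TryGetValue(key, out value);
    }

    public void Put(TKey key, TValue value)
    {
        _items[key] = value;
    }
}

// A database-backed class would implement the same interface and issue a
// parameterized SELECT/INSERT instead, so the calling code never changes.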
--
Lasse Vågsæther Karlsen
mailto:lasse@vkarlsen.no
http://presentationmode.blogspot.com/
PGP KeyID: 0xBCDEA2E3
Re: fastest searchable datastructure?
on 16.01.2008 16:08:34 by Pieter
Thanks Lasse. What it actually comes down to is: every object will be another
possible state, but some states will be needed much more often than others. So
the 'popular' ones will be put in a local cache. It's the structure of
that cache that I'm worrying about: as maybe 80% of the searches will happen
there, it should be as fast as possible. So: which structure should I use?
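
For what it's worth, the cache itself can be as simple as a size-capped
Dictionary; a minimal sketch (names invented, and with no eviction policy
beyond "stop adding when full"):

using System.Collections.Generic;

// Minimal "popular states" cache: hash-based lookup with a size cap.
public class StateCache<TKey, TValue>
{
    private readonly int _capacity;
    private readonly Dictionary<TKey, TValue> _map;

    public StateCache(int capacity)
    {
        _capacity = capacity;
        _map = new Dictionary<TKey, TValue>(capacity);
    }

    public bool TryGet(TKey key, out TValue value)
    {
        return _map.TryGetValue(key, out value);   // O(1) on average
    }

    public void Add(TKey key, TValue value)
    {
        if (_map.Count < _capacity)
            _map[key] = value;   // only cache while there is room
    }
}

Anything that misses the cache would then fall through to the database.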
"Lasse Vågsæther Karlsen" wrote in message
news:%23DjAODFWIHA.4712@TK2MSFTNGP04.phx.gbl...
> I would carefully avoid solutions that require you to write the same type
> of code multiple times, one type for < 10^6, one for <10^9 and one for
> <10^12. If you need as many as 10^12, use a database.
>
> If you absolutely want to draw the best performance out of everything,
> abstract the storage of this list out into a new class, so that you can
> inherit from it for an in-memory data structure and inherit from it for a
> database structure.
>
> --
> Lasse Vågsæther Karlsen
> mailto:lasse@vkarlsen.no
> http://presentationmode.blogspot.com/
> PGP KeyID: 0xBCDEA2E3
Re: fastest searchable datastructure?
on 16.01.2008 16:35:37 by Alberto Poblacion
"Pieter" wrote in message
news:OHd%23$LEWIHA.5448@TK2MSFTNGP04.phx.gbl...
> I need some type of array/list/... in which I can store objects together
> with a unique key. The most important thing is performance: I will constantly
> be doing searches in the list for given keys. Which data structure will be
> best suited for this? A Hashtable? The list may contain up to 10^12 items,
> but most of the time it will probably be around 10^9 or even 10^6...
You will get the fastest performance with a hash table. Assuming that
you choose a good algorithm to assign the hash values, the hash table has
the advantage that the average number of accesses needed to find a key
does not depend on the size of the table, but only on the percentage of the
table that is full. With a table that is not terribly full (say 80 or 90%, I
don't remember the exact figures), the average number of accesses is one point
something. This beats a search on a B-tree, which requires a number of
accesses that grows with the logarithm of the size of the table. Note that
I'm speaking about the *average* number of accesses. The *worst-case*
scenario is horribly bad, since it would require a full scan of the hash
table. Fortunately, this is extremely improbable, provided that the
hashing algorithm and the collision-handling algorithm are properly chosen.
If you are going to deal with in-memory elements using .Net, you can use
a Dictionary<TKey, TValue>, which uses a hash table internally.
You will need to use 64-bit code (and run it on a huge machine) if you want
to address 10^12 elements. If you need to store your data on disk, a
properly programmed hashing algorithm against a flat file can outperform a
database server, which uses trees to store its indices.
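
As a small illustration of the in-memory case (the key and value types here
are arbitrary), a Dictionary keyed on a 64-bit identifier gives
average-constant-time lookups regardless of how many entries it holds:

using System;
using System.Collections.Generic;

class LookupDemo
{
    static void Main()
    {
        // 64-bit keys leave room for a very large key space, even though
        // available memory limits how many entries you can actually hold.
        Dictionary<long, string> items = new Dictionary<long, string>();
        items[123456789012L] = "some value";

        string value;
        if (items.TryGetValue(123456789012L, out value))
        {
            Console.WriteLine(value);   // average cost is independent of Count
        }
    }
}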
RE: fastest searchable datastructure?
on 16.01.2008 16:59:00 by pbromberg
I'd think that a Generic List or Generic Dictionary would be faster as there
would be no boxing / unboxing involved in adding or retrieving typed elements.
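
To illustrate the boxing point with value types (a sketch using int keys and
values):

using System.Collections;
using System.Collections.Generic;

class BoxingDemo
{
    static void Main()
    {
        // Non-generic Hashtable: the int key and value are boxed to object.
        Hashtable table = new Hashtable();
        table[42] = 7;                    // two boxing allocations
        int fromTable = (int)table[42];   // unboxing cast on the way out

        // Generic Dictionary: key and value stay as int, no boxing.
        Dictionary<int, int> dict = new Dictionary<int, int>();
        dict[42] = 7;
        int fromDict = dict[42];
    }
}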
-- Peter
Site: http://www.eggheadcafe.com
UnBlog: http://petesbloggerama.blogspot.com
MetaFinder: http://www.blogmetafinder.com
"Pieter" wrote:
> Hi,
>
> I need some type of array/list/... in which I can store objects together
> with a unique key. The most important thing is performance: I will constantly
> be doing searches in the list for given keys. Which data structure will be
> best suited for this? A Hashtable? The list may contain up to 10^12 items,
> but most of the time it will probably be around 10^9 or even 10^6...
>
> Thanks a lot in advance,
>
>
> Pieter
>
>
>
Re: fastest searchable datastructure?
on 16.01.2008 18:27:27 by notmyfirstname
Pieter,
In the past there was somebody from Belgium active in this newsgroup who
always answered this question with:
SortedList
http://msdn2.microsoft.com/en-us/library/system.collections.sortedlist.aspx
I have no experience with it myself.
Cor
Re: fastest searchable datastructure?
on 16.01.2008 18:46:24 by Kevin Spencer
A Hashtable would be faster in terms of searching, as long as one was
searching by a key value.
--
HTH,
Kevin Spencer
Chicken Salad Surgeon
Microsoft MVP
"Peter Bromberg [C# MVP]" wrote in message
news:E9AA9507-D257-421F-A894-185ED54A0BFF@microsoft.com...
> I'd think that a Generic List or Generic Dictionary would be faster as
> there
> would be no boxing / unboxing involved in adding or retrieving typed
> elements.
> -- Peter
> Site: http://www.eggheadcafe.com
> UnBlog: http://petesbloggerama.blogspot.com
> MetaFinder: http://www.blogmetafinder.com
>
>
> "Pieter" wrote:
>
>> Hi,
>>
>> I need some type of array/list/... in which I can store objects together
>> with a unique key. The most important thing is performance: I will constantly
>> be doing searches in the list for given keys. Which data structure will be
>> best suited for this? A Hashtable? The list may contain up to 10^12 items,
>> but most of the time it will probably be around 10^9 or even 10^6...
>>
>> Thanks a lot in advance,
>>
>>
>> Pieter
>>
>>
>>
Re: fastest searchable datastructure?
on 16.01.2008 22:37:14 by msdn
Hello Pieter,
Balena did some tests and posted the results on his website; however, with
the transition from VB2themax to dotnet2themax the article seems to have
disappeared.
The result was that the hashtable was, and still is, the fastest key/value
pair object: the larger the amount of data, the more efficient the hashtable
was compared to the other methods, for both insertion and retrieval of the
key/value pairs (Balena timed both the time it took to build up the data
structure and the time it took to retrieve n elements). However, with the
huge amounts you are talking about, a database would be a much better idea.
HTH
Michel
"Pieter" schreef in bericht
news:OHd%23$LEWIHA.5448@TK2MSFTNGP04.phx.gbl...
> Hi,
>
> I need some type of array/list/... in which I can store objects together
> with a unique key. The most important thing is performance: I will constantly
> be doing searches in the list for given keys. Which data structure will be
> best suited for this? A Hashtable? The list may contain up to 10^12 items,
> but most of the time it will probably be around 10^9 or even 10^6...
>
> Thanks a lot in advance,
>
>
> Pieter
>
Re: fastest searchable datastructure?
on 16.01.2008 22:59:55 by msdn
Found some numbers (I needed to do some translation).
The test was with 100,000 items.
Compared were ArrayList, Hashtable and SortedList.
The ArrayList was 4 times faster than the Hashtable, and the Hashtable was
8-100 times faster than the SortedList.
If you write some comparison routines yourself, please don't forget to set
the compile options in VB.Net: remove overflow checks and enable
optimizations. Otherwise we get trolling messages here again that C# is so
much faster than VB.Net; with those options set the same as they are in C#,
the two should run at the same speed.
"Pieter" schreef in bericht
news:OHd%23$LEWIHA.5448@TK2MSFTNGP04.phx.gbl...
> Hi,
>
> I need some type of array/list/... in which I can store objects together
> with a unique key. The most important thing is performance: I will constantly
> be doing searches in the list for given keys. Which data structure will be
> best suited for this? A Hashtable? The list may contain up to 10^12 items,
> but most of the time it will probably be around 10^9 or even 10^6...
>
> Thanks a lot in advance,
>
>
> Pieter
>
Re: fastest searchable datastructure?
on 17.01.2008 09:02:10 by lasse
Pieter wrote:
> Thanks Lasse. What it actually comes down to is: every object will be another
> possible state, but some states will be needed much more often than others. So
> the 'popular' ones will be put in a local cache. It's the structure of
> that cache that I'm worrying about: as maybe 80% of the searches will happen
> there, it should be as fast as possible. So: which structure should I use?
>
> "Lasse Vågsæther Karlsen" wrote in message
> news:%23DjAODFWIHA.4712@TK2MSFTNGP04.phx.gbl...
>> I would carefully avoid solutions that require you to write the same type
>> of code multiple times, one type for < 10^6, one for <10^9 and one for
>> <10^12. If you need as many as 10^12, use a database.
>>
>> If you absolutely want to draw the best performance out of everything,
>> abstract the storage of this list out into a new class, so that you can
>> inherit from it for an in-memory data structure and inherit from it for a
>> database structure.
>>
>> --
>> Lasse Vågsæther Karlsen
>> mailto:lasse@vkarlsen.no
>> http://presentationmode.blogspot.com/
>> PGP KeyID: 0xBCDEA2E3
>
>
If you have a key and need the object related to that key, use a
Dictionary<TKey, TValue> for this; it uses a hash table internally and is
generally the fastest for this kind of query.
If you need lookups of the "in the neighbourhood of" kind, then you need
something that keeps the values in sorted order; a tree or similar would
be good here, I think.
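
A small sketch of the difference (keys invented; the lower-bound search is
hand-rolled, since SortedList only exposes an exact IndexOfKey):

using System;
using System.Collections.Generic;

class NeighbourhoodDemo
{
    // Index of the first key >= target in a sorted key list (binary search).
    static int LowerBound(IList<long> keys, long target)
    {
        int lo = 0, hi = keys.Count;
        while (lo < hi)
        {
            int mid = (lo + hi) / 2;
            if (keys[mid] < target) lo = mid + 1;
            else hi = mid;
        }
        return lo;   // equals keys.Count when target is beyond the last key
    }

    static void Main()
    {
        // Exact lookups: Dictionary, average O(1).
        Dictionary<long, string> exact = new Dictionary<long, string>();
        exact[100] = "state 100";

        // "Neighbourhood" lookups: SortedList keeps its keys in sorted order.
        SortedList<long, string> sorted = new SortedList<long, string>();
        sorted[10] = "a";
        sorted[20] = "b";
        sorted[35] = "c";

        int i = LowerBound(sorted.Keys, 22);
        Console.WriteLine(sorted.Keys[i]);   // prints 35, the nearest key >= 22
    }
}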
--
Lasse Vågsæther Karlsen
mailto:lasse@vkarlsen.no
http://presentationmode.blogspot.com/
PGP KeyID: 0xBCDEA2E3
Re: fastest searchable datastructure?
on 17.01.2008 11:39:04 by Marc Gravell
> The ArrayList was 4 times faster than the Hashtable
The other question would be: "at what, exactly?"
If the timings include the insertion time, then yes: a hash table
(Hashtable or Dictionary<TKey, TValue>) will take longer: it needs to
obtain hashes, build buckets, etc. (an ArrayList or List<T> just has to
copy an array a few times). But the *read* performance should then be
/significantly/ better, assuming that the items are being fetched by
key (and not by looping).
Without inspectable, reproducible code I'm going to take this with a
pinch of salt...
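
For anyone who wants to measure this themselves, a deliberately naive sketch
of the two lookup styles (sizes and types are arbitrary) could look like:

using System;
using System.Collections.Generic;
using System.Diagnostics;

class LookupTiming
{
    static void Main()
    {
        const int n = 10000;   // arbitrary size for the sketch

        Dictionary<int, int> dict = new Dictionary<int, int>();
        List<KeyValuePair<int, int>> list = new List<KeyValuePair<int, int>>();
        for (int i = 0; i < n; i++)
        {
            dict[i] = i;
            list.Add(new KeyValuePair<int, int>(i, i));
        }

        long sum = 0;

        // Fetch every item by key via the hash table.
        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
            sum += dict[i];
        sw.Stop();
        Console.WriteLine("Dictionary lookups: " + sw.ElapsedMilliseconds + " ms");

        // Fetch every item "by key" via a linear scan of the list.
        sw = Stopwatch.StartNew();
        for (int i = 0; i < n; i++)
        {
            foreach (KeyValuePair<int, int> kv in list)
            {
                if (kv.Key == i) { sum += kv.Value; break; }
            }
        }
        sw.Stop();
        Console.WriteLine("List scans:         " + sw.ElapsedMilliseconds + " ms");

        Console.WriteLine(sum);   // keep the loop results observable
    }
}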
Marc
Re: fastest searchable datastructure?
on 17.01.2008 19:28:41 by msdn
Before anyone takes this out of perspective:
1. These tests were done by Balena, not by me. The results and code
are no longer on the website, but I have read on a Dutch website that
you can find them in his 2003 book (I have the 2002 and 2005 versions).
2. The conclusion was that the hashtable was the fastest when values were
retrieved by key; however, an ArrayList was faster when looping
through the values.
"Although statistics might tell you that the Amazon river's average
depth is 50 cm, this does not mean that you can walk to the other side."
Michel
"Marc Gravell" schreef in bericht
news:%23ym20SPWIHA.5984@TK2MSFTNGP06.phx.gbl...
>> The ArrayList was 4 times faster than the Hashtable
> The other question would be: "at what, exactly?"
>
> If the timings include the insertion time, then yes: a hash table
> (Hashtable or Dictionary<TKey, TValue>) will take longer: it needs to
> obtain hashes, build buckets, etc. (an ArrayList or List<T> just has to
> copy an array a few times). But the *read* performance should then be
> /significantly/ better, assuming that the items are being fetched by key
> (and not by looping).
>
> Without inspectable, reproducible code I'm going to take this with a pinch
> of salt...
>
> Marc
>
Re: fastest searchable datastructure?
on 17.01.2008 20:11:18 by skeet
Michel Posseth [MCP] wrote:
> Before anyone takes this out of perspective:
>
> 1. These tests were done by Balena, not by me. The results and code
> are no longer on the website, but I have read on a Dutch website that
> you can find them in his 2003 book (I have the 2002 and 2005 versions).
>
> 2. The conclusion was that the hashtable was the fastest when values were
> retrieved by key; however, an ArrayList was faster when looping
> through the values.
Right. That makes a lot more sense. As far as I'm aware, this thread is
about looking up by key, at which point the hashtable should win
easily.
--
Jon Skeet -
http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
World class .NET training in the UK: http://iterativetraining.co.uk
Re: fastest searchable datastructure?
on 17.01.2008 21:43:26 by msdn
IMHO
The Hashtable will not work for the OP, as the per-object limit in .Net is
2 GB; the hashtable approach will blow up with an out-of-memory exception
with the amounts the OP wants to store.
So the only way that will probably work is the database approach that was
also proposed in this thread.
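
If someone really did want to keep such a collection in memory on 64-bit, one
workaround (a sketch, not something tested in this thread) is to shard the
keys over several dictionaries so that no single object grows towards that
2 GB limit:

using System.Collections.Generic;

// Spreads entries over several Dictionary instances so that no single
// internal array has to grow towards the per-object 2 GB limit.
public class ShardedDictionary<TValue>
{
    private readonly Dictionary<long, TValue>[] _shards;

    public ShardedDictionary(int shardCount)
    {
        _shards = new Dictionary<long, TValue>[shardCount];
        for (int i = 0; i < shardCount; i++)
            _shards[i] = new Dictionary<long, TValue>();
    }

    private Dictionary<long, TValue> ShardFor(long key)
    {
        // Non-negative shard index derived from the key.
        return _shards[(int)((ulong)key % (ulong)_shards.Length)];
    }

    public void Add(long key, TValue value)
    {
        ShardFor(key)[key] = value;
    }

    public bool TryGetValue(long key, out TValue value)
    {
        return ShardFor(key).TryGetValue(key, out value);
    }
}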
Michel
"Jon Skeet [C# MVP]" schreef in bericht
news:MPG.21f9b8ac1fed835a7c2@msnews.microsoft.com...
> Michel Posseth [MCP] wrote:
>> Before anyone takes this out of perspective:
>>
>> 1. These tests were done by Balena, not by me. The results and code
>> are no longer on the website, but I have read on a Dutch website that
>> you can find them in his 2003 book (I have the 2002 and 2005 versions).
>>
>> 2. The conclusion was that the hashtable was the fastest when values were
>> retrieved by key; however, an ArrayList was faster when looping
>> through the values.
>
> Right. That makes a lot more sense. As far as I'm aware, this thread is
> about looking up by key, at which point the hashtable should win
> easily.
>
> --
> Jon Skeet -
> http://www.pobox.com/~skeet Blog: http://www.msmvps.com/jon.skeet
> World class .NET training in the UK: http://iterativetraining.co.uk
RE: fastest searchable datastructure?
on 18.01.2008 16:04:02 by TheSilverHammer
Dictionaries are close to O(1) which is as fast as you can get.
However, as others have pointed out, 10^6 is a lot of space, never mind 10^9.
If you are looking at 32 bit pointers (references) you are using 4 megs just
for the references to be held. 8 megs at 64 bit. Exactly how big are your
objects? At 10^6 you can say you will need one MEGABYTE of RAM per BYTE of
size for each of your objects. That is, an object whose size is 20 bytes will
cost you 20 megs of RAM to hold 10^6 of them.
I would seriously look at your data architecture and see if you can lower
the number of instances you need to hold at once. The only application I
can think of that would need such a huge quantity of data might be a game
terrain database. In those cases you really do not need all of the terrain
loaded at once and can section them off to blocks. In the cases where you do
need all that data available at once, you end up shrinking the map size to
something reasonable or implementing a lot of compression.
Re: fastest searchable datastructure?
on 19.01.2008 02:00:19 by Bill McCarthy
"Jon Skeet [C# MVP]" wrote in message
news:1ed2d31f-3d49-4998-9901-8e4e5b8a5a27@e23g2000prf.googlegroups.com...
>
> I've got a datastructure of 100 million records in memory. I probably
> shouldn't say what it's for, but it basically ends up being a 10,000
> squared lookup, with a single byte result.
>
> Doing this in memory (for the low, low cost of 100M of memory) gave me
> a ~100,000% performance improvement over a database lookup :)
>
That of course would not be a searchable data structure, more like an indexed
array where the location is calculated by offset ;) If it were a key per
item, then each item would require at least a 4-byte key, making the
in-memory size at least 500M.
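
In other words, something along these lines (a sketch using the
10,000 x 10,000, one-byte-result figures quoted above):

using System;

class OffsetLookup
{
    const int Size = 10000;

    // ~100 MB flat array: one byte per (a, b) combination.
    static readonly byte[] Table = new byte[Size * Size];

    static byte Lookup(int a, int b)
    {
        // The "key" (a, b) is turned directly into an array offset,
        // so no searching is involved at all.
        return Table[a * Size + b];
    }

    static void Main()
    {
        Console.WriteLine(Lookup(1234, 5678));
    }
}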
Re: fastest searchable datastructure?
on 24.01.2008 03:04:53 by Alex Clark
I know it's not always good netiquette to pry into others' projects in these
newsgroups, but is anyone else really interested to know what this actual
application is for...?
"Pieter" wrote in message
news:OHd%23$LEWIHA.5448@TK2MSFTNGP04.phx.gbl...
> Hi,
>
> I need some type of array/list/... in which I can store objects together
> with a unique key. The most important thing is performance: I will constantly
> be doing searches in the list for given keys. Which data structure will be
> best suited for this? A Hashtable? The list may contain up to 10^12 items,
> but most of the time it will probably be around 10^9 or even 10^6...
>
> Thanks a lot in advance,
>
>
> Pieter
>
Re: fastest searchable datastructure?
on 24.01.2008 19:33:03 by Pieter
Oh, it's good to have a healthy curiosity ;-)
It's a Reinforcement Learning project: Artificial Intelligence.
I'm making an agent that can play a board game using RL. To do this, it must
learn from what it did:
- the agent ends up in different states
- every state has different actions
- the agent must choose between these actions
- at the end he gets his reward (win or lose); this reward is given to the
actions he chose.
By learning (exploration) and exploitation he must maximize his wins :-)
The whole data structure is needed to put my states in: I must be able to
check very fast whether he has already been in a given state :-)
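
For that kind of "have I seen this state before?" check, a Dictionary keyed
on some encoding of the board state is probably the simplest thing that could
work; a tiny sketch (the string encoding here is invented just for the
example):

using System;
using System.Collections.Generic;

class SeenStates
{
    static void Main()
    {
        // Key = some encoding of the board state (a string here, purely for
        // illustration; a packed long or custom struct would also work).
        Dictionary<string, int> visitCount = new Dictionary<string, int>();

        string state = "X.O|.X.|O..";   // made-up board encoding

        int count;
        if (visitCount.TryGetValue(state, out count))
            visitCount[state] = count + 1;   // state seen before
        else
            visitCount[state] = 1;           // first visit

        Console.WriteLine(visitCount[state]);
    }
}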
It has to be finished by Monday evening ;-)
"Alex Clark" wrote in message
news:%23sNaG2iXIHA.1212@TK2MSFTNGP05.phx.gbl...
>
> I know it's not always good netiquette to pry into others' projects in
> these newsgroups, but is anyone else really interested to know what this
> actual application is for...?
>