List<> versus HashSet<>: Which is more efficient for lookups on small sets?

List<> versus HashSet<>: Which is more efficient for lookups on small sets?

am 06.04.2008 17:53:04 von Jules Winfield

Assume I have a set of items, integers let's say, and I need to determine if
a particular integer exists within the set.

If the list contained 10,000 items, clearly I'd want to use some type of
dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
List<>. If the list only contained two or three items, a simple array or
List<> would probably be more efficient. Is there some general rule that
dictates the size at which one would be better off using a dictionary-based
class as opposed to an array or List<> for lookups?

Thanks,

Jules

Re: List<> versus HashSet<>: Which is more efficient for lookupson small sets?

am 06.04.2008 18:32:03 von Jeroen Mostert

Jules Winfield wrote:
> Assume I have a set of items, integers let's say, and I need to determine if
> a particular integer exists within the set.
>
> If the list contained 10,000 items, clearly I'd want to use some type of
> dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
> List<>. If the list only contained two or three items, a simple array or
> List<> would probably be more efficient. Is there some general rule that
> dictates the size at which one would be better off using a dictionary-based
> class as opposed to an array or List<> for lookups?
>
No. Profile and see. Rules of thumb can be formulated, but it's still
heavily dependent on specific use (not to mention processor architecture,
cache sizes and all sorts of other things you really don't want to be
bothering with in general). I usually put the cutoff at 100 items as an
order of magnitude approximation (i.e. if I know it will never be more than
100 I use a list), but I don't kid myself thinking that this will always
yield optimal results.

The important thing to keep in mind is scalability. Assuming that a
collection will always be small is a dangerous thing to do and requires that
one can show why not only why this is so but also why it pays off to count
on it. It's obviously much more costly to gamble wrong when you assume
collections will be small than it is to gamble wrong when you assume they'll
be big.

--
J.

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

am 06.04.2008 19:52:38 von Barry Kelly

Jules Winfield wrote:

> Assume I have a set of items, integers let's say, and I need to determine if
> a particular integer exists within the set.
>
> If the list contained 10,000 items, clearly I'd want to use some type of
> dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
> List<>. If the list only contained two or three items, a simple array or
> List<> would probably be more efficient. Is there some general rule that
> dictates the size at which one would be better off using a dictionary-based
> class as opposed to an array or List<> for lookups?

A general rule? Don't micro-optimize, and use a library collection with
the right API and, if possible, performance curve (i.e. big-Oh) for your
data set. The right API is more important: you can replace the
implementation later if measurements show that it matters.

For cases where it really does matter, you need to measure. From my
measurements (microbenchmarks with a hot cache), the switchover to
Dictionary<> or HashSet<> being more efficient than a simple array comes
pretty quickly, and they're not particularly slow to start with
(normally beating List on almost anything):

Total times for 1310720 iterations of 2 probes in collections of size 4:
List (linear scan) 0.186 sec
List (binary search) 0.231 sec
HashSet 0.140 sec
Dictionary 0.154 sec
int[] (manual inline scan) 0.109 sec
int[] (Array.IndexOf) 0.154 sec
int[] (Array.BinarySearch) 0.190 sec

Total times for 327680 iterations of 10 probes in collections of size
30:
List (linear scan) 0.412 sec
List (binary search) 0.330 sec
HashSet 0.165 sec
Dictionary 0.181 sec
int[] (manual inline scan) 0.183 sec
int[] (Array.IndexOf) 0.250 sec
int[] (Array.BinarySearch) 0.285 sec

Total times for 327680 iterations of 10 probes in collections of size
100:
List (linear scan) 0.903 sec
List (binary search) 0.392 sec
HashSet 0.170 sec
Dictionary 0.181 sec
int[] (manual inline scan) 0.338 sec
int[] (Array.IndexOf) 0.382 sec
int[] (Array.BinarySearch) 0.353 sec

Bear in mind that linear scans through an array are probably slightly
kinder on the cache than HashSet, but linear scans only scale linearly.
HashSet is probably quite a bit kinder on cache than (e.g.)
SortedDictionary, but of course, it's not sorted.

* List tests done with IndexOf & BinarySearch
* manual inline scan is a hand-coded for-loop
* The numbers above were gotten by doubling the iteration count,
(starting with 10), until the fastest time was greater than 1 ms.
* The profiling infrastructure is a constant baked into every time - it
includes calling through a delegate
* Times are best run of 5 on a Q6600 @ 2.8GHz, .NET 3.5, release mode.

-- Barry

--
http://barrkel.blogspot.com/

RE: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

am 07.04.2008 04:15:14 von stcheng

Hi Jules,

For such performance issue, it's hard to give a definite number or boundary
based on appearnce, I agree other members's suggestion on profiling the
application to get the actual result.

For general rules, if the count of elements is a small fixed number(such as
less than 100 and will not increase ), it's ok to use simple List or Array
since it just take little constant time. However, if the count may vary(in
large range) and the search is performing frequently against it, use a more
search-ready structure(such as binary sort tree or hashtable) is preferred.
I suggest you have a look at the book "Programming Pearls" which may give
you some good ideas on this.

Sincerely,

Steven Cheng

Microsoft MSDN Online Support Lead


Delighting our customers is our #1 priority. We welcome your comments and
suggestions about how we can improve the support we provide to you. Please
feel free to let my manager know what you think of the level of service
provided. You can send feedback directly to my manager at:
msdnmg@microsoft.com.

==================================================
Get notification to my posts through email? Please refer to
http://msdn.microsoft.com/subscriptions/managednewsgroups/de fault.aspx#notif
ications.

==================================================
This posting is provided "AS IS" with no warranties, and confers no rights.
--------------------
>NNTP-Posting-Date: Sun, 06 Apr 2008 10:53:04 -0500
>From: "Jules Winfield"
>Newsgroups: microsoft.public.dotnet.general
>Subject: List<> versus HashSet<>: Which is more efficient for lookups on
small sets?
>Date: Sun, 6 Apr 2008 08:53:04 -0700

>Assume I have a set of items, integers let's say, and I need to determine
if
>a particular integer exists within the set.
>
>If the list contained 10,000 items, clearly I'd want to use some type of
>dictionary class (Dictionary<> or HashSet<> or SortedList<>) instead of a
>List<>. If the list only contained two or three items, a simple array or
>List<> would probably be more efficient. Is there some general rule that
>dictates the size at which one would be better off using a
dictionary-based
>class as opposed to an array or List<> for lookups?
>
>Thanks,
>
>Jules
>
>
>

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

am 07.04.2008 09:43:23 von perseus.usenetNOSPAM

Barry Kelly wrote:

> Jules Winfield wrote:
>
> > Assume I have a set of items, integers let's say, and I need to
> > determine if a particular integer exists within the set.
> >
> > If the list contained 10,000 items, clearly I'd want to use some
> > type of dictionary class (Dictionary<> or HashSet<> or
> > SortedList<>) instead of a List<>. If the list only contained two
> > or three items, a simple array or List<> would probably be more
> > efficient. Is there some general rule that dictates the size at
> > which one would be better off using a dictionary-based class as
> > opposed to an array or List<> for lookups?
>
> A general rule? Don't micro-optimize, and use a library collection
> with the right API and, if possible, performance curve (i.e. big-Oh)
> for your data set. The right API is more important: you can replace
> the implementation later if measurements show that it matters.

Indeed, one should go for the construct with the best O() performance.
HashSet has O(1) and List has O(n) for looking up an element, so
that's pretty clear. What I find interesting is that your benchmarks
don't show this. ->

> For cases where it really does matter, you need to measure. From my
> measurements (microbenchmarks with a hot cache), the switchover to
> Dictionary<> or HashSet<> being more efficient than a simple array
> comes pretty quickly, and they're not particularly slow to start with
> (normally beating List on almost anything):
>
> Total times for 1310720 iterations of 2 probes in collections of size
> 4: List (linear scan) 0.186 sec
> List (binary search) 0.231 sec
> HashSet 0.140 sec
> Dictionary 0.154 sec
> int[] (manual inline scan) 0.109 sec
> int[] (Array.IndexOf) 0.154 sec
> int[] (Array.BinarySearch) 0.190 sec

In a way this is a bit surprising. Especially since a List uses
internally an int[] construct to store the elements.

So my question then is: what exactly did you lookup and where were
these values located in the set of 4 possible elements? O(1) is real
time, not amortized, for HashSet, but your test suggests otherwise
(amortized). If your tests in general had the element to find in the
first half of the linear list for example, it's not a fair benchmark.

FB
--
------------------------------------------------------------ ------------
Lead developer of LLBLGen Pro, the productive O/R mapper for .NET
LLBLGen Pro website: http://www.llblgen.com
My .NET blog: http://weblogs.asp.net/fbouma
Microsoft MVP (C#)
------------------------------------------------------------ ------------

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

am 07.04.2008 11:22:09 von Barry Kelly

Frans Bouma [C# MVP] wrote:

> Barry Kelly wrote:
>
> > For cases where it really does matter, you need to measure. From my
> > measurements (microbenchmarks with a hot cache), the switchover to
> > Dictionary<> or HashSet<> being more efficient than a simple array
> > comes pretty quickly, and they're not particularly slow to start with
> > (normally beating List on almost anything):
> >
> > Total times for 1310720 iterations of 2 probes in collections of size
> > 4: List (linear scan) 0.186 sec
> > List (binary search) 0.231 sec
> > HashSet 0.140 sec
> > Dictionary 0.154 sec
> > int[] (manual inline scan) 0.109 sec
> > int[] (Array.IndexOf) 0.154 sec
> > int[] (Array.BinarySearch) 0.190 sec
>
> In a way this is a bit surprising. Especially since a List uses
> internally an int[] construct to store the elements.
>
> So my question then is: what exactly did you lookup and where were
> these values located in the set of 4 possible elements? O(1) is real
> time, not amortized, for HashSet, but your test suggests otherwise
> (amortized).

For small values of n, the fixed overhead dominates, I would think. For
larger values of n, some cache effects should start coming in.

> If your tests in general had the element to find in the
> first half of the linear list for example, it's not a fair benchmark.

Sure, for a very small set of probes (2 in this case), the data will
probably be skewed - but the total time shouldn't be high in any case.
We are talking >1 million iterations in a fifth of a second or so.

The code was using new Random(42).Next(dataset_count) as its series for
numbers to probe for. Here's the full code I used:

---8<---
using System;
using System.Collections.Generic;
using System.Linq;
using System.Diagnostics;

class App
{
const int ProfileIterCount = 10;
const int ProfileTestCount = 5;
const double MinTime = 0.1;

static double Profile(Action code, int iterCount)
{
Stopwatch w = Stopwatch.StartNew();
for (int i = 0; i < iterCount; ++i)
code();
return w.ElapsedTicks / (double) Stopwatch.Frequency;
}

static double Profile(Action setup, Action code, int iterCount)
{
double bestResult = Double.PositiveInfinity;

for (int i = 0; i < ProfileTestCount; ++i)
{
if (setup != null)
setup();
double result = Profile(code, iterCount);
if (result < bestResult)
bestResult = result;
}

return bestResult;
}

class ProfileItem
{
public string Name { get; set; }
public Action Code { get; set; }
public Action Setup { get; set; }
public double BestTime { get; set; }

// Time will be at least 0.1 seconds, therefore we can get away
// with three digits.
public override string ToString()
{
return string.Format("{0,-35} {1,7:F3} sec", Name,
BestTime);
}
}

static int Profile(params ProfileItem[] items)
{
int iterCount = ProfileIterCount;

too_fast:
foreach (var item in items)
{
item.BestTime = Profile(item.Setup, item.Code, iterCount);
if (item.BestTime < MinTime)
{
iterCount *= 2;
goto too_fast;
}
}

return iterCount;
}

static void Main(string[] args)
{
try
{
int itemCount = args.Length > 0 ? int.Parse(args[0]) : 5;
int probeCount = args.Length > 1 ? int.Parse(args[1])
: itemCount / 2;
List intList = null;
HashSet intHash = null;
Dictionary intDict = null;
int[] intArray = null;
Random r = null;

ProfileItem[] items = new[]
{
new ProfileItem()
{
Name = "List (linear scan)",
Setup = delegate
{
intList = new List(
Enumerable.Range(0, itemCount));
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intList.Count);
if (!intList.Contains(test))
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "List (binary search)",
Setup = delegate
{
intList = new List(
Enumerable.Range(0, itemCount));
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intList.Count);
if (intList.BinarySearch(test) < 0)
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "HashSet",
Setup = delegate
{
intHash = new HashSet(
Enumerable.Range(0, itemCount));
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intHash.Count);
if (!intHash.Contains(test))
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "Dictionary",
Setup = delegate
{
intDict = new Dictionary();
for (int i = 0; i < itemCount; ++i)
intDict.Add(i, i);
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intDict.Count);
if (!intDict.ContainsKey(test))
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "int[] (manual inline scan)",
Setup = delegate
{
intArray = new int[itemCount];
for (int i = 0; i < itemCount; ++i)
intArray[i] = i;
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intArray.Length);
int foundIndex = -1;
for (int j = 0; j < intArray.Length; ++j)
if (intArray[j] == test)
{
foundIndex = j;
break;
}
if (foundIndex < 0)
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "int[] (Array.IndexOf)",
Setup = delegate
{
intArray = new int[itemCount];
for (int i = 0; i < itemCount; ++i)
intArray[i] = i;
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intArray.Length);
int foundIndex =
Array.IndexOf(intArray, test);
if (foundIndex < 0)
Console.WriteLine("bad news");
}
},
},

new ProfileItem()
{
Name = "int[] (Array.BinarySearch)",
Setup = delegate
{
intArray = new int[itemCount];
for (int i = 0; i < itemCount; ++i)
intArray[i] = i;
r = new Random(42);
},
Code = delegate
{
for (int i = 0; i < probeCount; ++i)
{
int test = r.Next(intArray.Length);
int foundIndex =
Array.BinarySearch(intArray, test);
if (foundIndex < 0)
Console.WriteLine("bad news");
}
},
},
};

int count = Profile(items);
Console.WriteLine(
"Total times for {0} iterations of {1} probes in collections of" +
" size {2}:",
count, probeCount, itemCount);
for (int i = 0; i < items.Length; ++i)
Console.WriteLine(items[i]);
}
catch (Exception ex)
{
Console.Error.WriteLine(ex.Message);
}
}
}
--->8---

It's a micro-benchmark with an overhead in its measuring, so the usual
caveats apply.

Usage: <.exe> [item-count] [probe-count]

-- Barry

--
http://barrkel.blogspot.com/

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

am 07.04.2008 11:39:37 von Barry Kelly

Frans Bouma [C# MVP] wrote:

> Barry Kelly wrote:
> > Total times for 1310720 iterations of 2 probes in collections of size
> > 4:

I ran this again, with ranking (for comparison with later run):

---8<---
Total times for 1310720 iterations of 2 probes in collections of size 4:
List (linear scan) 0.174 sec
List (binary search) 0.213 sec
HashSet 0.133 sec
Dictionary 0.142 sec
int[] (manual inline scan) 0.102 sec
int[] (Array.IndexOf) 0.157 sec
int[] (Array.BinarySearch) 0.187 sec

Ranking:
int[] (manual inline scan) 0% slower than best
HashSet 30% slower than best
Dictionary 39% slower than best
int[] (Array.IndexOf) 54% slower than best
List (linear scan) 71% slower than best
int[] (Array.BinarySearch) 84% slower than best
List (binary search) 109% slower than best
--->8---

> So my question then is: what exactly did you lookup and where were
> these values located in the set of 4 possible elements? O(1) is real
> time, not amortized, for HashSet, but your test suggests otherwise
> (amortized). If your tests in general had the element to find in the
> first half of the linear list for example, it's not a fair benchmark.

Here's another run using more probes, to even out the distribution; it
does change things, but not by a whole lot:

---8<---
Total times for 40960 iterations of 100 probes in collections of size 4:
List (linear scan) 0.259 sec
List (binary search) 0.308 sec
HashSet 0.190 sec
Dictionary 0.203 sec
int[] (manual inline scan) 0.141 sec
int[] (Array.IndexOf) 0.244 sec
int[] (Array.BinarySearch) 0.277 sec

Ranking:
int[] (manual inline scan) 0% slower than best
HashSet 35% slower than best
Dictionary 44% slower than best
int[] (Array.IndexOf) 73% slower than best
List (linear scan) 84% slower than best
int[] (Array.BinarySearch) 97% slower than best
List (binary search) 119% slower than best
--->8---

The ranking stays the same, and the percentage differences aren't that
much different either.

Added ranking code (for the program I posted earlier):

---8<---
Console.WriteLine();
Console.WriteLine("Ranking:");
items = items.OrderBy(item => item.BestTime).ToArray();
foreach (var item in items)
{
Console.WriteLine("{0,-37} {1,5:F0}% slower than best",
item.Name,
(item.BestTime - items[0].BestTime) * 100 /
items[0].BestTime);
}
--->8---

-- Barry

--
http://barrkel.blogspot.com/

Re: List<> versus HashSet<>: Which is more efficient for lookups on small sets?

am 07.04.2008 11:46:49 von Barry Kelly

Barry Kelly wrote:

> int[] (manual inline scan) 0% slower than best

For "slower", read "excess time over best as a percentage of best time".

-- Barry

--
http://barrkel.blogspot.com/