Question about session-cache algorithms

on 10.10.2002 00:54:21 by nyh

Hi. I've been looking at the algorithms modssl uses for managing the SSL
session cache in shared memory, and specifically at the newer "cyclic
buffer" algorithm.

I have a question/idea/rambling about it; it's going to be a little long,
sorry ;)

I believe the cyclic-buffer design is very good, especially its natural
handling of expiration and overflow of the cache before sessions expire.
The time-complexity of adding new sessions to the cache, as well as that of
expiring old ones or making more room when needed, is O(1) (throughout, N
denotes the number of sessions in the cache), and these operations always succeed.

However, it appears to me that the session-cache lookup is *not* O(1),
and I've been wondering if this is considered acceptable, e.g., because
other Apache and hardware constraints limit the server's scalability anyway?

As I mention below, the current lookup algorithm is indeed very good for
all "normal" session-cache sizes, so my question is more theoretical, about
how well it will scale as very-high-end servers get better specialized
hardware, have more memory, and expect to handle higher loads.

If I understand correctly, the lookup works by searching the index of a whole
"division" (when a cache is full, 1/256 of the N entries or more, usually
around sqrt(N) entries), one by one, until a match for the second byte is
found (at which point the whole session id is compared).
For N < 65000 or so, this means (again, if I understand correctly) a lookup
length of around sqrt(N)/2. For larger N, the lookup length is worse, around
N/256/2, with the search becoming even slower because the second-byte
comparisons start yielding more false positives at this point.

As always, O(sqrt(N)) or O(N) complexities become worse as N increases.
When N=65536, and the cache is full, the average number of comparisons for
a session-id lookup is (if I'm calculating correctly) 128. When N=131072,
the average number of comparisons for a lookup grows to 256, and the fact
that only 2 bytes are used for the quick comparison will start taking its
additional toll.

To put the numbers in perspective (and to show that the example Ns above
are atypically large), a session cache of size 500Kbytes is typically
used in Apache, making room for only about N=3000 session-cache entries.
In this case, the typical number of one-byte comparisons a lookup needs is
roughly 32 (if I'm calculating this correctly). This is certainly not
bad, and should give excellent results for this algorithm (as indeed I
see people are reporting).

While N=3000 is typical, N=65536 is probably not because it results in a
very big cache, about 10MB. N=131072, 20MB, is probably considered
ridiculously large. Is this why such N's were considered uninteresting
and this lookup algorithm was deemed good enough in all practical cases?

If anyone is wondering why I'm thinking about this issue, and why someone
may want more than the 500Kbyte session-cache (N=3000 sessions), think about
the following calculation:

Let's say that from a bit of research you find that most browsers will reuse
a session within two minutes of its first creation. If you want the session
cache to hold each new session for two minutes (120 seconds), you have to
hope that you will not get more than 25 (=3000/120) new sessions (i.e.,
new users) per second.
For most moderately-loaded servers with ordinary hardware, this assumption
is good - the CPU might not even be able to handle much more than 25 RSA
operations per second, and 25 new sessions a second translates to 2 million
SSL sessions a day - a very busy server indeed.
But what about a very-high-end server with an "SSL acceleration" card (a
card that, among other things, does RSA operations in hardware)? Such cards
are typically rated for doing 1,000, or even much more, RSA operations per
second, meaning that the server can theoretically now easily handle, for
example, 500 new sessions per second (I'm saying theoretically, because in
real life Apache might not be able to handle such a load because of limits
on the number of processes, memory, and so on).
500 new sessions per second, each of which we hope to keep for 120 seconds,
dictates a session cache with 60,000 entries, for example.
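
The back-of-the-envelope formula I'm using here is simply the following
(nothing mod_ssl-specific, and the function name is made up):

    /* required cache entries = new sessions per second * seconds each
       session should stay cached; e.g. 500 * 120 = 60000 */
    static unsigned int required_entries(unsigned int new_sessions_per_sec,
                                         unsigned int lifetime_sec)
    {
        return new_sessions_per_sec * lifetime_sec;
    }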

Anyway, if such huge session caches are ever of interest, perhaps the
lookup algorithm should be replaced by something else, perhaps an additional
simple hash table using the session-id as key (no need for a hash function
other than a modulo)?
I'm not talking about a complicated generic hash table like the "ht" session
cache - I'm talking about a simple array of indices pointing into positions of
the cyclic buffer, where each entry in that buffer would also hold a "next"
index to maintain its bucket's chain. If we assume that no "subcache"
will hold more than 65536 entries, we can use 2-byte indices for all those
"pointers", and such a hash table's overhead would be as low as 4-6 bytes
per session in the cache. A slight computational overhead would also be
incurred for updating the hash-table structure on changes to the cyclic
buffer.
I've implemented such a data structure once (for holding a cache of recently
seen packet numbers to avoid acting on a retransmitted packet twice), and it
seemed to work quite well.
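
To make the idea a bit more concrete, here is a rough sketch of the kind of
index I mean (all names, sizes and the choice of hashed bytes are only
illustrative - this is not code from mod_ssl or from my old implementation):

    #include <stdint.h>
    #include <string.h>

    #define CAPACITY 4096      /* assumed maximum entries per subcache     */
    #define BUCKETS  4096      /* bucket count, here simply == capacity    */
    #define NIL      0xFFFF    /* "no entry" marker for the 2-byte indices */

    typedef struct {
        unsigned char id[32];  /* in a real layered design the id would live
                                  in the cyclic buffer itself; it is kept
                                  here only so the sketch is self-contained */
        uint16_t next;         /* next entry in the same bucket, or NIL    */
        /* ... offset/length of the session data in the cyclic buffer ...  */
    } entry_t;

    typedef struct {
        uint16_t bucket[BUCKETS];   /* head entry per bucket, or NIL       */
        entry_t  entry[CAPACITY];
    } index_t;

    /* the "hash" is just a modulo over a word taken from the session id   */
    static unsigned int hash_id(const unsigned char *id)
    {
        uint32_t w;
        memcpy(&w, id + 4, sizeof w);
        return w % BUCKETS;
    }

    /* returns an entry index (which points into the cyclic buffer),
       or -1 if the session is not cached */
    static int index_lookup(const index_t *ix, const unsigned char *id)
    {
        uint16_t i = ix->bucket[hash_id(id)];
        while (i != NIL) {
            if (memcmp(ix->entry[i].id, id, sizeof ix->entry[i].id) == 0)
                return (int)i;
            i = ix->entry[i].next;
        }
        return -1;
    }

The 4-6 bytes of per-session overhead come from the 2-byte "next" field plus
(amortized) the 2-byte bucket head, with a little slack if you want more
buckets than entries.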

I'm guessing that because the current lookup algorithm is quite good
for all typical session-cache sizes, it might not make much sense to
modify the algorithm now, but perhaps it's worth thinking about for the
future.

What do you think?


--
Nadav Har'El | Wednesday, Oct 9 2002, 4 Heshvan 5763
nyh@math.technion.ac.il |-----------------------------------------
Phone: +972-53-245868, ICQ 13349191 |Disclaimer: The opinions expressed above
http://nadav.harel.org.il |are not my own.

Re: Question about session-cache algorithms

on 10.10.2002 19:31:47 by Geoff Thorpe

Hi there,

I suppose as the author of the implementation you're dissecting I should
probably respond in some way :-)

I'm in agreement with most of your comments and appreciate you having
taken a look and provided some analysis. I'll respond to a couple of
specific points but will mostly make some general comments afterwards
that I think will address most points anyway.

> However, it appears to me that the session-cache lookup is *not*
> O(1), and I've been wondering if this is considered acceptable, e.g.,
> because other Apache and hardware constraints limit the server's
> scalability anyway?

If you check the "shmht" implementation (which is what "shmcb" was
created to replace) you'll see that a linear search was being conducted
on the *entire* cache, and for each lookup an ASN1 decode was being
performed. This ASN1 decode operation, repeated across hundreds of
sessions, is orders of magnitude more computationally expensive
than the lookup in shmcb and was actually large enough to significantly
drag down the server's SSL/TLS processing capacity/sec. When a cache
lookup starts to compare to the overhead of an RSA/DSA private key
operation, you've got serious trouble. I'm quietly confident that you
could add a "for(loop=0;loop<1000;loop++){}" loop around the cache
lookup function in shmcb and you'd still not notice the slightest
difference in the SSL/TLS performance of apache. So the objective
wasn't really to create a caching system with the lowest-possible order
of complexity, it was to create a caching system in which the cache
operations were utterly insignificant compared to SSL/TLS PKC crypto
operations, as they should always have been considering the HTTPS
scenario. If you understand SSL/TLS, and in particular the RSA/DSA
intricacies, you'll know that it is utterly unforgivable that something
as simple as a key-value cache should introduce overheads that show up
in the profiling compared to 1024-bit RSA private key operations and
other SSL/TLS processing.

Anyway, more about scalability later.

> As I mention below, the current lookup algorithm is indeed very good
> for all "normal" session-cache sizes, so my question is more
> theoretical, about how well it will scale as very-high-end servers
> get better specialized hardware, have more memory, and expect to
> handle higher loads.

This is where I think my general comments will address your concerns.

> If I understand correctly, the lookup works by searching the index of
> a whole "division" (when a cache is full, 1/256 of the N entries or
> more, usually around sqrt(N) entries), one by one, until a match for
> the second byte is found (at which point the whole session id is
> compared).
> For N < 65000 or so, this means (again, if I understand correctly) a
> lookup length of around sqrt(N)/2. For larger N, the lookup length is
> worse, around N/256/2, with the search becoming even slower because
> the second-byte comparisons start yielding more false positives at
> this point.

Actually, as the cache is partitioned into subcaches - proportional to
sqrt(N) of them (though generally fewer of them than sqrt(N)) - your
lookups drill into the correct subcache in a single comparison of order
O(1) (you mask the first byte of the session_id). So, you've got a
subcache of order O(sqrt(N)) to "search" rather than N (I think you
muddled this up in your analysis). The second-byte tests in the
subcache iteration are so considerably faster than an ASN1 decode
function call that we can generally ignore them for complexity
arguments. So you will likely ASN1 decode (sqrt(N)/256)/2 sessions (ie.
sqrt(N)/512 of them) during a lookup in a full (sub)cache. This is less
than the N/512 you suggested because you forgot the subcaching, so I
think that throws out some of your estimates, though OTOH your
estimates serve as a 1/256th estimate on shmht performance. :-)
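
In code terms, the shape of the lookup is roughly the following (a much
simplified sketch for the sake of the argument, not the actual shmcb source;
the names and sizes are illustrative):

    #define NUM_SUBCACHES 32            /* a power of two, chosen at init   */
    #define SUBCACHE_MASK (NUM_SUBCACHES - 1)

    typedef struct {
        unsigned int num_entries;
        unsigned char second_byte[512]; /* one cheap index byte per entry   */
        /* ... per-entry offsets into the cyclic data buffer, expiries ...  */
    } subcache_t;

    static subcache_t subcaches[NUM_SUBCACHES];

    /* Returns the index of the next candidate whose cheap one-byte test
       matches, or -1.  Only candidates go on to the expensive full
       session-id comparison and ASN1 decode, done by the caller. */
    static int find_candidate(const unsigned char *id, unsigned int start)
    {
        /* O(1): mask a byte of the session id to pick the subcache */
        const subcache_t *sc = &subcaches[id[0] & SUBCACHE_MASK];
        unsigned int i;

        for (i = start; i < sc->num_entries; i++)
            if (sc->second_byte[i] == id[1])    /* 1-in-256 filter */
                return (int)i;
        return -1;
    }

The iteration only ever walks one small subcache and only compares single
bytes; the expensive work happens once per candidate, and candidates are rare.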

Anyway, you need a large N and a full cache for those subcache lookups
to expect *any* false-positives on average, but you can certainly
expect far fewer than you'd get on a linear search over the entire
cache without optimisations like the second-byte test. As for lowering
the number of false-positives in the lookup as you seem to be
suggesting - that's not that hard; rather than taking the second byte
as an optimisation, we could take the fifth, sixth, seventh and eighth
bytes (ie. the second word) and memcpy() them into a "second word"
index that could be tested in lookups. This would reduce the proportion
of false positives from 1/256 to (1/256)^4 which is 1/4294967296. In
other words, you'd get a false positive once every few years if you're
unlucky. However, again - the cache overheads in shmcb are already low
enough that processing in the SSL/TLS stack (even if *all* crypto is
offloaded to a impossibly zero-overhead zero-latency hardware driver)
will always dominate your profile, so eeking a few cycles of
optimisation out of the cache logic (and introducing bug-prone
complexities in to the code) is probably not worth the effort.
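
For what it's worth, that hypothetical "second word" index would amount to
little more than this (again just a sketch of the idea, not something that
exists in shmcb; the names are made up):

    #include <stdint.h>
    #include <string.h>

    /* one 4-byte tag per index entry, filled in when a session is stored,
       e.g.:  memcpy(&word_index[slot], id + 4, 4);                        */
    static uint32_t word_index[512];

    /* the cheap lookup test now has a ~1/2^32 false-positive rate
       instead of 1/256 */
    static int cheap_match(unsigned int slot, const unsigned char *id)
    {
        uint32_t tag;
        memcpy(&tag, id + 4, sizeof tag);   /* bytes 5-8 of the session id */
        return word_index[slot] == tag;
    }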

> As always, O(sqrt(N)) or O(N) complexities become worse as N
> increases. When N=65536, and the cache is full, the average number of
> comparisons for a session-id lookup is (if I'm calculating correctly)
> 128. When N=131072, the average number of comparisons for a lookup
> grows to 256, and the fact that only 2 bytes are used for the quick
> comparison will start taking its additional toll.

Nope. Unless I'm mistaken, when N=65536 (assuming we have 256 subcaches,
though it could be 128) and the cache is full, the average number of
non-trivial comparisons (ie. ASN1 decodes) will be 1/512th the number
of sessions in the *subcache*, which is 256, which means 0.5 of them.
Ie. assuming the lookups are always for sessions that exist, there's a
25% chance of a false positive for each lookup.

For N=131072 using the same logic (though remember all this is subject
to a factor of up to 2 one way or the other), you'd expect
sqrt(131072)/512 ASN1 decodes during a lookup which is 0.7 - which
means approx a 50% chance of a false positive on an ASN1 decode. On the
other hand, you'd expect 65536 false positives on a single lookup using
the shmht cache.

> To put the numbers in perspective (and to show that the example Ns
> above are untypically large), a session cache of size 500Kbytes is
> typically used in Apache, making room for only about N=3000
> session-cache entries. In this case, the typical number of one-byte

That's if your sessions are "normal" (between 100 and 150 bytes each).
If you're using client-certs then each such client-authenticated
session can easily blow out to over 1Kb (the sessions contain the
client-cert encoded within them).

> comparisons a lookup needs is roughly 32 (if I'm calculating this
> correctly). This is certainly not bad, and should give excellent
> results for this algorithm (as indeed I see people are reporting).

Again the subcache mechanism reduces this right down. For a cache that
configures itself for 3000 sessions (it auto-estimates an index
capacity based on the size of the memory allocated for the cache), it
will probably use 32 subcaches with index space for 94 sessions in each
subcache. More or less. So a lookup would have a maximum iteration of
94 items (lookups for sessions that exist in the cache would look at 48
sessions on average) and the second-byte test means you reduce your
ASN1 decodes by a factor of 256 - so in general you're hardly ever
getting false positives.

> While N=3000 is typical, N=65536 is probably not because it results
> in a very big cache, about 10MB. N=131072, 20MB, is probably
> considered ridiculously large. Is this why such N's were considered
> uninteresting and this lookup algorithm was deemed good enough in all
> practical cases?
>
> If anyone is wondering why I'm thinking about this issue, and why
> someone may want more than the 500Kbyte session-cache (N=3000
> sessions), think about the following calculation:

OK, I'll snip it there and make a couple of general observations to
finish. The numbers you start to talk about would bury any kind of
conventional server - and having worked with a lot of crypto hardware I
can assure you that even when offloading *all* crypto operations to
hardware (in a zero-overhead zero-latency utopia) you will still bang
your head against processing overheads in the SSL/TLS logic (the ASN1
code particularly) and of course in the kernel - there's a lot of
networking and memory paging involved, and if you're using hardware,
the driver's going to be busy as hell too.

But the scalability problem goes further - the problem with shmht was
twofold and shmcb has only (so far) addressed one of the problems but
has addressed it well enough that the other problem is not really
observable in "normal" circumstances. The problem shmcb addressed was
the sheer enormity of the put/del/get overheads. These have been
reduced by shmcb using some smoke and mirrors with (a) lookup
optimisations such as the second-byte check, and (b) sub-dividing into
smaller and faster-to-search subcaches based on the session ids. (b) is
possible with session caching because any get/put/del has a fixed
session id to work with and no interoperation is required between
different ranges of session-ids so splitting them up into smaller
subcaches harms nothing.

The problem that shmcb did not address is parallelism - if you ran a
benchmark against a version of mod_ssl using shmht that was built with
profiling, and benchmarked using multiple requests in parallel, I
expect that you'd have found that httpd processes were often waiting on
the mutex lock for the cache. Whilst a single process is doing a (slow)
lookup on the cache, all other processes needing to do a cache
operation will block. That single mutex creates a trade-off between cache
latency and server throughput. By reducing the overheads of each cache
operation relative to other SSL/TLS overheads that do *NOT* need a
mutex, shmcb reduces the frequency and severity of this contention
possibility. However, one reason for the subcaching idea in shmcb was
to allow the possibility of multiple mutex locks, conceivably one for
each subcache (though you could mask it too, say, have 64 subcaches
sharing 16 mutexes - 4 subcaches share the same mutex). This would then
increase the parallelism of the session cache so that two processes
needing cache operations, unless they're unlucky enough to be needing
the *same* subcache, can process in parallel.
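
The mutex-masking idea is simple enough to sketch (purely hypothetical -
shmcb currently uses a single cache-wide mutex, and the names below are
made up):

    #define NUM_SUBCACHES 64
    #define NUM_MUTEXES   16    /* every 4 subcaches share one mutex        */

    /* e.g. one lock per group of subcaches: some_mutex_t *locks[NUM_MUTEXES]; */

    static unsigned int mutex_for_id(const unsigned char *id)
    {
        unsigned int subcache = id[0] & (NUM_SUBCACHES - 1);
        return subcache & (NUM_MUTEXES - 1);  /* or: subcache % NUM_MUTEXES */
    }

    /* Two processes only contend when their session ids land on subcaches
       that map to the same mutex; otherwise their cache operations can
       proceed in parallel. */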

If you're serious about wanting a *local* session cache with
astronomical scalability, you'll certainly increase your theoretical
limits best by addressing this issue first - as I mentioned earlier the
lookups are already *very* fast compared to other operations in the
SSL/TLS implementation, but unlike the other operations in the SSL/TLS
implementation, *they can't be done in parallel*! :-) This could
conceivably be an issue on SMP machines, for example.

However, I suspect that once you get into these sorts of extra-planetary
demands you're already going to be using multiple servers to handle
loads - apart from the fact that you need to be able to scale your server
environment without buying a brand new super-computer each time you
need more juice, you also need some redundancy and fault-tolerance;
that means redundant servers, network feeds, network interfaces, hubs,
etc. When you think what all this can mean to the session caching
problem, I think you'll understand why I'm not particularly worried
about tweaking shmcb much - you need to have a distributed cache
instead. The alternative is to have a load-balancer that can always
match SSL/TLS session-resume requests up to the server they last spoke
to but that sucks for a variety of reasons. One is that you lose part
of the whole point of load-balancing; that you want to balance the load
across servers based on how busy they are and *not* by a requirement to
find the particular server that remembers the client. Another is that
if a server goes down, you can route clients to other servers but you
lose all the session information that was internal to that server.
Another is that you need all the same global state anyway inside your
load-balancer for it to know about that pairing up - so your
load-balancer itself can't scale or work in parallel, and it becomes a
single point of failure and an expensive single point of bottleneck
too (ie. when you hit the limits of your load-balancer, what do you
do?). The only thing worse than a single bottleneck/point-of-failure in
the form of an inexpensive commodity server running open source (free)
software is precisely the same problem in the form of an expensive
proprietary black box running software that you can't tweak, upgrade,
or debug.

OTOH: If your servers use a distributed session cache then you can
load-balance however you please, and in fact you can have multiple
internet feeds coming into your server network with multiple
load-balancers handling each of those feeds - you don't care which
server each client's session-resume request is routed to as all the
servers share a network-wide cache.

Which leads to a shameless plug for *another* cache implementation you
might want to analyse when you start to look beyond the limits of
shmcb;
http://www.distcache.org/

BTW: Apache 2.0.43 support for distcache will be released in a day or
two, as will the first patch kit for Apache 1.3.*+mod_ssl.

Regards,
Geoff


Re: Question about session-cache algorithms

on 10.10.2002 20:23:13 by nyh

On Thu, Oct 10, 2002, Geoff Thorpe wrote about "Re: Question about session-cache algorithms":
> I suppose as the author of the implementation you're disecting I should
> probably respond in some way :-)

Thanks. Your response was very educational.

> If you check the "shmht" implementation (which is what "shmcb" was
> created to replace) you'll see that a linear search was being conducted

Yes, I didn't mention that, but I had already known that shmcb was a lot
better than shmht, for a number of reasons, which is why I didn't even
try looking into the (non-)scalability of shmht.

I had already read
http://marc.theaimsgroup.com/?l=apache-modssl&m=98531062704750&w=2
in which you explained the differences between shmcb and shmht - it was
a very good explanation.

> operation, you've got serious trouble. I'm quietly confident that you
> could add a "for(loop=0;loop<1000;loop++){}" loop around the cache

I guess that if that's the case (I have to admit, I didn't do any benchmarking
before asking my question), then indeed even a 1000-fold growth in typical
cache sizes won't make a noticeable difference - so indeed there is
no need to worry about this scalability issue at all.

> subcache of order O(sqrt(N)) to "search" rather than N (I think you
> muddled this up in your analysis). The second-byte tests in the
> subcache iteration are so considerably faster than an ASN1 decode
> function call that we can generally ignore them for complexity
> arguments.

Oh, I counted the second-byte comparisons in my complexity analysis, which
is why I always got 256 times what you got :)

> So you will likely ASN1 decode (sqrt(N)/256)/2 sessions (ie.
> sqrt(N)/512 of them) during a lookup in a full (sub)cache. This is less

Note that the number sqrt(N) is correct only up to N=65536. From there
on (if I understand correctly), the number of subcaches remains 256, and
you start having N/256 entries per subcache, so on average you have something
like N/65536 (not sqrt(N)/256) ASN1 decodes.

But obviously this only makes a difference for huge Ns, and you convinced
me that these are anyway not as useful as I thought they may be.

> cache without optimisations like the second-byte test. As for lowering
> the number of false-positives in the lookup as you seem to be
> suggesting - that's not that hard; rather than taking the second byte
> as an optimisation, we could take the fifth, sixth, seventh and eighth
> bytes (ie. the second word) and memcpy() them into a "second word"

This would only reduce the number of false-positive ASN1 decodes, not the
number of indices that are compared. But you did convince me that even these
O(sqrt(N)) or O(N/256) comparisons can scale to much more than anything
even high-end servers will want to handle in the next decade :)

> > To put the numbers in perspective (and to show that the example Ns
> > above are untypically large), a session cache of size 500Kbytes is
> > typically used in Apache, making room for only about N=3000
> > session-cache entries. In this case, the typical number of one-byte
>
> That's if your sessions are "normal" (between 100 and 150 bytes each).
> If you're using client-certs then each such client-authenticated
> session can easily blow out to over 1Kb (the sessions contain the
> client-cert encoded within them).

This is a very good point, I forgot about it.

> finish. The numbers you start to talk about would bury any kind of
> conventional server - and having worked with a lot of crypto hardware I
> can assure you that even when offloading *all* crypto operations to
> hardware (in a zero-overhead zero-latency utopia) you will still bang
> your head against processing overheads in the SSL/TLS logic (the ASN1
> code particularly) and of course in the kernel - there's a lot of
> networking and memory paging involved, and if you're using hardware,
> the driver's going to be busy as hell too.

Yes, although I did personally see a demonstration of a server, using an SSL
acceleration card, that could do 500 new SSL sessions per second (note
that all the requests were very quick and came from a LAN - otherwise
Apache couldn't handle this with its 250-process limit).

> The problem that shmcb did not address is parallelism - if you ran a
> benchmark against a version of mod_ssl using shmht that was built with
> profiling, and benchmarked using multiple requests in parallel, I
> expect that you'd have found that httpd processes were often waiting on
> the mutex lock for the cache. Whilst a single process is doing a (slow)
> lookup on the cache, all other processes needing to do a cache

This is a very good point. Naturally, any super-server like I imagined
in my first post will have multiple CPUs, so you're right, this issue
will be more important than whether 10 or 500 bytes are being compared.

> However, I suspect that once you get into these sorts of extra-planetary
> demands you're already going to be using multiple servers to handle
> loads - apart from the fact you need to be able scale your server
> environment without buying a brand new super-computer each time you
> need more juice, you also need some redundancy and fault-tolerance;

Again, you're right.
I'm convinced - there's no need to change shmcb :)

> Which leads to a shameless plug for *another* cache implementation you
> might want to analyse when you start to look beyond the limits of
> shmcb;
> http://www.distcache.org/
>
> BTW: Apache 2.0.43 support for distcache will be released in a day or
> two, as will the first patch kit for Apache 1.3.*+mod_ssl.

Thanks, I'll look at it.

Thanks for explaining to me how silly my worries about shmcb's scalability
were (oh, and thanks for writing it in the first place :) ).

Nadav.

--
Nadav Har'El | Thursday, Oct 10 2002, 5 Heshvan 5763
nyh@math.technion.ac.il |-----------------------------------------
Phone: +972-53-245868, ICQ 13349191 |"Outlook not so good." Wow! That magic 8-
http://nadav.harel.org.il |ball knows everything! So, what about IE?