patches available for DBD::Multi, and benchmarking results

am 19.01.2007 23:29:00 von mark

Some other folks may be interested in the patches I've published for
DBD::Multi recently:

http://rt.cpan.org/Public/Dist/Display.html?Name=DBD-Multi

Perhaps most interesting might be a benchmarking script I made, and
related optimization patches:

http://rt.cpan.org/Ticket/Display.html?id=24460

Unfortunately, my benchmarks found the raw overhead of DBD::Multi be
significant. My patches reduced the overhead from 212% to 187%...

I suspect the overhead is much, much lower in a real-world scenario, but
I still get the sense there is further room for improvement.

Do people have other solutions for load-balancing and failover that they
recommend, at least for read-only access?

Today I ran across

ResourcePool
http://search.cpan.org/dist/ResourcePool/

and

SQL-Relay
http://sqlrelay.sourceforge.net/

But I find few references to people actually using these tools. I'm also
trying to weigh if my efforts to boost performance might be better spent
on a caching solutions, including memcached, or some form of HTTP-level
caching...

Thanks for your feedback!

Mark

Re: patches available for DBD::Multi, and benchmarking results

am 22.01.2007 13:42:05 von henri

On Jan 20, 2007, at 12:29 AM, Mark Stosberg wrote:

>
> Some other folks may be interested in the patches I've published for
> DBD::Multi recently:
>
> http://rt.cpan.org/Public/Dist/Display.html?Name=DBD-Multi
>
> Perhaps most interesting might be a benchmarking script I made, and
> related optimization patches:
>
> http://rt.cpan.org/Ticket/Display.html?id=24460
>
> Unfortunately, my benchmarks found the raw overhead of DBD::Multi be
> significant. My patches reduced the overhead from 212% to 187%...
>
> I suspect the overhead is much, much lower in a real-world
> scenario, but
> I still get the sense there is further room for improvement.
>
> Do people have other solutions for load-balancing and failover that
> they
> recommend, at least for read-only access?

I humbly repeat my earlier suggestion to use DBIx::HA.
The difference between DBIx::HA and DBD::Multi is that DBD::Multi
completely abstracts your handles. It needs to connect the handle
created between the user and DBD::Multi, and the handle(s) between
DBD::Multi and the other DBDs. So everything you do has to be
dereferenced at least once. DBIx::HA doesn't do that. It has some
overhead in the connect/prepare/execute phases to make sure there's
proper failover when necessary, but otherwise leaves the fetches
completely alone.

DBIx::HA is certainly not the panacea (more below) and it sucks to be
doing high availability as a subclass of DBI, but it's better than
doing brute force dereferencing by having a real statement handle as
an object reference to a fake statement handle created by DBD::Multi.
I believe that the only reason DBD::Multi was done this way was
because DBI didn't have swap_inner_handle(). DBI now already does
this dual-handle business, so you're better off using that through a
subclass instead of adding yet another layer. Of course if DBI itself
had all the available API calls to do load balancing on the inner
handles it would be better than both DBIx::HA and DBD::Multi. In
effect, DBI would have built-in high availability (HA) in the C code
instead of an external Perl subclass.

However, that still doesn't solve the connection pooling problem that
surfaces shortly after a user has the need for load balancing and
failover. As long as Perl doesn't have excellent thread management,
proper connection pooling can't be done in the DBI codebase (however,
there are alternate solutions; see below).

> Today I ran across
>
> ResourcePool
> http://search.cpan.org/dist/ResourcePool/

ResourcePool is a generic resource pooling framework. Considering
that Apache::DBI already exists, if you use it in conjunction with
Apache/mod_perl and iThreads you'll get the exact same functionality.
You don't gain much by using ResourcePool for DBI resources.
Also ResourcePool only works in iThreads, so no cross-process pooling.

>
> and
>
> SQL-Relay
> http://sqlrelay.sourceforge.net/

SQL-Relay is yet another similar approach to high availability. It
does load balancing, failover and connection pooling and caching. It
works by having the user connect to SQL-Relay which then calls the
database, grabs results and returns them to the user. It attempts to
be as close to a direct database call as possible, so it needs
drivers for every database engine, and it needs to map as many API
calls as possible, including the low-level calls. In effect, it's
like DBI but in C, and has support for query caching. In order to do
connection pooling, it uses many listener processes (those that talk
to the client) and many connections processes (those that talk to the
database), and has all these creatures talking to each other through
shared memory. If you're comfortable with such an approach, go ahead.
It does seem that the Sybase driver is rather immature, but the
Oracle driver and probably the MySQL one are seeing heavy use and
development. SQL-Relay is very powerful, and it may work well enough
for you.

> But I find few references to people actually using these tools. I'm
> also
> trying to weigh if my efforts to boost performance might be better
> spent
> on a caching solutions, including memcached, or some form of HTTP-
> level
> caching...

I don't know why you're talking high availability while looking for
performance boosting. Those two are completely orthogonal, except
insofar as the more functionality you want, generally the slower
things are.

Connection pooling is a somewhat different beast than HA, part
performance boosting and part resource minimization. None of the Perl-
specific solutions today address that problem properly, simply
because you can't pass database connection handles between processes
(not Perl-specific), nor is Perl's iThreads implementation robust
enough for enterprise deployment where high performance/availability
is critical.

Some of us are working on a solution to connection pooling that would
allow for easy query caching, but it will have enough restrictions
that it won't be usable in all cases. More when things coalesce.

For performance boosting, you can certainly use memcached to cache
all or parts of web pages at the app server level. Anyway, I am sure
that a number of articles, howtos and discussions regarding web app
performance boosting can be found online.

H

Re: patches available for DBD::Multi, and benchmarking results

am 22.01.2007 15:42:54 von mark

Henri,

Thanks again for another prompt and extremely helpful response.

Henri Asseily wrote:
>
> I humbly repeat my earlier suggestion to use DBIx::HA.

I'm going to try it again today, using the same benchmark system. I'll
report back the results.

I started looking at DBD::Multi again because it load balancing
built-in, even though it should be easy to do it with DBIx::HA, as
you've described. I also realized that at least initially, I won't
really be doing load balancing-- I'll just be sending some SELECTs to a
slave to reduce the load on the master, which must keep accepting all
the write activity.

I was also put off by the discovery that it doesn't support "do()".
Later, as I thought about it more, the obvious became clear: I won't
ever be using "do()" with SELECT statements, so that didn't matter.

> ResourcePool
>> http://search.cpan.org/dist/ResourcePool/
>
> ResourcePool is a generic resource pooling framework. Considering that
> Apache::DBI already exists, if you use it in conjunction with
> Apache/mod_perl and iThreads you'll get the exact same functionality.
> You don't gain much by using ResourcePool for DBI resources.
> Also ResourcePool only works in iThreads, so no cross-process pooling.
>
>> SQL-Relay
>> http://sqlrelay.sourceforge.net/
>
> SQL-Relay is yet another similar approach to high availability. It
> does load balancing, failover and connection pooling and caching. It
> works by having the user connect to SQL-Relay which then calls the
> database, grabs results and returns them to the user. It attempts to
> be as close to a direct database call as possible, so it needs drivers
> for every database engine, and it needs to map as many API calls as
> possible, including the low-level calls. In effect, it's like DBI but
> in C, and has support for query caching. In order to do connection
> pooling, it uses many listener processes (those that talk to the
> client) and many connections processes (those that talk to the
> database), and has all these creatures talking to each other through
> shared memory. If you're comfortable with such an approach, go ahead.
> It does seem that the Sybase driver is rather immature, but the Oracle
> driver and probably the MySQL one are seeing heavy use and
> development. SQL-Relay is very powerful, and it may work well enough
> for you.

These are most useful descriptions of these tools that I've found.
SQL-Relay does sound like a possibility for me, but I'll stick with a
simpler solution such as DBIx::HA if it works.

Mark

Re: patches available for DBD::Multi, and benchmarking results

am 22.01.2007 20:51:48 von mark

I expanded by benchmark of DBD::Multi to also include DBIx::HA for
comparison.

While the benchmark reported nearly a 200% overhead for DBD::Multi,
the overhead of DBIx::HA was much better, reported at 50%.

That sounds like a lot in both cases, but is not necessarily. First, my
test case is not normal-- the SELECT statement used is incredibly
simple. Second, even the slowest one still achieved over 2,000 selects
*per second*.

Still, if both modules offer functionality that meets your needs,
why not go with higher-performance solution?

Here's my raw result:

timing 10000 iterations of ha, multi, raw...
ha: 4 wallclock secs ( 1.98 usr + 0.20 sys = 2.19 CPU) @ 4571.43/s
(n=10000)
multi: 7 wallclock secs ( 3.97 usr + 0.23 sys = 4.20 CPU) @ 2383.61/s
(n=10000)
raw: 3 wallclock secs ( 1.28 usr + 0.19 sys = 1.47 CPU) @ 6808.51/s
(n=10000)
Rate multi ha raw
multi 2384/s -- -48% -65%
ha 4571/s 92% -- -33%
raw 6809/s 186% 49% --

#############

The meat of the meat of the benchmark was essentially the following. In
each case, I made a connection to a single database:

cmpthese(10_000, {
raw => sub { $raw_dbh->selectrow_array("SELECT CURRENT_DATE") },
multi => sub { $multi_dbh->selectrow_array("SELECT CURRENT_DATE") },
ha => sub { $ha_dbh->selectrow_array("SELECT CURRENT_DATE") },
});

##############

In the process of developing this benchmark, I did find an
incompatibility with DBD::Pg, which I notified the DBD::Pg developers
about, and submitted a patch for:

http://rt.cpan.org/Ticket/Display.html?id=24503

Mark