selecting data for computation

on 24.06.2010 20:33:59 by Tim Gustafson

I'm working on a compute job that queries MySQL for work packets,
processes the work packet and then updates the row in MySQL with the
results of the job.

Currently, from each compute instance I'm doing something like:

select * from work_packets where work_data is null order by rand() limit 1;

That more or less works, and occasionally two jobs will process the same
row and I'll waste a little bit of CPU time, but as a higher percentage of
jobs get completed, duplicate work happens more and more.

I was looking at something like:

select * from work_packets where work_data is null order by rand() limit 1
for update;

and

select * from work_packets where work_data is null order by rand() limit 1
lock in share mode;

but those aren't exactly what I need - they don't cause the second compute
instance to move on to another row; instead they cause the second compute
job to wait for the first one to finish and -then- they re-do the same
work that the first job just completed.

I also thought about doing something like:

select * from work_packets where work_data is null and (select
get_lock(work_packets.id, 0) = 1 order by rand() limit 1;

but that seems to significantly increase the time it takes to run this
query from about 2 seconds to about 13 seconds. Since the jobs only take
about 10 seconds of CPU time each, this represents an actual slow-down
as opposed to a speed-up.

Is there any sort of query that I can do that will return a single row to
me that another compute job hasn't already started working on, other than
doing something like:

select * from work_packets where work_data is null and working = 0 order
by rand() limit 1 for update;
update work_packets set working = 1 where id = ;

The reason I don't want to use this approach is that sometimes the compute
jobs die in the middle of their work, which would leave these records
without any work_data, but marked as "working", and those records would
have to be manually cleaned up at some point.

I feel like I'm missing something really basic here, but I can't seem to
find the answer. Does anyone have any suggestions?

Tim Gustafson
831-332-1496
tjg@tgustafson.com
http://tgustafson.com/


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: selecting data for computation

on 24.06.2010 20:59:17 by Dan Nelson

In the last episode (Jun 24), Tim Gustafson said:
> I'm working on a compute job that queries MySQL for work packets,
> processes the work packet and then updates the row in MySQL with the
> results of the job.
>
> Currently, from each compute instance I'm doing something like:
>
> select * from work_packets where work_data is null order by rand() limit 1;
>
> That more or less works, and occasionally two jobs will process the same
> row and I'll waste a little bit of CPU time, but as a higher percentage of
> jobs get completed, duplicate work happens more and more.
[..]
> I also thought about doing something like:
>
> select * from work_packets where work_data is null and (select
> get_lock(work_packets.id, 0) = 1 order by rand() limit 1;
>
> but that seems to significantly increase the time it takes to run this
> query from about 2 seconds to about 13 seconds. Since the jobs only take
> about 10 seconds of CPU time each, this represents an actual slow-down
> as opposed to a speed-up.

That query won't execute (the subquery expression is incomplete); if you
really are using a subquery there, try it plain, with no order by rand():

select * from work_packets where work_data is null and get_lock(id,0)=1 limit 1;

My guess is your initial reason for ORDER BY RAND() was to minimize
duplicate work, but if you're using locks it'll never happen, so you might
as well just grab the first available packet. That query shouldn't even
take 2 seconds, unless you have 50k rows in the table and no index on
work_data...

--
Dan Nelson
dnelson@allantgroup.com


Re: selecting data for computation

on 25.06.2010 01:52:58 by Tim Gustafson

> select * from work_packets where work_data is null and get_lock(id,0)=1
> limit 1;

Sorry, I should have said that my example was pseudo-code that I was
typing from memory. The query you've suggested is certainly an
improvement - I don't know why I thought that the get_lock needed to be in
a subquery, but yours is certainly faster.

> My guess is your initial reason for ORDER BY RAND() was to minimize
> duplicate work, but if you're using locks it'll never happen, so you might
> as well just grab the first available packet. That query shouldn't even
> take 2 seconds, unless you have 50k rows in the table and no index on
> work_data...

Your guess is correct. I've re-written the query more or less as you've
suggested and it now takes 4 seconds to complete, which I think is
acceptable.

There is not an index on the work_data column as it's a longblob and I was
under the impression that indexing a longblob field wasn't helpful. Maybe
I should add a work_data_size field as an integer, index that, and search
for records where work_data_size = 0.

For what it's worth, the table currently has 800,000 rows in it and is
likely to expand in the future.

Tim Gustafson
831-332-1496
tjg@tgustafson.com
http://tgustafson.com/



Re: selecting data for computation

on 25.06.2010 02:19:10 by Tim Gustafson

Hrmm, something didn't work quite right. Here's the actual query:

select x, y, zoom from map_images where image_data_size = 0 and
get_lock(concat(x, ',', y, ',', zoom), 0) = 1 order by zoom
, sqrt(pow(x, 2) + pow(y, 2)) limit 1

I've added the image_data_size column and indexed it, and set its current
value to length(image_data), and modified my "update" query to set
image_data_size correctly once the job is done.

When I run more than one compute job, they all start processing the same
job simultaneously. What am I missing?

Tim Gustafson
831-332-1496
tjg@tgustafson.com
http://tgustafson.com/



Re: selecting data for computation

on 25.06.2010 19:33:25 by Dan Nelson

In the last episode (Jun 24), Tim Gustafson said:
> Hrmm, something didn't work quite right. Here's the actual query:
>
> select x, y, zoom from map_images where image_data_size = 0 and
> get_lock(concat(x, ',', y, ',', zoom), 0) = 1 order by zoom
> , sqrt(pow(x, 2) + pow(y, 2)) limit 1
>
> I've added the image_data_size column and indexed it, and set its current
> value to length(image_data), and modified my "update" query to set
> image_data_size correctly once the job is done.
>
> When I run more than one compute job, they all start processing the same
> job simultaneously. What am I missing?

That shouldn't happen, since get_lock() will only succeed for one MySQL
connection at a time. If you can easily edit the code that is running these
queries, can you add a "select is_used_lock('x,y,zoom')" query right after
your first query, and another one just before your compute client exits?
(You'll have to build the "x,y,zoom" string from the values returned by the
first query.) It'd be nice if there was an information_schema table that
listed all currently-held locks, but there isn't :(

is_used_lock() will return the MySQL connection ID of the session holding
the lock, so what I am expecting to see is zeros, meaning that for some
reason your MySQL connection is getting dropped after your first query
completed. Locks are held for the lifetime of a connection, so if you
disconnect, your lock disappears.

--
Dan Nelson
dnelson@allantgroup.com


Re: selecting data for computation

on 25.06.2010 21:05:01 by Joerg Bruehe

Hi Tim, all!


Tim Gustafson wrote:
> [[...]]
>
> There is not an index on the work_data column as it's a longblob and I was
> under the impression that indexing a longblob field wasn't helpful. Maybe
> I should add a work_data_size field as an integer, index that, and search
> for records where work_data_size = 0.

I would add some "status" field with (at least) three values, signifying
"new", "in progress", "done"; and for that I would create an index.

Then, let your application grab a record whose status is "new" and
change that to "in progress", do the computation, and store the result
and set status "done".
If you worry about aborting computations, add some date/time field which
you set when you start the computation, and periodically do a cleanup
searching for "in progress" records which are overdue.

>
> For what it's worth, the table currently has 800,000 rows in it and is
> likely to expand in the future.

That is my reason for proposing some column which can be indexed.
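
A minimal sketch of the status-plus-timestamp scheme Joerg describes, using
SQLite as a stand-in for MySQL (table layout, timeout value, and function
names are illustrative, not from the thread):

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("""create table work_packets (
    id integer primary key,
    status text not null default 'new',  -- 'new' / 'in progress' / 'done'
    started_at real,                     -- set when the computation starts
    work_data blob)""")
conn.executemany("insert into work_packets (id) values (?)",
                 [(i,) for i in range(5)])

def claim():
    """Grab one 'new' row and mark it 'in progress'; return its id or None."""
    while True:
        row = conn.execute(
            "select id from work_packets where status = 'new' limit 1").fetchone()
        if row is None:
            return None
        cur = conn.execute(
            "update work_packets set status = 'in progress', started_at = ? "
            "where id = ? and status = 'new'", (time.time(), row[0]))
        if cur.rowcount == 1:  # the status check makes the claim race-safe
            return row[0]

def finish(packet_id, result):
    conn.execute(
        "update work_packets set status = 'done', work_data = ? where id = ?",
        (result, packet_id))

def requeue_stale(timeout=60.0):
    """Periodic cleanup: put overdue 'in progress' rows back to 'new'."""
    conn.execute(
        "update work_packets set status = 'new', started_at = null "
        "where status = 'in progress' and started_at < ?",
        (time.time() - timeout,))
```

The periodic requeue_stale() pass is what handles the "compute job died
mid-work" case that worried Tim about a bare status flag.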


Jörg

--
Joerg Bruehe, MySQL Build Team, Joerg.Bruehe@Sun.COM
Sun Microsystems GmbH, Komturstrasse 18a, D-12099 Berlin
Geschaeftsfuehrer: Juergen Kunz
Amtsgericht Muenchen: HRB161028



Re: selecting data for computation

on 26.06.2010 01:14:48 by Tim Gustafson

> That shouldn't happen, since get_lock() will only succeed for one mysql
> connection at a time.

I figured it out.

Because there was an order by clause, the get_lock function was being
called for each row in the result set, and since each call to get_lock
releases any previous locks you had, MySQL was effectively locking and
then immediately unlocking (during the sort) each row in the result set.
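
For reference, MySQL of this vintage (before 5.7.5) held at most one named
lock per connection, so each GET_LOCK() call silently released the previous
lock. A toy model of that behavior (not real MySQL code) shows why per-row
evaluation during the sort leaves only the last row locked:

```python
held = {}  # connection id -> the single named lock it holds (pre-5.7.5 semantics)

def get_lock(conn_id, name):
    """Toy GET_LOCK(name, 0): taking a new lock releases the previous one."""
    if name in held.values() and held.get(conn_id) != name:
        return 0               # another connection holds this lock
    held[conn_id] = name       # implicitly drops any lock this connection held
    return 1

# The sort evaluates get_lock() once per candidate row...
for row_id in range(5):
    get_lock(conn_id=1, name=f"row-{row_id}")

# ...so when the result comes back, only the last row examined is still locked,
# and a second connection is free to lock any of the earlier rows.
```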

I wound up moving the get_lock call into a second query, like this:

set i = 0;

select * from foo order by blah limit i, 1;

and then:

select get_lock(blah, 0);

and if that returns 1 (meaning the lock was acquired), do the processing,
otherwise do:

i = i + 1; continue;

and that seems to be working like a charm now.
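
GET_LOCK is server-side and MySQL-specific, so the sketch below simulates it
with an in-process set just to show the shape of the offset-scan loop Tim
describes (table and lock names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table foo (id integer primary key, blah integer)")
conn.executemany("insert into foo values (?, ?)", [(i, i) for i in range(5)])

held_locks = set()  # stand-in for MySQL's server-side named locks

def try_lock(name):
    """Simulate GET_LOCK(name, 0): succeed only if nobody holds the lock."""
    if name in held_locks:
        return False
    held_locks.add(name)
    return True

def next_unlocked_row():
    """Walk the ordered result one offset at a time until a lock is won."""
    i = 0
    while True:
        row = conn.execute(
            "select id from foo order by blah limit ?, 1", (i,)).fetchone()
        if row is None:
            return None           # nothing left to claim
        if try_lock(f"foo-{row[0]}"):
            return row[0]         # lock acquired; safe to process this row
        i += 1                    # someone else holds it; try the next offset
```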

Thanks for your help!

Tim Gustafson
831-332-1496
tjg@tgustafson.com
http://tgustafson.com/



Re: selecting data for computation

on 28.06.2010 22:20:20 by Baron Schwartz

Tim,

What you have described is really a queue, although you are not seeing
it from that angle. This is a very common pattern in this kind of
usage. It's also tricky to make it correct (as you have discovered),
and both performance and scalability are problems in most
implementations.

A general-case good design for this is to have status fields, as
suggested, and who-owns-this-row fields. Set the rows initially to
un-owned (NULL), and status 'new' or similar. Then use a query such
as

update <tbl>
set status = 'claimed', owner = connection_id()
where status = 'new' and owner is null
limit 1;

Do that with autocommit enabled, so it does not hold locks longer than
needed. (You will certainly need to use InnoDB for this to work.) If
the statement affected any rows, then you just claimed a row, and you
can go query for the row and do the work, then mark it done. As
previously suggested, add a timestamp column and periodically look for
rows that got claimed but not processed within some amount of time,
due to crashes or bugs or what have you. Set those back to
new/unclaimed state.

After completing the jobs, move them to another table or just delete
them. Do not let this table grow full of historic data. It will
become a big performance problem if you do.
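
A sketch of Baron's claim-by-UPDATE pattern, with SQLite standing in for
InnoDB (SQLite lacks UPDATE ... LIMIT, so a subselect picks the row, and a
worker id stands in for the MySQL-specific connection_id()):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table work_packets (
    id integer primary key,
    status text not null default 'new',
    owner integer)""")
conn.executemany("insert into work_packets (id) values (?)",
                 [(i,) for i in range(3)])

def claim(worker_id):
    """Claim one unowned 'new' row for worker_id; return its id or None.
    (In MySQL this would be UPDATE ... LIMIT 1 with owner = connection_id().)"""
    cur = conn.execute(
        """update work_packets
           set status = 'claimed', owner = ?
           where id = (select id from work_packets
                       where status = 'new' and owner is null limit 1)""",
        (worker_id,))
    if cur.rowcount == 0:
        return None            # nothing left to claim
    row = conn.execute(
        "select id from work_packets where owner = ? and status = 'claimed'",
        (worker_id,)).fetchone()
    return row[0]

def mark_done(packet_id):
    conn.execute("update work_packets set status = 'done' where id = ?",
                 (packet_id,))
```

The rowcount check is the key move: if the UPDATE affected a row, this
worker owns it and can go fetch it; if not, the queue is empty.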

- Baron


Re: selecting data for computation

on 01.07.2010 18:22:53 by Tim Gustafson

> A general-case good design for this is to have status fields, as
> suggested, and who-owns-this-row fields. Set the rows initially to
> un-owned (NULL), and status 'new' or similar.

The "get_lock" approach was ultimately successful, and I think better than
using a status field as it is entirely self-managing. If one of the
compute clients dies, its locks are immediately released and another
compute client can immediately pick up where it left off. I just had to
separate the "get_lock" query from the "select" query for it to work.

I also optimized the select query by adding a "distance" column, setting
it to "sqrt(pow(x, 2) + pow(y, 2))", and then ordering the select by that
column instead of running the function each time; that sped up the select
query to less than a second per run.

I also plan on changing the limit clause from "limit %d, 1" to something
like "limit 500" and then iterating through the results until I get one
that succeeds, as that is more efficient than running multiple select
queries.
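
The batched variant Tim plans, together with the precomputed distance
column, might look roughly like this (SQLite stand-in again; the lock is
simulated in-process, and the batch size and index name are illustrative):

```python
import math
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""create table map_images (
    x integer, y integer, zoom integer,
    image_data_size integer not null default 0,
    distance real)""")
tiles = [(x, y, z) for z in (1, 2) for x in range(2) for y in range(2)]
conn.executemany("insert into map_images (x, y, zoom) values (?, ?, ?)", tiles)

# Precompute the distance once (here in Python, since not every SQLite build
# ships sqrt()) and index the columns the scan filters and sorts on.
for rowid, x, y in conn.execute("select rowid, x, y from map_images").fetchall():
    conn.execute("update map_images set distance = ? where rowid = ?",
                 (math.hypot(x, y), rowid))
conn.execute("create index idx_claim on map_images (image_data_size, zoom, distance)")

held_locks = set()  # stand-in for MySQL's server-side named locks

def try_lock(name):
    if name in held_locks:
        return False
    held_locks.add(name)
    return True

def claim_batch(batch=500):
    """One SELECT of up to `batch` candidates, then try the lock on each in
    order -- cheaper than re-running the SELECT once per failed lock."""
    rows = conn.execute(
        """select x, y, zoom from map_images
           where image_data_size = 0
           order by zoom, distance
           limit ?""", (batch,)).fetchall()
    for x, y, zoom in rows:
        if try_lock(f"{x},{y},{zoom}"):
            return (x, y, zoom)
    return None            # every pending tile is locked by another worker
```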

Anyhow, thanks for all the good suggestions and pointers - it all worked
out in the end!

Tim Gustafson
831-332-1496
tjg@tgustafson.com
http://tgustafson.com/

