Need advice on a good setup for "generic queries"
on 12.07.2009 18:39:47 by Morten
Hi,
I'm working on a table that has about 12 columns against which
arbitrary queries must perform really well. Currently there are a lot
of indexes on the table, but I'm hitting some problems, and adding
more indexes seems like a slippery slope (there are ~15 multi-column
indexes already; I'd like to reduce that).
So I'm looking for a way out and I'm currently considering:
* Building a memory table on top of the existing table
* Indexing with Sphinx and throwing the queries against Sphinx instead
* Using a different "in-memory-DB" like Tokyo Cabinet for the queries
* Building a series of "reporting tables" which each handle a subset
of the supported queries
All of these solutions would keep the current table as the
authoritative source, and a couple of minutes of lag is acceptable.
I'm tempted to go for the memory table and update it based on which
rows in the parent table have changed since the last refresh.
Eliminating duplicates could be a challenge, unless I build a new
table for each update and then "rename" the tables into place - but
that's costly in terms of memory.
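Roughly, I imagine the swap working like this (just a sketch; the
table names are made up, and the point is that RENAME TABLE swaps
several tables in one atomic statement):

    -- Rebuild a fresh MEMORY copy, then swap it in atomically
    -- (table names are hypothetical):
    CREATE TABLE actions_mem_new LIKE actions_mem;
    INSERT INTO actions_mem_new SELECT * FROM actions;
    RENAME TABLE actions_mem     TO actions_mem_old,
                 actions_mem_new TO actions_mem;
    DROP TABLE actions_mem_old;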
What do people usually do in this situation? Any other solutions to
consider?
Thanks,
Morten
Re: Need advice on a good setup for "generic queries"
on 13.07.2009 03:06:12 by Morten
Mike, you're right - sorry.
I've been reading High Performance MySQL today and got some great tips
from it which will help a lot. I think the fundamental challenge now
is that the table contains a lot of timestamps, and querying against
these involves multiple range queries, which makes indexing hard.
The "actions" table has the following columns (of relevance to the
example):
status_id
assignee_id
company_id
created_at
assigned_at
opened_at
updated_at
verified_at
due_at
solved_at
closed_at
Queries could be:
"Show all actions which are assigned to Tom, were created in
October and solved in November"
"Show all open actions which were opened before August, do not have
an assignee and were verified last week"
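In SQL, the first of those would be something along these lines (the
assignee id and the year are made up for illustration):

    -- "Assigned to Tom, created in October, solved in November":
    SELECT *
    FROM actions
    WHERE assignee_id = 42    -- Tom, hypothetically
      AND created_at >= '2008-10-01' AND created_at < '2008-11-01'
      AND solved_at  >= '2008-11-01' AND solved_at  < '2008-12-01';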
Queries like these, which combine easily indexable fields (status_id,
assignee_id, company_id) with multiple conditions on different date
ranges, are what's difficult. The table has about 2,500,000 records
and grows at a daily rate of about 50,000 records (and that number is
growing). Once an action has been closed, it gets status "closed" and
is no longer of interest. About 70% of the records in the table have
status "closed".
I think what I'm looking for now is some way to encode the different
date values into a single column which can be indexed, with the value
calculated and updated by a background job. This will cost some
precision, but I hope it can be done. Otherwise I'm back to
considering alternative index/query mechanisms.
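As a sketch of what I mean (the year-month bucket encoding and the
column names are made up), the background job could maintain coarse
buckets like this:

    -- Coarse, indexable year-month buckets (encoding is made up):
    ALTER TABLE actions
      ADD COLUMN created_ym MEDIUMINT,
      ADD COLUMN solved_ym  MEDIUMINT,
      ADD INDEX idx_ym (created_ym, solved_ym);

    -- Background job refreshes buckets for recently touched rows:
    UPDATE actions
    SET created_ym = EXTRACT(YEAR_MONTH FROM created_at),
        solved_ym  = EXTRACT(YEAR_MONTH FROM solved_at)
    WHERE updated_at >= NOW() - INTERVAL 5 MINUTE;

"Created in October and solved in November" would then be an equality
match (created_ym = 200810 AND solved_ym = 200811) instead of two
ranges.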
Does my problem make a little more sense now? Thanks.
Morten
On Jul 12, 2009, at 3:54 PM, mos wrote:
> Morten,
> Perhaps you could also add how many rows are in the table, how
> many rows are added each day, what are the column types, and what do
> the search queries look like?
>
> Mike
Re: Need advice on a good setup for "generic queries"
on 13.07.2009 06:13:54 by mos
At 08:06 PM 7/12/2009, Morten wrote:
>Mike, you're right - sorry.
>
>I've been reading High Performance MySQL today and got some great tips
>from it which will help a lot.
Yes, it is a good book. I hope you have the 2nd edition.
> I think the fundamental challenge
>now is that the table contains a lot of timestamps, and querying
>against these involves multiple range queries, which makes indexing hard.
If you can get rid of the DATETIME columns and switch to just DATE, it
may speed up the indexes.
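For example, for each date column where the time of day doesn't matter
(note that the conversion throws the time portion away):

    -- Narrow a column from DATETIME to DATE (drops the time of day):
    ALTER TABLE actions MODIFY created_at DATE;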
>The "actions" table has the following columns (of relevance to the
>example):
>
> status_id
> assignee_id
> company_id
> created_at
> assigned_at
> opened_at
> updated_at
> verified_at
> due_at
> solved_at
> closed_at
>
>Queries could be:
>
> "Show all actions which are assigned to Tom, were created in
>October and solved in November"
> "Show all open actions which were opened before August, do not have
>an assignee and were verified last week"
>
>Queries like these, which combine easily indexable fields (status_id,
>assignee_id, company_id) with multiple conditions on different date
>ranges, are what's difficult. The table has about 2,500,000 records
>and grows at a daily rate of about 50,000 records (and that number is
>growing). Once an action has been closed, it gets status "closed" and
>is no longer of interest. About 70% of the records in the table have
>status "closed".
So why not have 2 tables: "Cases_Active" for the cases currently open,
and "Cases_Closed" for the rest.
This will greatly reduce the number of rows in the active table and
the depth of the index tree. Your application selects the appropriate
table depending on whether the request concerns "Active" or "Closed"
cases. You can also define a Merge table "Cases_All" that logically
combines both tables, so you can query closed and active cases
together if you need to, or just do a UNION over the two tables via 2
separate queries.
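A rough sketch, with the columns abbreviated to a few from your list
(all tables involved must be MyISAM with identical definitions):

    CREATE TABLE Cases_Active (
        id          INT UNSIGNED NOT NULL,
        status_id   INT,
        assignee_id INT,
        company_id  INT,
        created_at  DATE,
        solved_at   DATE,
        KEY idx_assignee (assignee_id, created_at)
    ) ENGINE=MyISAM;

    CREATE TABLE Cases_Closed LIKE Cases_Active;

    -- The MERGE table presents both as one logical table:
    CREATE TABLE Cases_All (
        id          INT UNSIGNED NOT NULL,
        status_id   INT,
        assignee_id INT,
        company_id  INT,
        created_at  DATE,
        solved_at   DATE,
        KEY idx_assignee (assignee_id, created_at)
    ) ENGINE=MERGE UNION=(Cases_Active, Cases_Closed) INSERT_METHOD=FIRST;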
You didn't mention how many queries per second you need to handle. A
couple hundred queries per second can be handled by MyISAM because it
has an excellent query cache that can be tweaked. You can also load
the indexes into memory if needed.
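For example (the sizes are illustrative, tune them to your RAM):

    # my.cnf, values illustrative:
    query_cache_type = 1
    query_cache_size = 128M
    key_buffer_size  = 512M

and to pre-load a MyISAM table's indexes into the key cache:

    LOAD INDEX INTO CACHE Cases_Active;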
The main problem, as you mentioned, is trying to index all of the date
fields. I don't think this is necessary. I noticed that MySQL 5.1
(because it was rushed out the door) does not always use the correct
index, and I often have to force it to use the proper index with FORCE
INDEX. So you need to use EXPLAIN on your slow queries (make sure you
log them) and find out which index they are using.
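For example (the index name here is made up; use whatever your table
actually has):

    -- See which index the optimizer picks:
    EXPLAIN SELECT * FROM actions
    WHERE assignee_id = 42
      AND created_at >= '2008-10-01' AND created_at < '2008-11-01';

    -- If it picks the wrong one, force the right one:
    SELECT * FROM actions FORCE INDEX (idx_assignee_created)
    WHERE assignee_id = 42
      AND created_at >= '2008-10-01' AND created_at < '2008-11-01';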
You can use a Memory table, but they don't work with Merge tables. I
would only consider this if the table gets updates every second or
two, since that flushes the query cache. I'm not sure how much money
you're willing to throw at this project, but I know of some additional
hardware that can squeeze out more speed.
>I think what I'm looking for now is some way to encode the different
>date values into a single column which can be indexed, with the value
>calculated and updated by a background job. This will cost some
>precision, but I hope it can be done. Otherwise I'm back to
>considering alternative index/query mechanisms.
>
>Does my problem make a little more sense now? Thanks.
Yup! :-)
Mike
Re: Need advice on a good setup for "generic queries"
on 13.07.2009 16:35:38 by Morten
>> I've been reading High Performance MySQL today and got some great
>> tips from it which will help a lot.
>
> Yes it is a good book. I hope you have the 2nd edition.
I do. I should have read it years ago (well, the 1st edition at
least). So many caveats to using indexes.
> So why not have 2 tables: "Cases_Active" for the cases currently
> open, and "Cases_Closed" for the rest.
I'd lose easy reporting across open and closed, but as you say I could
use a UNION for that. Reporting isn't expected to be fast anyway.
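I.e. something like this (the company filter is made up):

    SELECT * FROM Cases_Active WHERE company_id = 7
    UNION ALL
    SELECT * FROM Cases_Closed WHERE company_id = 7;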
> You can use a Memory table, but they don't work with Merge tables. I
> would only consider this if the table gets updates every second or
> two, since that flushes the query cache. I'm not sure how much money
> you're willing to throw at this project, but I know of some
> additional hardware that can squeeze out more speed.
Well, I could just throw some more RAM at it. But ideally I'd have a
sound setup first, before mindlessly adding resources (however
tempting that is).
Thanks for your tips. I'll be looking further into splitting the tables.
Morten
Re: Need advice on a good setup for "generic queries"
on 13.07.2009 22:37:04 by Simon J Mudd
mos99@fastmail.fm (mos) writes:
> At 08:06 PM 7/12/2009, Morten wrote:
>
> If you can get rid of the DATETIME columns and switch to just DATE,
> it may speed up the indexes.
While not as pretty, it's more compact to convert timestamp values
into a BIGINT, for example seconds since the epoch. If you know the
ranges to put in the query, then store them this way and save on some
storage, and therefore improve performance. May be worth considering?
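As a sketch (the created_epoch column is made up; created_at is from
Morten's mail):

    -- Store epoch seconds alongside (or instead of) the DATETIME:
    ALTER TABLE actions ADD COLUMN created_epoch BIGINT,
                        ADD INDEX idx_created_epoch (created_epoch);
    UPDATE actions SET created_epoch = UNIX_TIMESTAMP(created_at);

    -- Range queries then compare plain integers:
    SELECT * FROM actions
    WHERE created_epoch >= UNIX_TIMESTAMP('2008-10-01 00:00:00')
      AND created_epoch <  UNIX_TIMESTAMP('2008-11-01 00:00:00');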
....
> >Queries like these, which combine easily indexable fields (status_id,
> >assignee_id, company_id) with multiple conditions on different date
> >ranges, are what's difficult. The table has about 2,500,000 records
> >and grows at a daily rate of about 50,000 records (and that number is
> >growing). Once an action has been closed, it gets status "closed" and
> >is no longer of interest. About 70% of the records in the table have
> >status "closed".
As mentioned, if you are not interested in "closed" records, get rid
of them: put them in another table.
That reduces the number of rows and hence the query time.
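E.g. as a periodic job, using the table names Mike suggested (the
closed status value is made up):

    -- Move closed rows out of the hot table:
    INSERT INTO Cases_Closed
        SELECT * FROM Cases_Active WHERE status_id = 5;
    DELETE FROM Cases_Active WHERE status_id = 5;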
Simon