MySQL and set complements

MySQL and set complements

am 08.07.2011 03:50:57 von Leonardo Borges

--bcaec52154cf9601cd04a7851056
Content-Type: text/plain; charset=ISO-8859-1

Hello everyone,

I have an increasingly popular web application running on top of mysql and
due to its popularity, I'm running into performance issues. After carefully
examining database indexes and tuning queries I was able to pin down the
slowest part of the system.

The app's got a user segmentation tool that allows you to filter users based
on a range of criteria from which the slowest is: "Select all users that did
not receive the email of id 100"

To answer this question we turn to the activities table, which is basically
a denormalized log of actions taken by the user in this format:
user_id | activity | email_id | ...
10 | email_sent | 100 | ...
10 | subscribed | NULL | ...
10 | email_open | 100 | ...


Given this table and the question above, the usual way of finding out all
users who did not receive this email is through the use of a left outer
join, such as:

select u.id
from users u
left outer join activities a
on u.id = a.user_id
and a.activity = 'email_sent'
and a.email_id = 100
where a.user_id is null

That's all fine for medium-ish tables. However our current activities table
has over 13 million rows, slowing the hell out of this left outer join,
taking about 52 seconds in my machine.

What this query is trying to do is to get the relative complement of set
A(users) to B(activities). As far as I know mysql doesn't support set
subtraction, thus the reason for these queries being slow.

Based on that I've setup a test database on Postgresql, which supports this
very set operation and rewrote the query to look like this:

select u.id
from users u
except
select a.user_id
from activities a
where a.activity = 'email_sent'
and a.email_id = 100;

The fact that postgresql knows how to subtract sets brought this query down
to only 4 seconds.

My question then is: since this is a somewhat common query in our system,
are there any workarounds I could use in mysql to improve things?

I did find one myself, but it's a bit convoluted and might not perform well
under load, but the following sql script gives me similar performance in
mysql:

create temporary table email_sent_100
select a.user_id
from user_activity_events a
where a.activity = 'email_sent'


and a.email_id = 100;

create index user_id_idx on email_sent_100(user_id); //this could
potentially bring the runtime down in the case of a larg temp table.

select count(u.id)
from users u
left outer join email_sent_100 s
on u.id = s.user_id
and s.user_id is null;

A lot more lines and a lot more complex, but does the job in this example.

I'd appreciate your thoughts.

Cheers,
Leonardo Borges
www.leonardoborges.com

--bcaec52154cf9601cd04a7851056--

Re: MySQL and set complements

am 08.07.2011 03:58:14 von Johnny Withers

--000e0cd342b69faccf04a7852a4d
Content-Type: text/plain; charset=ISO-8859-1

Can you post show create table for activity and explain output of the
problem query?

On Jul 7, 2011 8:51 PM, "Leonardo Borges"
wrote:

Hello everyone,

I have an increasingly popular web application running on top of mysql and
due to its popularity, I'm running into performance issues. After carefully
examining database indexes and tuning queries I was able to pin down the
slowest part of the system.

The app's got a user segmentation tool that allows you to filter users based
on a range of criteria from which the slowest is: "Select all users that did
not receive the email of id 100"

To answer this question we turn to the activities table, which is basically
a denormalized log of actions taken by the user in this format:
user_id | activity | email_id | ...
10 | email_sent | 100 | ...
10 | subscribed | NULL | ...
10 | email_open | 100 | ...


Given this table and the question above, the usual way of finding out all
users who did not receive this email is through the use of a left outer
join, such as:

select u.id
from users u
left outer join activities a
on u.id = a.user_id
and a.activity = 'email_sent'
and a.email_id = 100
where a.user_id is null

That's all fine for medium-ish tables. However our current activities table
has over 13 million rows, slowing the hell out of this left outer join,
taking about 52 seconds in my machine.

What this query is trying to do is to get the relative complement of set
A(users) to B(activities). As far as I know mysql doesn't support set
subtraction, thus the reason for these queries being slow.

Based on that I've setup a test database on Postgresql, which supports this
very set operation and rewrote the query to look like this:

select u.id
from users u
except
select a.user_id
from activities a
where a.activity = 'email_sent'
and a.email_id = 100;

The fact that postgresql knows how to subtract sets brought this query down
to only 4 seconds.

My question then is: since this is a somewhat common query in our system,
are there any workarounds I could use in mysql to improve things?

I did find one myself, but it's a bit convoluted and might not perform well
under load, but the following sql script gives me similar performance in
mysql:

create temporary table email_sent_100
select a.user_id
from user_activity_events a
where a.activity = 'email_sent'


and a.email_id = 100;

create index user_id_idx on email_sent_100(user_id); //this could
potentially bring the runtime down in the case of a larg temp table.

select count(u.id)
from users u
left outer join email_sent_100 s
on u.id = s.user_id
and s.user_id is null;

A lot more lines and a lot more complex, but does the job in this example.

I'd appreciate your thoughts.

Cheers,
Leonardo Borges
www.leonardoborges.com

--000e0cd342b69faccf04a7852a4d--

Re: MySQL and set complements

am 08.07.2011 04:07:09 von Leonardo Borges

--bcaec520e99d8b7fb004a7854a23
Content-Type: text/plain; charset=ISO-8859-1

Sure can:

show create table activities;

CREATE TABLE `activities` (
`id` int(11) NOT NULL AUTO_INCREMENT,
`user_id` int(11) DEFAULT NULL,
`email` varchar(100) DEFAULT NULL,
`country_iso` varchar(2) DEFAULT NULL,
`tags` varchar(255) DEFAULT NULL,
`postcode` int(11) DEFAULT NULL,
`activity` varchar(100) DEFAULT NULL,
`page_id` int(11) DEFAULT NULL,
`donation_frequency` varchar(100) DEFAULT NULL,
`email_id` int(11) DEFAULT NULL,
`created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
CURRENT_TIMESTAMP,
PRIMARY KEY (`id`),
KEY `user_idx` (`user_id`),
KEY `email_idx` (`email_id`),
KEY `activity_idx` (`activity`)
) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1


And the explain:

+----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
| id | select_type | table | type | possible_keys | key
| key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
| 1 | SIMPLE | u | index | NULL |
id_idx | 5 | NULL | 972064 | Using index |
| 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx |
user_idx | 5 | getup.u.id | 20 | Using where |
+----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+


Cheers,
Leonardo Borges
www.leonardoborges.com


On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers wrote:

> Can you post show create table for activity and explain output of the
> problem query?
>
> On Jul 7, 2011 8:51 PM, "Leonardo Borges"
> wrote:
>
> Hello everyone,
>
> I have an increasingly popular web application running on top of mysql and
> due to its popularity, I'm running into performance issues. After carefully
> examining database indexes and tuning queries I was able to pin down the
> slowest part of the system.
>
> The app's got a user segmentation tool that allows you to filter users
> based
> on a range of criteria from which the slowest is: "Select all users that
> did
> not receive the email of id 100"
>
> To answer this question we turn to the activities table, which is basically
> a denormalized log of actions taken by the user in this format:
> user_id | activity | email_id | ...
> 10 | email_sent | 100 | ...
> 10 | subscribed | NULL | ...
> 10 | email_open | 100 | ...
>
>
> Given this table and the question above, the usual way of finding out all
> users who did not receive this email is through the use of a left outer
> join, such as:
>
> select u.id
> from users u
> left outer join activities a
> on u.id = a.user_id
> and a.activity = 'email_sent'
> and a.email_id = 100
> where a.user_id is null
>
> That's all fine for medium-ish tables. However our current activities
> table
> has over 13 million rows, slowing the hell out of this left outer join,
> taking about 52 seconds in my machine.
>
> What this query is trying to do is to get the relative complement of set
> A(users) to B(activities). As far as I know mysql doesn't support set
> subtraction, thus the reason for these queries being slow.
>
> Based on that I've setup a test database on Postgresql, which supports this
> very set operation and rewrote the query to look like this:
>
> select u.id
> from users u
> except
> select a.user_id
> from activities a
> where a.activity = 'email_sent'
> and a.email_id = 100;
>
> The fact that postgresql knows how to subtract sets brought this query down
> to only 4 seconds.
>
> My question then is: since this is a somewhat common query in our system,
> are there any workarounds I could use in mysql to improve things?
>
> I did find one myself, but it's a bit convoluted and might not perform well
> under load, but the following sql script gives me similar performance in
> mysql:
>
> create temporary table email_sent_100
> select a.user_id
> from user_activity_events a
> where a.activity = 'email_sent'
>
>
> and a.email_id = 100;
>
> create index user_id_idx on email_sent_100(user_id); //this could
> potentially bring the runtime down in the case of a larg temp table.
>
> select count(u.id)
> from users u
> left outer join email_sent_100 s
> on u.id = s.user_id
> and s.user_id is null;
>
> A lot more lines and a lot more complex, but does the job in this example.
>
> I'd appreciate your thoughts.
>
> Cheers,
> Leonardo Borges
> www.leonardoborges.com
>
>

--bcaec520e99d8b7fb004a7854a23--

Re: MySQL and set complements

am 08.07.2011 15:18:29 von Johnny Withers

--0016364d2c795ea3f804a78eab85
Content-Type: text/plain; charset=ISO-8859-1

Leonardo,

I think a new compound key on email_id and activity in the activities table
may help.

I'm not sure if this will help or not, Its hard to test w/o having a large
data set to test against.


On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges > wrote:

> Sure can:
>
> show create table activities;
>
> CREATE TABLE `activities` (
> `id` int(11) NOT NULL AUTO_INCREMENT,
> `user_id` int(11) DEFAULT NULL,
> `email` varchar(100) DEFAULT NULL,
> `country_iso` varchar(2) DEFAULT NULL,
> `tags` varchar(255) DEFAULT NULL,
> `postcode` int(11) DEFAULT NULL,
> `activity` varchar(100) DEFAULT NULL,
> `page_id` int(11) DEFAULT NULL,
> `donation_frequency` varchar(100) DEFAULT NULL,
> `email_id` int(11) DEFAULT NULL,
> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
> CURRENT_TIMESTAMP,
> PRIMARY KEY (`id`),
> KEY `user_idx` (`user_id`),
> KEY `email_idx` (`email_id`),
> KEY `activity_idx` (`activity`)
> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>
>
> And the explain:
>
>
> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
> | id | select_type | table | type | possible_keys | key
> | key_len | ref | rows | Extra |
>
> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
> | 1 | SIMPLE | u | index | NULL |
> id_idx | 5 | NULL | 972064 | Using index |
> | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx |
> user_idx | 5 | getup.u.id | 20 | Using where |
>
> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>
>
> Cheers,
> Leonardo Borges
> www.leonardoborges.com
>
>
> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers wrote:
>
>> Can you post show create table for activity and explain output of the
>> problem query?
>>
>> On Jul 7, 2011 8:51 PM, "Leonardo Borges"
>> wrote:
>>
>> Hello everyone,
>>
>> I have an increasingly popular web application running on top of mysql and
>> due to its popularity, I'm running into performance issues. After
>> carefully
>> examining database indexes and tuning queries I was able to pin down the
>> slowest part of the system.
>>
>> The app's got a user segmentation tool that allows you to filter users
>> based
>> on a range of criteria from which the slowest is: "Select all users that
>> did
>> not receive the email of id 100"
>>
>> To answer this question we turn to the activities table, which is
>> basically
>> a denormalized log of actions taken by the user in this format:
>> user_id | activity | email_id | ...
>> 10 | email_sent | 100 | ...
>> 10 | subscribed | NULL | ...
>> 10 | email_open | 100 | ...
>>
>>
>> Given this table and the question above, the usual way of finding out all
>> users who did not receive this email is through the use of a left outer
>> join, such as:
>>
>> select u.id
>> from users u
>> left outer join activities a
>> on u.id = a.user_id
>> and a.activity = 'email_sent'
>> and a.email_id = 100
>> where a.user_id is null
>>
>> That's all fine for medium-ish tables. However our current activities
>> table
>> has over 13 million rows, slowing the hell out of this left outer join,
>> taking about 52 seconds in my machine.
>>
>> What this query is trying to do is to get the relative complement of set
>> A(users) to B(activities). As far as I know mysql doesn't support set
>> subtraction, thus the reason for these queries being slow.
>>
>> Based on that I've setup a test database on Postgresql, which supports
>> this
>> very set operation and rewrote the query to look like this:
>>
>> select u.id
>> from users u
>> except
>> select a.user_id
>> from activities a
>> where a.activity = 'email_sent'
>> and a.email_id = 100;
>>
>> The fact that postgresql knows how to subtract sets brought this query
>> down
>> to only 4 seconds.
>>
>> My question then is: since this is a somewhat common query in our system,
>> are there any workarounds I could use in mysql to improve things?
>>
>> I did find one myself, but it's a bit convoluted and might not perform
>> well
>> under load, but the following sql script gives me similar performance in
>> mysql:
>>
>> create temporary table email_sent_100
>> select a.user_id
>> from user_activity_events a
>> where a.activity = 'email_sent'
>>
>>
>> and a.email_id = 100;
>>
>> create index user_id_idx on email_sent_100(user_id); //this could
>> potentially bring the runtime down in the case of a larg temp table.
>>
>> select count(u.id)
>> from users u
>> left outer join email_sent_100 s
>> on u.id = s.user_id
>> and s.user_id is null;
>>
>> A lot more lines and a lot more complex, but does the job in this example.
>>
>> I'd appreciate your thoughts.
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>


--
-----------------------------
Johnny Withers
601.209.4985
johnny@pixelated.net

--0016364d2c795ea3f804a78eab85--

Re: MySQL and set complements

am 08.07.2011 15:53:57 von Leonardo Borges

--bcaec52154cf36afe804a78f2a23
Content-Type: text/plain; charset=ISO-8859-1

Hi Johnny,

I just gave that a try but it didn't help as I suspected.

I still believe the problem is in mysql not being able to handle set
subtractions. Therefore, it has to perform the work harder to return the
rows that represent a "no match" with NULL values in place so they can then
be filtered by the WHERE clause.


This type of query seems to be a corner case in mysql one should be aware
about when working with large datasets.

Cheers,
Leonardo Borges
www.leonardoborges.com


On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers wrote:

> Leonardo,
>
> I think a new compound key on email_id and activity in the activities table
> may help.
>
> I'm not sure if this will help or not, Its hard to test w/o having a large
> data set to test against.
>
>
> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
> leonardoborges.rj@gmail.com> wrote:
>
>> Sure can:
>>
>> show create table activities;
>>
>> CREATE TABLE `activities` (
>> `id` int(11) NOT NULL AUTO_INCREMENT,
>> `user_id` int(11) DEFAULT NULL,
>> `email` varchar(100) DEFAULT NULL,
>> `country_iso` varchar(2) DEFAULT NULL,
>> `tags` varchar(255) DEFAULT NULL,
>> `postcode` int(11) DEFAULT NULL,
>> `activity` varchar(100) DEFAULT NULL,
>> `page_id` int(11) DEFAULT NULL,
>> `donation_frequency` varchar(100) DEFAULT NULL,
>> `email_id` int(11) DEFAULT NULL,
>> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
>> CURRENT_TIMESTAMP,
>> PRIMARY KEY (`id`),
>> KEY `user_idx` (`user_id`),
>> KEY `email_idx` (`email_id`),
>> KEY `activity_idx` (`activity`)
>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>
>>
>> And the explain:
>>
>>
>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>> | id | select_type | table | type | possible_keys | key
>> | key_len | ref | rows | Extra |
>>
>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>> | 1 | SIMPLE | u | index | NULL |
>> id_idx | 5 | NULL | 972064 | Using index |
>> | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx |
>> user_idx | 5 | getup.u.id | 20 | Using where |
>>
>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>>
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers wrote:
>>
>>> Can you post show create table for activity and explain output of the
>>> problem query?
>>>
>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges"
>>> wrote:
>>>
>>> Hello everyone,
>>>
>>> I have an increasingly popular web application running on top of mysql
>>> and
>>> due to its popularity, I'm running into performance issues. After
>>> carefully
>>> examining database indexes and tuning queries I was able to pin down the
>>> slowest part of the system.
>>>
>>> The app's got a user segmentation tool that allows you to filter users
>>> based
>>> on a range of criteria from which the slowest is: "Select all users that
>>> did
>>> not receive the email of id 100"
>>>
>>> To answer this question we turn to the activities table, which is
>>> basically
>>> a denormalized log of actions taken by the user in this format:
>>> user_id | activity | email_id | ...
>>> 10 | email_sent | 100 | ...
>>> 10 | subscribed | NULL | ...
>>> 10 | email_open | 100 | ...
>>>
>>>
>>> Given this table and the question above, the usual way of finding out all
>>> users who did not receive this email is through the use of a left outer
>>> join, such as:
>>>
>>> select u.id
>>> from users u
>>> left outer join activities a
>>> on u.id = a.user_id
>>> and a.activity = 'email_sent'
>>> and a.email_id = 100
>>> where a.user_id is null
>>>
>>> That's all fine for medium-ish tables. However our current activities
>>> table
>>> has over 13 million rows, slowing the hell out of this left outer join,
>>> taking about 52 seconds in my machine.
>>>
>>> What this query is trying to do is to get the relative complement of set
>>> A(users) to B(activities). As far as I know mysql doesn't support set
>>> subtraction, thus the reason for these queries being slow.
>>>
>>> Based on that I've setup a test database on Postgresql, which supports
>>> this
>>> very set operation and rewrote the query to look like this:
>>>
>>> select u.id
>>> from users u
>>> except
>>> select a.user_id
>>> from activities a
>>> where a.activity = 'email_sent'
>>> and a.email_id = 100;
>>>
>>> The fact that postgresql knows how to subtract sets brought this query
>>> down
>>> to only 4 seconds.
>>>
>>> My question then is: since this is a somewhat common query in our system,
>>> are there any workarounds I could use in mysql to improve things?
>>>
>>> I did find one myself, but it's a bit convoluted and might not perform
>>> well
>>> under load, but the following sql script gives me similar performance in
>>> mysql:
>>>
>>> create temporary table email_sent_100
>>> select a.user_id
>>> from user_activity_events a
>>> where a.activity = 'email_sent'
>>>
>>>
>>> and a.email_id = 100;
>>>
>>> create index user_id_idx on email_sent_100(user_id); //this could
>>> potentially bring the runtime down in the case of a larg temp table.
>>>
>>> select count(u.id)
>>> from users u
>>> left outer join email_sent_100 s
>>> on u.id = s.user_id
>>> and s.user_id is null;
>>>
>>> A lot more lines and a lot more complex, but does the job in this
>>> example.
>>>
>>> I'd appreciate your thoughts.
>>>
>>> Cheers,
>>> Leonardo Borges
>>> www.leonardoborges.com
>>>
>>>
>>
>
>
> --
> -----------------------------
> Johnny Withers
> 601.209.4985
> johnny@pixelated.net
>

--bcaec52154cf36afe804a78f2a23--

Re: MySQL and set complements

am 08.07.2011 16:00:12 von Johnny Withers

--000e0cd4849e99a99c04a78f404a
Content-Type: text/plain; charset=ISO-8859-1

What did the explain output look like after the new index?


On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges > wrote:

> Hi Johnny,
>
> I just gave that a try but it didn't help as I suspected.
>
> I still believe the problem is in mysql not being able to handle set
> subtractions. Therefore, it has to perform the work harder to return the
> rows that represent a "no match" with NULL values in place so they can then
> be filtered by the WHERE clause.
>
>
> This type of query seems to be a corner case in mysql one should be aware
> about when working with large datasets.
>
> Cheers,
> Leonardo Borges
> www.leonardoborges.com
>
>
> On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers wrote:
>
>> Leonardo,
>>
>> I think a new compound key on email_id and activity in the activities
>> table may help.
>>
>> I'm not sure if this will help or not, Its hard to test w/o having a large
>> data set to test against.
>>
>>
>> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
>> leonardoborges.rj@gmail.com> wrote:
>>
>>> Sure can:
>>>
>>> show create table activities;
>>>
>>> CREATE TABLE `activities` (
>>> `id` int(11) NOT NULL AUTO_INCREMENT,
>>> `user_id` int(11) DEFAULT NULL,
>>> `email` varchar(100) DEFAULT NULL,
>>> `country_iso` varchar(2) DEFAULT NULL,
>>> `tags` varchar(255) DEFAULT NULL,
>>> `postcode` int(11) DEFAULT NULL,
>>> `activity` varchar(100) DEFAULT NULL,
>>> `page_id` int(11) DEFAULT NULL,
>>> `donation_frequency` varchar(100) DEFAULT NULL,
>>> `email_id` int(11) DEFAULT NULL,
>>> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
>>> CURRENT_TIMESTAMP,
>>> PRIMARY KEY (`id`),
>>> KEY `user_idx` (`user_id`),
>>> KEY `email_idx` (`email_id`),
>>> KEY `activity_idx` (`activity`)
>>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>>
>>>
>>> And the explain:
>>>
>>>
>>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>>> | id | select_type | table | type | possible_keys |
>>> key | key_len | ref | rows | Extra |
>>>
>>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>>> | 1 | SIMPLE | u | index | NULL |
>>> id_idx | 5 | NULL | 972064 | Using index |
>>> | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx |
>>> user_idx | 5 | getup.u.id | 20 | Using where |
>>>
>>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>>>
>>>
>>> Cheers,
>>> Leonardo Borges
>>> www.leonardoborges.com
>>>
>>>
>>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers wrote:
>>>
>>>> Can you post show create table for activity and explain output of the
>>>> problem query?
>>>>
>>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges"
>>>> wrote:
>>>>
>>>> Hello everyone,
>>>>
>>>> I have an increasingly popular web application running on top of mysql
>>>> and
>>>> due to its popularity, I'm running into performance issues. After
>>>> carefully
>>>> examining database indexes and tuning queries I was able to pin down the
>>>> slowest part of the system.
>>>>
>>>> The app's got a user segmentation tool that allows you to filter users
>>>> based
>>>> on a range of criteria from which the slowest is: "Select all users that
>>>> did
>>>> not receive the email of id 100"
>>>>
>>>> To answer this question we turn to the activities table, which is
>>>> basically
>>>> a denormalized log of actions taken by the user in this format:
>>>> user_id | activity | email_id | ...
>>>> 10 | email_sent | 100 | ...
>>>> 10 | subscribed | NULL | ...
>>>> 10 | email_open | 100 | ...
>>>>
>>>>
>>>> Given this table and the question above, the usual way of finding out
>>>> all
>>>> users who did not receive this email is through the use of a left outer
>>>> join, such as:
>>>>
>>>> select u.id
>>>> from users u
>>>> left outer join activities a
>>>> on u.id = a.user_id
>>>> and a.activity = 'email_sent'
>>>> and a.email_id = 100
>>>> where a.user_id is null
>>>>
>>>> That's all fine for medium-ish tables. However our current activities
>>>> table
>>>> has over 13 million rows, slowing the hell out of this left outer join,
>>>> taking about 52 seconds in my machine.
>>>>
>>>> What this query is trying to do is to get the relative complement of set
>>>> A(users) to B(activities). As far as I know mysql doesn't support set
>>>> subtraction, thus the reason for these queries being slow.
>>>>
>>>> Based on that I've setup a test database on Postgresql, which supports
>>>> this
>>>> very set operation and rewrote the query to look like this:
>>>>
>>>> select u.id
>>>> from users u
>>>> except
>>>> select a.user_id
>>>> from activities a
>>>> where a.activity = 'email_sent'
>>>> and a.email_id = 100;
>>>>
>>>> The fact that postgresql knows how to subtract sets brought this query
>>>> down
>>>> to only 4 seconds.
>>>>
>>>> My question then is: since this is a somewhat common query in our
>>>> system,
>>>> are there any workarounds I could use in mysql to improve things?
>>>>
>>>> I did find one myself, but it's a bit convoluted and might not perform
>>>> well
>>>> under load, but the following sql script gives me similar performance in
>>>> mysql:
>>>>
>>>> create temporary table email_sent_100
>>>> select a.user_id
>>>> from user_activity_events a
>>>> where a.activity = 'email_sent'
>>>>
>>>>
>>>> and a.email_id = 100;
>>>>
>>>> create index user_id_idx on email_sent_100(user_id); //this could
>>>> potentially bring the runtime down in the case of a larg temp table.
>>>>
>>>> select count(u.id)
>>>> from users u
>>>> left outer join email_sent_100 s
>>>> on u.id = s.user_id
>>>> and s.user_id is null;
>>>>
>>>> A lot more lines and a lot more complex, but does the job in this
>>>> example.
>>>>
>>>> I'd appreciate your thoughts.
>>>>
>>>> Cheers,
>>>> Leonardo Borges
>>>> www.leonardoborges.com
>>>>
>>>>
>>>
>>
>>
>> --
>> -----------------------------
>> Johnny Withers
>> 601.209.4985
>> johnny@pixelated.net
>>
>
>


--
-----------------------------
Johnny Withers
601.209.4985
johnny@pixelated.net

--000e0cd4849e99a99c04a78f404a--

Re: MySQL and set complements

am 08.07.2011 16:19:37 von Leonardo Borges

--bcaec52154cf0971b804a78f8663
Content-Type: text/plain; charset=ISO-8859-1

Same as before, but with the new index listed in the possible keys:

+----+-------------+-------+-------+------------------------ ----------------------+----------+---------+------------+--- -----+-------------+
| id | select_type | table | type | possible_keys
| key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+------------------------ ----------------------+----------+---------+------------+--- -----+-------------+
| 1 | SIMPLE | u | index | NULL
| id_idx | 5 | NULL | 972064 | Using index |
| 1 | SIMPLE | a | ref |
user_idx,email_idx,activity_idx,compound_idx | user_idx | 5 |
getup.u.id | 20 | Using where |
+----+-------------+-------+-------+------------------------ ----------------------+----------+---------+------------+--- -----+-------------+



On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers wrote:

> What did the explain output look like after the new index?
>
>
> On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges <
> leonardoborges.rj@gmail.com> wrote:
>
>> Hi Johnny,
>>
>> I just gave that a try but it didn't help as I suspected.
>>
>> I still believe the problem is in mysql not being able to handle set
>> subtractions. Therefore, it has to perform the work harder to return the
>> rows that represent a "no match" with NULL values in place so they can then
>> be filtered by the WHERE clause.
>>
>>
>> This type of query seems to be a corner case in mysql one should be aware
>> about when working with large datasets.
>>
>> Cheers,
>> Leonardo Borges
>> www.leonardoborges.com
>>
>>
>> On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers wrote:
>>
>>> Leonardo,
>>>
>>> I think a new compound key on email_id and activity in the activities
>>> table may help.
>>>
>>> I'm not sure if this will help or not, Its hard to test w/o having a
>>> large data set to test against.
>>>
>>>
>>> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
>>> leonardoborges.rj@gmail.com> wrote:
>>>
>>>> Sure can:
>>>>
>>>> show create table activities;
>>>>
>>>> CREATE TABLE `activities` (
>>>> `id` int(11) NOT NULL AUTO_INCREMENT,
>>>> `user_id` int(11) DEFAULT NULL,
>>>> `email` varchar(100) DEFAULT NULL,
>>>> `country_iso` varchar(2) DEFAULT NULL,
>>>> `tags` varchar(255) DEFAULT NULL,
>>>> `postcode` int(11) DEFAULT NULL,
>>>> `activity` varchar(100) DEFAULT NULL,
>>>> `page_id` int(11) DEFAULT NULL,
>>>> `donation_frequency` varchar(100) DEFAULT NULL,
>>>> `email_id` int(11) DEFAULT NULL,
>>>> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
>>>> CURRENT_TIMESTAMP,
>>>> PRIMARY KEY (`id`),
>>>> KEY `user_idx` (`user_id`),
>>>> KEY `email_idx` (`email_id`),
>>>> KEY `activity_idx` (`activity`)
>>>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>>>>
>>>>
>>>> And the explain:
>>>>
>>>>
>>>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>>>> | id | select_type | table | type | possible_keys |
>>>> key | key_len | ref | rows | Extra |
>>>>
>>>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>>>> | 1 | SIMPLE | u | index | NULL |
>>>> id_idx | 5 | NULL | 972064 | Using index |
>>>> | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx |
>>>> user_idx | 5 | getup.u.id | 20 | Using where |
>>>>
>>>> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
>>>>
>>>>
>>>> Cheers,
>>>> Leonardo Borges
>>>> www.leonardoborges.com
>>>>
>>>>
>>>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers wrote:
>>>>
>>>>> Can you post show create table for activity and explain output of the
>>>>> problem query?
>>>>>
>>>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges"
>>>>> wrote:
>>>>>
>>>>> Hello everyone,
>>>>>
>>>>> I have an increasingly popular web application running on top of mysql
>>>>> and
>>>>> due to its popularity, I'm running into performance issues. After
>>>>> carefully
>>>>> examining database indexes and tuning queries I was able to pin down
>>>>> the
>>>>> slowest part of the system.
>>>>>
>>>>> The app's got a user segmentation tool that allows you to filter users
>>>>> based
>>>>> on a range of criteria from which the slowest is: "Select all users
>>>>> that did
>>>>> not receive the email of id 100"
>>>>>
>>>>> To answer this question we turn to the activities table, which is
>>>>> basically
>>>>> a denormalized log of actions taken by the user in this format:
>>>>> user_id | activity | email_id | ...
>>>>> 10 | email_sent | 100 | ...
>>>>> 10 | subscribed | NULL | ...
>>>>> 10 | email_open | 100 | ...
>>>>>
>>>>>
>>>>> Given this table and the question above, the usual way of finding out
>>>>> all
>>>>> users who did not receive this email is through the use of a left outer
>>>>> join, such as:
>>>>>
>>>>> select u.id
>>>>> from users u
>>>>> left outer join activities a
>>>>> on u.id = a.user_id
>>>>> and a.activity = 'email_sent'
>>>>> and a.email_id = 100
>>>>> where a.user_id is null
>>>>>
>>>>> That's all fine for medium-ish tables. However our current activities
>>>>> table
>>>>> has over 13 million rows, slowing the hell out of this left outer join,
>>>>> taking about 52 seconds in my machine.
>>>>>
>>>>> What this query is trying to do is to get the relative complement of
>>>>> set
>>>>> A(users) to B(activities). As far as I know mysql doesn't support set
>>>>> subtraction, thus the reason for these queries being slow.
>>>>>
>>>>> Based on that I've setup a test database on Postgresql, which supports
>>>>> this
>>>>> very set operation and rewrote the query to look like this:
>>>>>
>>>>> select u.id
>>>>> from users u
>>>>> except
>>>>> select a.user_id
>>>>> from activities a
>>>>> where a.activity = 'email_sent'
>>>>> and a.email_id = 100;
>>>>>
>>>>> The fact that postgresql knows how to subtract sets brought this query
>>>>> down
>>>>> to only 4 seconds.
>>>>>
>>>>> My question then is: since this is a somewhat common query in our
>>>>> system,
>>>>> are there any workarounds I could use in mysql to improve things?
>>>>>
>>>>> I did find one myself, but it's a bit convoluted and might not perform
>>>>> well
>>>>> under load, but the following sql script gives me similar performance
>>>>> in
>>>>> mysql:
>>>>>
>>>>> create temporary table email_sent_100
>>>>> select a.user_id
>>>>> from user_activity_events a
>>>>> where a.activity = 'email_sent'
>>>>>
>>>>>
>>>>> and a.email_id = 100;
>>>>>
>>>>> create index user_id_idx on email_sent_100(user_id); //this could
>>>>> potentially bring the runtime down in the case of a larg temp table.
>>>>>
>>>>> select count(u.id)
>>>>> from users u
>>>>> left outer join email_sent_100 s
>>>>> on u.id = s.user_id
>>>>> and s.user_id is null;
>>>>>
>>>>> A lot more lines and a lot more complex, but does the job in this
>>>>> example.
>>>>>
>>>>> I'd appreciate your thoughts.
>>>>>
>>>>> Cheers,
>>>>> Leonardo Borges
>>>>> www.leonardoborges.com
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> -----------------------------
>>> Johnny Withers
>>> 601.209.4985
>>> johnny@pixelated.net
>>>
>>
>>
>
>
> --
> -----------------------------
> Johnny Withers
> 601.209.4985
> johnny@pixelated.net
>

--bcaec52154cf0971b804a78f8663--

Re: MySQL and set complements

am 08.07.2011 23:24:51 von mos

Leonardo,
What happens when you use "force index(user_id)" ?

See http://dev.mysql.com/doc/refman/5.1/en/index-hints.html

Mike

At 09:19 AM 7/8/2011, you wrote:
>Same as before, but with the new index listed in the possible keys:
>
>+----+-------------+-------+-------+----------------------- -----------------------+----------+---------+------------+-- ------+-------------+
>| id | select_type | table | type | possible_keys
> | key | key_len | ref | rows | Extra |
>+----+-------------+-------+-------+----------------------- -----------------------+----------+---------+------------+-- ------+-------------+
>| 1 | SIMPLE | u | index | NULL
> | id_idx | 5 | NULL | 972064 | Using index |
>| 1 | SIMPLE | a | ref |
>user_idx,email_idx,activity_idx,compound_idx | user_idx | 5 |
>getup.u.id | 20 | Using where |
>+----+-------------+-------+-------+----------------------- -----------------------+----------+---------+------------+-- ------+-------------+
>
>
>
>On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers wrote:
>
> > What did the explain output look like after the new index?
> >
> >
> > On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges <
> > leonardoborges.rj@gmail.com> wrote:
> >
> >> Hi Johnny,
> >>
> >> I just gave that a try but it didn't help as I suspected.
> >>
> >> I still believe the problem is in mysql not being able to handle set
> >> subtractions. Therefore, it has to perform the work harder to return the
> >> rows that represent a "no match" with NULL values in place so they can
> then
> >> be filtered by the WHERE clause.
> >>
> >>
> >> This type of query seems to be a corner case in mysql one should be aware
> >> about when working with large datasets.
> >>
> >> Cheers,
> >> Leonardo Borges
> >> www.leonardoborges.com
> >>
> >>
> >> On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers
> wrote:
> >>
> >>> Leonardo,
> >>>
> >>> I think a new compound key on email_id and activity in the activities
> >>> table may help.
> >>>
> >>> I'm not sure if this will help or not, Its hard to test w/o having a
> >>> large data set to test against.
> >>>
> >>>
> >>> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
> >>> leonardoborges.rj@gmail.com> wrote:
> >>>
> >>>> Sure can:
> >>>>
> >>>> show create table activities;
> >>>>
> >>>> CREATE TABLE `activities` (
> >>>> `id` int(11) NOT NULL AUTO_INCREMENT,
> >>>> `user_id` int(11) DEFAULT NULL,
> >>>> `email` varchar(100) DEFAULT NULL,
> >>>> `country_iso` varchar(2) DEFAULT NULL,
> >>>> `tags` varchar(255) DEFAULT NULL,
> >>>> `postcode` int(11) DEFAULT NULL,
> >>>> `activity` varchar(100) DEFAULT NULL,
> >>>> `page_id` int(11) DEFAULT NULL,
> >>>> `donation_frequency` varchar(100) DEFAULT NULL,
> >>>> `email_id` int(11) DEFAULT NULL,
> >>>> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
> >>>> CURRENT_TIMESTAMP,
> >>>> PRIMARY KEY (`id`),
> >>>> KEY `user_idx` (`user_id`),
> >>>> KEY `email_idx` (`email_id`),
> >>>> KEY `activity_idx` (`activity`)
> >>>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
> >>>>
> >>>>
> >>>> And the explain:
> >>>>
> >>>>
> >>>>
> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
> >>>> | id | select_type | table | type | possible_keys |
> >>>> key | key_len | ref | rows | Extra |
> >>>>
> >>>>
> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
> >>>> | 1 | SIMPLE | u | index | NULL |
> >>>> id_idx | 5 | NULL | 972064 | Using index |
> >>>> | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_idx |
> >>>> user_idx | 5 | getup.u.id | 20 | Using where |
> >>>>
> >>>>
> +----+-------------+-------+-------+------------------------ ---------+----------+---------+------------+--------+------- ------+
> >>>>
> >>>>
> >>>> Cheers,
> >>>> Leonardo Borges
> >>>> www.leonardoborges.com
> >>>>
> >>>>
> >>>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers
> wrote:
> >>>>
> >>>>> Can you post show create table for activity and explain output of the
> >>>>> problem query?
> >>>>>
> >>>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges"
> >>>>> wrote:
> >>>>>
> >>>>> Hello everyone,
> >>>>>
> >>>>> I have an increasingly popular web application running on top of mysql
> >>>>> and
> >>>>> due to its popularity, I'm running into performance issues. After
> >>>>> carefully
> >>>>> examining database indexes and tuning queries I was able to pin down
> >>>>> the
> >>>>> slowest part of the system.
> >>>>>
> >>>>> The app's got a user segmentation tool that allows you to filter users
> >>>>> based
> >>>>> on a range of criteria from which the slowest is: "Select all users
> >>>>> that did
> >>>>> not receive the email of id 100"
> >>>>>
> >>>>> To answer this question we turn to the activities table, which is
> >>>>> basically
> >>>>> a denormalized log of actions taken by the user in this format:
> >>>>> user_id | activity | email_id | ...
> >>>>> 10 | email_sent | 100 | ...
> >>>>> 10 | subscribed | NULL | ...
> >>>>> 10 | email_open | 100 | ...
> >>>>>
> >>>>>
> >>>>> Given this table and the question above, the usual way of finding out
> >>>>> all
> >>>>> users who did not receive this email is through the use of a left outer
> >>>>> join, such as:
> >>>>>
> >>>>> select u.id
> >>>>> from users u
> >>>>> left outer join activities a
> >>>>> on u.id = a.user_id
> >>>>> and a.activity = 'email_sent'
> >>>>> and a.email_id = 100
> >>>>> where a.user_id is null
> >>>>>
> >>>>> That's all fine for medium-ish tables. However our current activities
> >>>>> table
> >>>>> has over 13 million rows, slowing the hell out of this left outer join,
> >>>>> taking about 52 seconds in my machine.
> >>>>>
> >>>>> What this query is trying to do is to get the relative complement of
> >>>>> set
> >>>>> A(users) to B(activities). As far as I know mysql doesn't support set
> >>>>> subtraction, thus the reason for these queries being slow.
> >>>>>
> >>>>> Based on that I've setup a test database on Postgresql, which supports
> >>>>> this
> >>>>> very set operation and rewrote the query to look like this:
> >>>>>
> >>>>> select u.id
> >>>>> from users u
> >>>>> except
> >>>>> select a.user_id
> >>>>> from activities a
> >>>>> where a.activity = 'email_sent'
> >>>>> and a.email_id = 100;
> >>>>>
> >>>>> The fact that postgresql knows how to subtract sets brought this query
> >>>>> down
> >>>>> to only 4 seconds.
> >>>>>
> >>>>> My question then is: since this is a somewhat common query in our
> >>>>> system,
> >>>>> are there any workarounds I could use in mysql to improve things?
> >>>>>
> >>>>> I did find one myself, but it's a bit convoluted and might not perform
> >>>>> well
> >>>>> under load, but the following sql script gives me similar performance
> >>>>> in
> >>>>> mysql:
> >>>>>
> >>>>> create temporary table email_sent_100
> >>>>> select a.user_id
> >>>>> from user_activity_events a
> >>>>> where a.activity = 'email_sent'
> >>>>>
> >>>>>
> >>>>> and a.email_id = 100;
> >>>>>
> >>>>> create index user_id_idx on email_sent_100(user_id); //this could
> >>>>> potentially bring the runtime down in the case of a larg temp table.
> >>>>>
> >>>>> select count(u.id)
> >>>>> from users u
> >>>>> left outer join email_sent_100 s
> >>>>> on u.id = s.user_id
> >>>>> and s.user_id is null;
> >>>>>
> >>>>> A lot more lines and a lot more complex, but does the job in this
> >>>>> example.
> >>>>>
> >>>>> I'd appreciate your thoughts.
> >>>>>
> >>>>> Cheers,
> >>>>> Leonardo Borges
> >>>>> www.leonardoborges.com
> >>>>>
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> -----------------------------
> >>> Johnny Withers
> >>> 601.209.4985
> >>> johnny@pixelated.net
> >>>
> >>
> >>
> >
> >
> > --
> > -----------------------------
> > Johnny Withers
> > 601.209.4985
> > johnny@pixelated.net
> >


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: MySQL and set complements

am 12.07.2011 15:48:48 von Leonardo Borges

--bcaec53961dc354fc704a7df8fe9
Content-Type: text/plain; charset=ISO-8859-1

Just now realized I answered to Mike only.... oops.

So posting it again... forcing the use of the use_id index didn't really
improve things, unfortunately.


Cheers,
Leonardo Borges
www.leonardoborges.com


On Sat, Jul 9, 2011 at 7:24 AM, mos wrote:

> Leonardo,
> What happens when you use "force index(user_id)" ?
>
> See http://dev.mysql.com/doc/**refman/5.1/en/index-hints.html
>
> Mike
>
> At 09:19 AM 7/8/2011, you wrote:
>
>> Same as before, but with the new index listed in the possible keys:
>>
>> +----+-------------+-------+--**-----+---------------------- --**
>> ----------------------+-------**---+---------+------------+- --**
>> -----+-------------+
>> | id | select_type | table | type | possible_keys
>> | key | key_len | ref | rows | Extra |
>> +----+-------------+-------+--**-----+---------------------- --**
>> ----------------------+-------**---+---------+------------+- --**
>> -----+-------------+
>> | 1 | SIMPLE | u | index | NULL
>> | id_idx | 5 | NULL | 972064 | Using index |
>> | 1 | SIMPLE | a | ref |
>> user_idx,email_idx,activity_**idx,compound_idx | user_idx | 5 |
>> getup.u.id | 20 | Using where |
>> +----+-------------+-------+--**-----+---------------------- --**
>> ----------------------+-------**---+---------+------------+- --**
>> -----+-------------+
>>
>>
>>
>> On Sat, Jul 9, 2011 at 12:00 AM, Johnny Withers >> >wrote:
>>
>> > What did the explain output look like after the new index?
>> >
>> >
>> > On Fri, Jul 8, 2011 at 8:53 AM, Leonardo Borges <
>> > leonardoborges.rj@gmail.com> wrote:
>> >
>> >> Hi Johnny,
>> >>
>> >> I just gave that a try but it didn't help as I suspected.
>> >>
>> >> I still believe the problem is in mysql not being able to handle set
>> >> subtractions. Therefore, it has to perform the work harder to return
>> the
>> >> rows that represent a "no match" with NULL values in place so they can
>> then
>> >> be filtered by the WHERE clause.
>> >>
>> >>
>> >> This type of query seems to be a corner case in mysql one should be
>> aware
>> >> about when working with large datasets.
>> >>
>> >> Cheers,
>> >> Leonardo Borges
>> >> www.leonardoborges.com
>> >>
>> >>
>> >> On Fri, Jul 8, 2011 at 11:18 PM, Johnny Withers >> >wrote:
>> >>
>> >>> Leonardo,
>> >>>
>> >>> I think a new compound key on email_id and activity in the activities
>> >>> table may help.
>> >>>
>> >>> I'm not sure if this will help or not, Its hard to test w/o having a
>> >>> large data set to test against.
>> >>>
>> >>>
>> >>> On Thu, Jul 7, 2011 at 9:07 PM, Leonardo Borges <
>> >>> leonardoborges.rj@gmail.com> wrote:
>> >>>
>> >>>> Sure can:
>> >>>>
>> >>>> show create table activities;
>> >>>>
>> >>>> CREATE TABLE `activities` (
>> >>>> `id` int(11) NOT NULL AUTO_INCREMENT,
>> >>>> `user_id` int(11) DEFAULT NULL,
>> >>>> `email` varchar(100) DEFAULT NULL,
>> >>>> `country_iso` varchar(2) DEFAULT NULL,
>> >>>> `tags` varchar(255) DEFAULT NULL,
>> >>>> `postcode` int(11) DEFAULT NULL,
>> >>>> `activity` varchar(100) DEFAULT NULL,
>> >>>> `page_id` int(11) DEFAULT NULL,
>> >>>> `donation_frequency` varchar(100) DEFAULT NULL,
>> >>>> `email_id` int(11) DEFAULT NULL,
>> >>>> `created_at` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE
>> >>>> CURRENT_TIMESTAMP,
>> >>>> PRIMARY KEY (`id`),
>> >>>> KEY `user_idx` (`user_id`),
>> >>>> KEY `email_idx` (`email_id`),
>> >>>> KEY `activity_idx` (`activity`)
>> >>>> ) ENGINE=MyISAM AUTO_INCREMENT=11331976 DEFAULT CHARSET=latin1
>> >>>>
>> >>>>
>> >>>> And the explain:
>> >>>>
>> >>>>
>> >>>> +----+-------------+-------+--**-----+---------------------- --**
>> ---------+----------+---------**+------------+--------+----- --**------+
>> >>>> | id | select_type | table | type | possible_keys
>> |
>> >>>> key | key_len | ref | rows | Extra |
>> >>>>
>> >>>> +----+-------------+-------+--**-----+---------------------- --**
>> ---------+----------+---------**+------------+--------+----- --**------+
>> >>>> | 1 | SIMPLE | u | index | NULL
>> |
>> >>>> id_idx | 5 | NULL | 972064 | Using index |
>> >>>> | 1 | SIMPLE | a | ref | user_idx,email_idx,activity_**idx
>> |
>> >>>> user_idx | 5 | getup.u.id | 20 | Using where |
>> >>>>
>> >>>> +----+-------------+-------+--**-----+---------------------- --**
>> ---------+----------+---------**+------------+--------+----- --**------+
>> >>>>
>> >>>>
>> >>>> Cheers,
>> >>>> Leonardo Borges
>> >>>> www.leonardoborges.com
>> >>>>
>> >>>>
>> >>>> On Fri, Jul 8, 2011 at 11:58 AM, Johnny Withers <
>> johnny@pixelated.net>wrote:
>> >>>>
>> >>>>> Can you post show create table for activity and explain output of
>> the
>> >>>>> problem query?
>> >>>>>
>> >>>>> On Jul 7, 2011 8:51 PM, "Leonardo Borges" <
>> leonardoborges.rj@gmail.com>
>> >>>>> wrote:
>> >>>>>
>> >>>>> Hello everyone,
>> >>>>>
>> >>>>> I have an increasingly popular web application running on top of
>> mysql
>> >>>>> and
>> >>>>> due to its popularity, I'm running into performance issues. After
>> >>>>> carefully
>> >>>>> examining database indexes and tuning queries I was able to pin down
>> >>>>> the
>> >>>>> slowest part of the system.
>> >>>>>
>> >>>>> The app's got a user segmentation tool that allows you to filter
>> users
>> >>>>> based
>> >>>>> on a range of criteria from which the slowest is: "Select all users
>> >>>>> that did
>> >>>>> not receive the email of id 100"
>> >>>>>
>> >>>>> To answer this question we turn to the activities table, which is
>> >>>>> basically
>> >>>>> a denormalized log of actions taken by the user in this format:
>> >>>>> user_id | activity | email_id | ...
>> >>>>> 10 | email_sent | 100 | ...
>> >>>>> 10 | subscribed | NULL | ...
>> >>>>> 10 | email_open | 100 | ...
>> >>>>>
>> >>>>>
>> >>>>> Given this table and the question above, the usual way of finding
>> out
>> >>>>> all
>> >>>>> users who did not receive this email is through the use of a left
>> outer
>> >>>>> join, such as:
>> >>>>>
>> >>>>> select u.id
>> >>>>> from users u
>> >>>>> left outer join activities a
>> >>>>> on u.id = a.user_id
>> >>>>> and a.activity = 'email_sent'
>> >>>>> and a.email_id = 100
>> >>>>> where a.user_id is null
>> >>>>>
>> >>>>> That's all fine for medium-ish tables. However our current
>> activities
>> >>>>> table
>> >>>>> has over 13 million rows, slowing the hell out of this left outer
>> join,
>> >>>>> taking about 52 seconds in my machine.
>> >>>>>
>> >>>>> What this query is trying to do is to get the relative complement of
>> >>>>> set
>> >>>>> A(users) to B(activities). As far as I know mysql doesn't support
>> set
>> >>>>> subtraction, thus the reason for these queries being slow.
>> >>>>>
>> >>>>> Based on that I've setup a test database on Postgresql, which
>> supports
>> >>>>> this
>> >>>>> very set operation and rewrote the query to look like this:
>> >>>>>
>> >>>>> select u.id
>> >>>>> from users u
>> >>>>> except
>> >>>>> select a.user_id
>> >>>>> from activities a
>> >>>>> where a.activity = 'email_sent'
>> >>>>> and a.email_id = 100;
>> >>>>>
>> >>>>> The fact that postgresql knows how to subtract sets brought this
>> query
>> >>>>> down
>> >>>>> to only 4 seconds.
>> >>>>>
>> >>>>> My question then is: since this is a somewhat common query in our
>> >>>>> system,
>> >>>>> are there any workarounds I could use in mysql to improve things?
>> >>>>>
>> >>>>> I did find one myself, but it's a bit convoluted and might not
>> perform
>> >>>>> well
>> >>>>> under load, but the following sql script gives me similar
>> performance
>> >>>>> in
>> >>>>> mysql:
>> >>>>>
>> >>>>> create temporary table email_sent_100
>> >>>>> select a.user_id
>> >>>>> from user_activity_events a
>> >>>>> where a.activity = 'email_sent'
>> >>>>>
>> >>>>>
>> >>>>> and a.email_id = 100;
>> >>>>>
>> >>>>> create index user_id_idx on email_sent_100(user_id); //this could
>> >>>>> potentially bring the runtime down in the case of a larg temp table.
>> >>>>>
>> >>>>> select count(u.id)
>> >>>>> from users u
>> >>>>> left outer join email_sent_100 s
>> >>>>> on u.id = s.user_id
>> >>>>> and s.user_id is null;
>> >>>>>
>> >>>>> A lot more lines and a lot more complex, but does the job in this
>> >>>>> example.
>> >>>>>
>> >>>>> I'd appreciate your thoughts.
>> >>>>>
>> >>>>> Cheers,
>> >>>>> Leonardo Borges
>> >>>>> www.leonardoborges.com
>> >>>>>
>> >>>>>
>> >>>>
>> >>>
>> >>>
>> >>> --
>> >>> -----------------------------
>> >>> Johnny Withers
>> >>> 601.209.4985
>> >>> johnny@pixelated.net
>> >>>
>> >>
>> >>
>> >
>> >
>> > --
>> > -----------------------------
>> > Johnny Withers
>> > 601.209.4985
>> > johnny@pixelated.net
>> >
>>
>
>
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe: http://lists.mysql.com/mysql?**
> unsub=leonardoborges.rj@gmail.**com
>
>

--bcaec53961dc354fc704a7df8fe9--