splitting large tables vertically

on 09.05.2009 14:44:21 by Kyong Kim

I was wondering about a scale out problem.
Let's say you have a large table with 3 cols and 500+ million rows.

Would there be much benefit in splitting the columns into different tables
based on INT-type primary keys shared across the tables? The split tables would be
hosted on the same physical instance but could be spread over multiple disks.
We're also open to splitting the query and reconstituting the data at the
application layer, such as:

select col1, col2 from t1 where col2 = 'name';
select col2 from t2 where col1 = ?; -- bind the col1 value returned by the first query
select col2 from t3 where col1 = ?;

as opposed to

select t1.col2, t2.col2, t3.col2 from t1 inner join t2 on t1.col1 = t2.col1
inner join t3 on t1.col1 = t3.col1;

My concern with this approach is the overhead of joining such a large
number of rows. I was doing some research into the cost of joins, and as
of 5.0, joins were still nested-loop scans.
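For reference, the plan MySQL picks for the join version can be inspected with EXPLAIN. A sketch against the hypothetical t1/t2/t3 tables above ('name' is just a placeholder value):

```sql
EXPLAIN
SELECT t1.col2, t2.col2, t3.col2
FROM t1
INNER JOIN t2 ON t1.col1 = t2.col1
INNER JOIN t3 ON t1.col1 = t3.col1
WHERE t1.col2 = 'name';
```

If col1 is the primary key of t2 and t3, the rows for those tables should show type eq_ref, i.e. one indexed lookup per driving row rather than a scan.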

I was wondering if there are others with practical experience in this
matter and what they've found. Any feedback will be much appreciated.

Kyong
Inst. Web Programmer
CMDBA 5.0

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: splitting large tables vertically

on 09.05.2009 21:23:47 by mos

Do the 3 tables have different column structures? Or do they all have the
same table structure? For example, is Table1 storing only data for year
1990 and Table2 storing data for 1991, etc.? If so, you could use a MERGE
table. (Or do you need transactions? In that case you will need to use
InnoDB, and MERGE tables are not possible.)

Is there always a 1:1 relationship between the tables? If so, I'd recommend
storing all the data in one table. Joins are really expensive and are to be
avoided if at all possible.

One way I found to make joins faster when retrieving under 5k rows is to
select the rows from each table into temporary MEMORY tables,
build the primary index on each memory table, and then use the memory
tables for the join. This seems like more work, but it is about twice as
fast as joining the physical tables together.
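A minimal sketch of that technique, with hypothetical table and column names (t1/t2 as in the earlier example, col1 as the join key):

```sql
-- Pull just the rows of interest into temporary MEMORY tables.
CREATE TEMPORARY TABLE tmp1 ENGINE=MEMORY
  SELECT col1, col2 FROM t1 WHERE col2 = 'name';
CREATE TEMPORARY TABLE tmp2 ENGINE=MEMORY
  SELECT col1, col2 FROM t2 WHERE col1 IN (SELECT col1 FROM tmp1);

-- Index the join key so the nested-loop join does indexed lookups.
ALTER TABLE tmp1 ADD PRIMARY KEY (col1);
ALTER TABLE tmp2 ADD PRIMARY KEY (col1);

-- Join the small in-memory tables instead of the huge base tables.
SELECT tmp1.col2, tmp2.col2
FROM tmp1 INNER JOIN tmp2 ON tmp1.col1 = tmp2.col1;

DROP TEMPORARY TABLE tmp1, tmp2;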

Hope this helps. If you could provide us with more information as to how
you will be adding and retrieving the data (how many rows are retrieved
per query?), perhaps we can narrow down the solution for you.


Mike




Re: splitting large tables vertically

on 10.05.2009 11:15:30 by Simon J Mudd

kimkyong@fhda.edu ("Kyong Kim") writes:

> I was wondering about a scale out problem.
> Let's say you have a large table with 3 cols and 500+ million rows.
>
> Would there be much benefit in splitting the columns into different tables
> based on INT type primary keys across the tables?

To answer your question properly requires more information:

1. Expected table structure. Can you show the current CREATE TABLE xxx\G output?
2. Expected use cases to extract data?
3. Do you expect to delete data frequently, or are you only inserting data,
or is there a mix of inserts and deletes? If so, provide more info.

I've come across situations where a large table like this caused lots
of problems. There were lots of concurrent "delete batches" (cleaning
up) and at the same time lots of inserts, while large groups of
selects collected certain sets of data for presentation. Perhaps you
are doing something similar? If so, you may find that it's extremely
important to get the keys right, especially the primary keys, so that
data retrieval (for SELECTs or DELETEs) is as fast as possible (using
the clustered index [the PRIMARY KEY in InnoDB]). If not, or if the
queries overlap, you may find performance degradation a big issue as
Innobase manages the locks to ensure that the concurrent statements
don't interfere.

You can also use MERGE tables sitting on top of MyISAM "per year" (or
"per whatever") tables. That saves you having to search for 2009's data,
since you know it lives in table xxx_2009, so this can be a big win.
MyISAM has the inconvenience that if the server ever crashes,
recovery of these tables can be very time-consuming; InnoDB has a
larger footprint for the same data.
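A rough sketch of that per-year MERGE layout (table and column names are invented for illustration; the underlying MyISAM tables must have identical definitions):

```sql
-- One identical MyISAM table per year.
CREATE TABLE data_2008 (
  col1 INT NOT NULL,
  col2 VARCHAR(64),
  PRIMARY KEY (col1)
) ENGINE=MyISAM;

CREATE TABLE data_2009 LIKE data_2008;

-- A MERGE table presents them as one logical table.
-- (MERGE cannot enforce uniqueness, so a plain KEY is used here.)
CREATE TABLE data_all (
  col1 INT NOT NULL,
  col2 VARCHAR(64),
  KEY (col1)
) ENGINE=MERGE UNION=(data_2008, data_2009) INSERT_METHOD=LAST;

-- Queries scoped to one year can hit the underlying table directly:
SELECT col2 FROM data_2009 WHERE col1 = 42;
```

Queries that span years go through data_all; queries for a known year skip the other tables entirely.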

So without more information on the structure and the use cases, it's
hard to answer your question. In fact, if you have the time, try out
and benchmark different approaches and see which is best for your
requirements. Just remember that as the data grows, the initial
measurements may not be consistent with the behaviour you see later.
Also, if you are looking at a large amount of data like this,
appropriate server tuning can influence performance significantly.

Hope this helps.

Simon


Re: splitting large tables vertically

on 10.05.2009 18:49:09 by Kyong Kim

Simon,
Thanks for the feedback.
I don't have all the details of the schema and workload. Just an
interesting idea that was presented to me.
I think the idea is to split a lengthy secondary-key lookup into 2 primary-key
lookups, and to reduce the cost of clustering the secondary key with
primary-key data by using a shorter INT-type surrogate key. Another downside
is the possible need for foreign keys and the added complexity of insertions
and multi-column updates.

Have you found primary-key lookups to be at least twice as fast as
secondary-key lookups with a VARCHAR-type primary key in InnoDB? The whole
idea is based on the assumption that they are.
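To make the trade-off concrete, a hypothetical sketch of the two layouts being compared (names invented for illustration):

```sql
-- Layout A: natural VARCHAR primary key. InnoDB clusters rows on the
-- PK, and every secondary-index entry carries a copy of the long PK.
CREATE TABLE items_varchar_pk (
  name VARCHAR(100) NOT NULL,
  col2 VARCHAR(64),
  PRIMARY KEY (name)
) ENGINE=InnoDB;

-- Layout B: short INT surrogate key; the VARCHAR becomes a secondary
-- index. A lookup by name is now two steps: secondary index -> id,
-- then id -> clustered row.
CREATE TABLE items_int_pk (
  id   INT UNSIGNED NOT NULL AUTO_INCREMENT,
  name VARCHAR(100) NOT NULL,
  col2 VARCHAR(64),
  PRIMARY KEY (id),
  KEY (name)
) ENGINE=InnoDB;
```

In layout B the secondary-index entries store the 4-byte id instead of the full VARCHAR, which is where the space (and cache) saving comes from; whether that outweighs the extra lookup step is exactly the question being asked.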

MyISAM conversion is an option too. Have you found the table
maintenance to be a significant overhead? I've experienced MyISAM table
corruption in production, and I'm more inclined to go with InnoDB for its
reliability. This is a fairly important table.

Any insight would be much appreciated.
Kyong



Inst. Web Programmer
CMDBA 5.0


Re: splitting large tables vertically

on 10.05.2009 23:40:57 by Simon J Mudd

kimkyong@fhda.edu ("Kyong Kim") writes:

> I don't have all the details of the schema and workload. Just an
> interesting idea that was presented to me.
> I think the idea is to split a lengthy secondary key lookup into 2 primary
> key lookups and reduce the cost of clustering secondary key with primary
> key data by using a shorter INT type surrogate key. Another downside is
> the possible need of foreign keys and added complexity of insertions and
> multi-column updates.
>
> Have you found primary key lookups to be at least twice as fast as
> secondary key lookups with VARCHAR type primary key in InnoDB? The whole
> idea is based on the assumption that it is.

That's why you really need to be more precise in the data structures
you are planning on using. This can change the results significantly.

So no, I don't have any specific answers to your questions, as you don't
provide any specific information in what you ask.

> Also, MyISAM conversion is an option too. Have you found the table
> maintenance to be a significant overhead? I've experienced MyISAM table
> corruptions in production and I'm more inclined to go with InnoDB for its
> reliability. This is a fairly important table.

Well, disk (and memory) usage can also be important, and since InnoDB
storage is "less efficient", the conversion may actually degrade
performance. Until you are more concrete it's hard to say what will
work best for you.

Simon



Re: splitting large tables vertically

on 11.05.2009 00:33:35 by Kyong Kim

> That's why you really need to be more precise in the data structures
> you are planning on using. This can change the results significantly.
>
> So no, I don't have any specific answers to your questions as you don't
> provide any specific information in what you ask.

Yeah. Let me see if I can follow up with more concrete information
sometime in the future. I find performance tuning to be workload-dependent,
and it is difficult to project without having all the details.

> Well disk (and memory) usage can also be important so as it seems
> InnoDB storage is "less efficient" this may actually degrade
> performance. Until you are more concrete it's hard to say what will
> work best for you.

At this point I'm fairly convinced that this idea of vertically partitioning
a table into column tables will degrade performance unless the workload is
tailor-made for it.
The cost of the joins, and of separating the index lookups from the column
data, seems a bit too high for almost any scenario.
Thanks for the prompt response. I'll follow up with you if I have more
concrete details.

Thanks
Kyong




Inst. Web Programmer
CMDBA 5.0
