self-joins in hierarchical queries: optimization problem
self-joins in hierarchical queries: optimization problem
am 29.10.2009 16:29:58 von Olga Lyashevska
Dear all,
I have a table which contains taxonomic data (species, genera, family,
order, class) and it is organized as adjacency list model.
mysql> select* from taxonomic_units1 limit 5;
+-----+-------------------------------------------+--------- +
| tsn | name | parent_tsn | rank_id |
+-----+--------------------------+--------------+---------+
| 50 | Bacteria | 0 | 10 |
| 51 | Schizomycetes 202421 | 60 |
| 52 | Archangiaceae | 51 | 140 |
| 53 | Pseudomonadale | 51 | 100 |
| 54 | Rhodobacteriineae | 53 | 110 |
+-----+-----------------------------+------------+---------+
I am trying to flatten it, so that it can be used in further analysis
(e.g. in R)
I have been trying to run the following query, and it does what I want
it to do, but it takes really long time to get it done.
As a matter of fact I was not patient enough to get the whole output
and instead set LIMIT 10.
SELECT O1.name AS tclass, O2.name AS torder, O4.name AS tfamily,
O5.name AS tgenus, O6.name AS tspecies
FROM taxonomic_units1 AS O1
LEFT OUTER JOIN
taxonomic_units1 AS O2
ON O1.tsn = O2.parent_tsn
LEFT OUTER JOIN
taxonomic_units1 AS O3
ON O2.tsn = O3.parent_tsn
LEFT OUTER JOIN
taxonomic_units1 AS O4
ON O3.tsn = O4.parent_tsn
LEFT OUTER JOIN
taxonomic_units1 AS O5
ON O4.tsn = O5.parent_tsn
LEFT OUTER JOIN
taxonomic_units1 AS O6
ON O5.tsn = O6.parent_tsn
LIMIT 10;
+---------------+-----------------+------------------+------ --------
+----------------------------------------------------------- +
| tclass | torder |
tfamily | tgenus |
tspecies |
+---------------+-----------------+------------------+------ --------
+----------------------------------------------------------- +
| Bacteria | NULL |
NULL | NULL |
NULL |
| Schizomycetes | Archangiaceae | NULL |
NULL | NULL |
| Schizomycetes | Pseudomonadales | NULL |
NULL | NULL |
| Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter
| Nitrobacteragilis |
| Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter
| Nitrobacterflavus |
| Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter
| Nitrobacteroligotrophis |
| Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter
| Nitrobacterpolytrophus |
| Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter
| Nitrobacterpunctata |
I have checked this query with EXPLAIN, and it is not using any
indices, even though column tsn is set as index in original table.
+----+-------------+-------+------+---------------+------+-- -------
+------+--------+-------+
| id | select_type | table | type | possible_keys | key | key_len |
ref | rows | Extra |
+----+-------------+-------+------+---------------+------+-- -------
+------+--------+-------+
| 1 | SIMPLE | O1 | ALL | NULL | NULL | NULL |
NULL | 483305 | |
| 1 | SIMPLE | O2 | ALL | NULL | NULL | NULL |
NULL | 483305 | |
| 1 | SIMPLE | O3 | ALL | NULL | NULL | NULL |
NULL | 483305 | |
| 1 | SIMPLE | O4 | ALL | NULL | NULL | NULL |
NULL | 483305 | |
| 1 | SIMPLE | O5 | ALL | NULL | NULL | NULL |
NULL | 483305 | |
| 1 | SIMPLE | O6 | ALL | NULL | NULL | NULL |
NULL | 483305 | |
+----+-------------+-------+------+---------------+------+-- -------
+------+--------+-------+
6 rows in set (0.00 sec)
What is wrong with this query? Or is it a problem of all adjacency
list models?
Is there a way to get columns indexed using self-joins?
Thanks,
Olga
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org
Re: self-joins in hierarchical queries: optimization problem
am 29.10.2009 16:46:11 von kabel
Not sure if this is the exact problem you're trying to solve, but this helped
me in a similar situation.
http://dev.mysql.com/tech-resources/articles/hierarchical-da ta.html
kabel
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org
Re: self-joins in hierarchical queries: optimization problem
am 29.10.2009 16:55:04 von Olga Lyashevska
Thanks Kabel,
> Not sure if this is the exact problem you're trying to solve, but
> this helped
> me in a similar situation.
>
> http://dev.mysql.com/tech-resources/articles/hierarchical-da ta.html
Yes, I have seen this article before, and it is really nice. However
they do not discuss any optimization issues.
Now I was reading Joe Celko's book. Probably my query is simply not
the best solution for this type of problem.
Thanks,
Olga
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org
Re: self-joins in hierarchical queries: optimization problem
am 29.10.2009 17:53:25 von Sergey Petrunya
Hi Olga,
On Thu, Oct 29, 2009 at 03:29:58PM +0000, Olga Lyashevska wrote:
> I have a table which contains taxonomic data (species, genera, family,
> order, class) and it is organized as adjacency list model.
>
> mysql> select* from taxonomic_units1 limit 5;
> +-----+-------------------------------------------+--------- +
> | tsn | name | parent_tsn | rank_id |
> +-----+--------------------------+--------------+---------+
> | 50 | Bacteria | 0 | 10 |
> | 51 | Schizomycetes 202421 | 60 |
> | 52 | Archangiaceae | 51 | 140 |
> | 53 | Pseudomonadale | 51 | 100 |
> | 54 | Rhodobacteriineae | 53 | 110 |
> +-----+-----------------------------+------------+---------+
>
> I am trying to flatten it, so that it can be used in further analysis
> (e.g. in R)
> I have been trying to run the following query, and it does what I want
> it to do, but it takes really long time to get it done.
> As a matter of fact I was not patient enough to get the whole output and
> instead set LIMIT 10.
>
> SELECT O1.name AS tclass, O2.name AS torder, O4.name AS tfamily, O5.name
> AS tgenus, O6.name AS tspecies
> FROM taxonomic_units1 AS O1
> LEFT OUTER JOIN
> taxonomic_units1 AS O2
> ON O1.tsn = O2.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O3
> ON O2.tsn = O3.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O4
> ON O3.tsn = O4.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O5
> ON O4.tsn = O5.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O6
> ON O5.tsn = O6.parent_tsn
> LIMIT 10;
>
> +---------------+-----------------+------------------+------ --------
> +----------------------------------------------------------- +
> | tclass | torder | tfamily
> | tgenus | tspecies
> |
> +---------------+-----------------+------------------+------ --------
> +----------------------------------------------------------- +
> | Bacteria | NULL | NULL
> | NULL | NULL |
> | Schizomycetes | Archangiaceae | NULL |
> NULL | NULL |
> | Schizomycetes | Pseudomonadales | NULL | NULL
> | NULL |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacteragilis |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacterflavus |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacteroligotrophis |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacterpolytrophus |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacterpunctata |
>
>
> I have checked this query with EXPLAIN, and it is not using any indices,
> even though column tsn is set as index in original table.
>
> +----+-------------+-------+------+---------------+------+-- -------
> +------+--------+-------+
> | id | select_type | table | type | possible_keys | key | key_len | ref
> | rows | Extra |
> +----+-------------+-------+------+---------------+------+-- -------
> +------+--------+-------+
> | 1 | SIMPLE | O1 | ALL | NULL | NULL | NULL |
> NULL | 483305 | |
> | 1 | SIMPLE | O2 | ALL | NULL | NULL | NULL |
> NULL | 483305 | |
> | 1 | SIMPLE | O3 | ALL | NULL | NULL | NULL |
> NULL | 483305 | |
> | 1 | SIMPLE | O4 | ALL | NULL | NULL | NULL |
> NULL | 483305 | |
> | 1 | SIMPLE | O5 | ALL | NULL | NULL | NULL |
> NULL | 483305 | |
> | 1 | SIMPLE | O6 | ALL | NULL | NULL | NULL |
> NULL | 483305 | |
> +----+-------------+-------+------+---------------+------+-- -------
> +------+--------+-------+
> 6 rows in set (0.00 sec)
>
>
> What is wrong with this query? Or is it a problem of all adjacency list
> models?
> Is there a way to get columns indexed using self-joins?
For an outer join like
... taxonomic_units1 AS O1
LEFT OUTER JOIN taxonomic_units1 AS O2
ON O1.tsn = O2.parent_tsn
current optimizer has only one option(*): use Nested-Loops Join algorthm, with
the outer table being the first one. That is, it will execute these loops:
for each record from O1
for each record in O2 such that O1.tsn = O2.parent_tsn
...
this makes it clear that index on O1.tsn will not be useful. You need indexes
on parent_tsn column.
(*) certain kinds of WHERE clause may open other opportinites.
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org
Re: self-joins in hierarchical queries: optimization problem
am 29.10.2009 17:58:29 von Sergey Petrunya
On Thu, Oct 29, 2009 at 07:53:25PM +0300, Sergey Petrunya wrote:
> ... taxonomic_units1 AS O1
> LEFT OUTER JOIN taxonomic_units1 AS O2
> ON O1.tsn = O2.parent_tsn
>
> current optimizer has only one option(*): use Nested-Loops Join algorthm, with
> the outer table being the first one. That is, it will execute these loops:
>
> for each record from O1
> for each record in O2 such that O1.tsn = O2.parent_tsn
> ...
>
> this makes it clear that index on O1.tsn will not be useful. You need indexes
> on parent_tsn column.
I meant, one index over taxonomic_units1.parent_tsn .
BR
Sergey
--
Sergey Petrunia, Software Developer
Monty Program AB, http://askmonty.org
Blog: http://s.petrunia.net/blog
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org
Re: self-joins in hierarchical queries: optimization problem
am 29.10.2009 18:04:39 von Mikhail Berman
Olga,
Would you post "SHOW CREATE TABLE taxonomic_units1\G;"?
It should give us more info on the table you are dealing with
Regards,
Mikhail Berman
Olga Lyashevska wrote:
> Dear all,
>
> I have a table which contains taxonomic data (species, genera, family,
> order, class) and it is organized as adjacency list model.
>
> mysql> select* from taxonomic_units1 limit 5;
> +-----+-------------------------------------------+--------- +
> | tsn | name | parent_tsn | rank_id |
> +-----+--------------------------+--------------+---------+
> | 50 | Bacteria | 0 | 10 |
> | 51 | Schizomycetes 202421 | 60 |
> | 52 | Archangiaceae | 51 | 140 |
> | 53 | Pseudomonadale | 51 | 100 |
> | 54 | Rhodobacteriineae | 53 | 110 |
> +-----+-----------------------------+------------+---------+
>
> I am trying to flatten it, so that it can be used in further analysis
> (e.g. in R)
> I have been trying to run the following query, and it does what I want
> it to do, but it takes really long time to get it done.
> As a matter of fact I was not patient enough to get the whole output
> and instead set LIMIT 10.
>
> SELECT O1.name AS tclass, O2.name AS torder, O4.name AS tfamily,
> O5.name AS tgenus, O6.name AS tspecies
> FROM taxonomic_units1 AS O1
> LEFT OUTER JOIN
> taxonomic_units1 AS O2
> ON O1.tsn = O2.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O3
> ON O2.tsn = O3.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O4
> ON O3.tsn = O4.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O5
> ON O4.tsn = O5.parent_tsn
> LEFT OUTER JOIN
> taxonomic_units1 AS O6
> ON O5.tsn = O6.parent_tsn
> LIMIT 10;
>
> +---------------+-----------------+------------------+------ --------+--------------------------------------------------- --------+
>
> | tclass | torder | tfamily | tgenus | tspecies |
> +---------------+-----------------+------------------+------ --------+--------------------------------------------------- --------+
>
> | Bacteria | NULL | NULL | NULL | NULL |
> | Schizomycetes | Archangiaceae | NULL | NULL | NULL |
> | Schizomycetes | Pseudomonadales | NULL | NULL | NULL |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacteragilis |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacterflavus |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacteroligotrophis |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacterpolytrophus |
> | Schizomycetes | Pseudomonadales | Nitrobacteraceae | Nitrobacter |
> Nitrobacterpunctata |
>
>
> I have checked this query with EXPLAIN, and it is not using any
> indices, even though column tsn is set as index in original table.
>
> +----+-------------+-------+------+---------------+------+-- -------+------+--------+-------+
>
> | id | select_type | table | type | possible_keys | key | key_len |
> ref | rows | Extra |
> +----+-------------+-------+------+---------------+------+-- -------+------+--------+-------+
>
> | 1 | SIMPLE | O1 | ALL | NULL | NULL | NULL | NULL | 483305 | |
> | 1 | SIMPLE | O2 | ALL | NULL | NULL | NULL | NULL | 483305 | |
> | 1 | SIMPLE | O3 | ALL | NULL | NULL | NULL | NULL | 483305 | |
> | 1 | SIMPLE | O4 | ALL | NULL | NULL | NULL | NULL | 483305 | |
> | 1 | SIMPLE | O5 | ALL | NULL | NULL | NULL | NULL | 483305 | |
> | 1 | SIMPLE | O6 | ALL | NULL | NULL | NULL | NULL | 483305 | |
> +----+-------------+-------+------+---------------+------+-- -------+------+--------+-------+
>
> 6 rows in set (0.00 sec)
>
>
> What is wrong with this query? Or is it a problem of all adjacency
> list models?
> Is there a way to get columns indexed using self-joins?
>
> Thanks,
> Olga
>
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org
Re: self-joins in hierarchical queries: optimization problem
am 30.10.2009 11:39:36 von Olga Lyashevska
Dear Michail and Sergey,
Thank you very much for your responses and kind suggestions!
On 29.10.2009, at 16:53, Sergey Petrunya wrote:
> this makes it clear that index on O1.tsn will not be useful. You
> need indexes
> on parent_tsn column.
mysql> alter table taxonomic_units1 add index (parent_tsn);
Query OK, 483305 rows affected (7.76 sec)
Records: 483305 Duplicates: 0 Warnings: 0
And it is solved! It works like a charm!
mysql> CREATE TABLE flatfile
-> SELECT O1.name AS tclass, O2.name AS torder, O4.name AS
tfamily, O5.name AS tgenus, O6.name AS tspecies
-> FROM taxonomic_units1 AS O1
-> LEFT OUTER JOIN
-> taxonomic_units1 AS O2
-> ON O1.tsn = O2.parent_tsn
-> LEFT OUTER JOIN
-> taxonomic_units1 AS O3
-> ON O2.tsn = O3.parent_tsn
-> LEFT OUTER JOIN
-> taxonomic_units1 AS O4
-> ON O3.tsn = O4.parent_tsn
-> LEFT OUTER JOIN
-> taxonomic_units1 AS O5
-> ON O4.tsn = O5.parent_tsn
-> LEFT OUTER JOIN
-> taxonomic_units1 AS O6
-> ON O5.tsn = O6.parent_tsn;
Query OK, 2051444 rows affected (2 min 10.96 sec)
Records: 2051444 Duplicates: 0 Warnings: 0
My next task here is to match tspecies with another list of species
from a different table and display only those which appear in both
tables.
It can be done:
mysql> select flatfile.tclass, flatfile.torder, flatfile.tfamily,
flatfile.tgenus, flatfile.tspecies from flatfile, marinespecies where
tspecies=speciesmarine;
I wonder if it is not a better idea to incorporate this query into the
first one, perhaps in a form of subquery?
Thanks again,
Olga
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org