performance of query for normalized DB

performance of query for normalized DB

am 19.10.2006 14:27:36 von phlype.johnson

I'm struggling to find the best query from performance point of view
and readability for a normalized DB design.
To illustrate better my question on whether normalized designs lead to
more complex queries yes or no, I have prepared an example. The example
is a database with the following tables:
*table person with fields:
-persid: autoincrement id
-name: name of the person
*table material with fields:
-materialid: autoincrement id
-material: name of the material eg "wood"
*table color with fields:
-colorid: autoincrement id
-color: name of the color eg "green"
*table persmaterial with fields:
-persmatid: autoincrement id
-persid: link to table person
-materialid: link to table material
*table perscolor with fields:
-perscolorid: autoincrement id
-persid: link to table person
-colorid: link to table color
Using these tables it is straightforward to store the preference of a
certain person for colors and materials.
Now use the statements at the end of the post to create the tables and
populate them with intial values. Next we want to find all persons who
like the colors red or blue and also like the color green. Furthermore
the person should have a preference for iron as material. I understand
there are several ways to obtain the requested result.
Option 1:
The initial query can be written in pseudocode as
Find all persons that (like as color (red OR blue) AND green) AND that
(like as material iron). I do not see right away how to write this as a
query with joins so I rewrite the pseudo query as follows:
Find all persons that ((like as color red AND green) AND that (like as
material iron)) OR ((like as color blue AND green) AND that (like as
material iron))
this leads to the following query
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid AND (pc.colorid=1 OR pc.colorid=2) AND
p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING
(count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
UNION
(SELECT p.persid FROM person p, perscolor pc, persmaterial pm WHERE
p.persid=pc.persid AND (pc.colorid=2 OR pc.colorid=3) AND
p.persid=pm.persid AND pm.materialid=2 GROUP BY p.persid HAVING
(count(DISTINCT pc.colorid)=2 AND count(DISTINCT pm.materialid)=1))
You can say that for someone familiar to sql this is not overly
complicated but the problem I see is the following. I want to use this
database in a webapplication where a user will be able to find persons
with certain preferences. The user will be allowed to play around with
AND and ORs in his request. To state it differently the user will have
the possibility to type in a pseudo query like the example I used
above. As you have seen above, to get to the final sql query I need to
do a transformation of the pseudo query to the sql query. Secondly, if
we have many criteria you can see easily that the number of unions can
grow exponentially if the user starts playing around with ANDs and ORs
for a certain property.

Option 2:
The pseudo query
Find all persons that (like as color (red OR blue) AND green) AND that
(like as material iron)
can straightforwardly be coded in a query using subqueries:
SELECT persid FROM person p WHERE
(EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=1 AND
p.persid=pc.persid)
OR
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=3 AND
p.persid=pc.persid))
AND
EXISTS(SELECT * FROM perscolor pc WHERE pc.colorid=2 AND
p.persid=pc.persid)
AND
EXISTS(SELECT * FROM persmaterial pm WHERE pm.materialid=2 AND
p.persid=pm.persid)

Both options are quite different. I am no expert but this is my
understanding of the pros and contras of the 2 options:
*option 1:
-difficult to automatically generate the query from the pseudo query
*option 2:
+easy to generate query from pseudo query
-are subqueries not inefficient? If for each subquery you have to go
over the parameters outside the subquery, it means pretty much that for
each subquery you go over the entire person table; I might be wrong and
subqueries might be more optimized but I have no idea on this

Overall, I would like to know
*if there are other options to translate the above pseudo query into an
sql query?
*what of the otpions (proposed + new ones) are best from a performance
point of view?

-- phpMyAdmin SQL Dump
-- version 2.6.1
-- http://www.phpmyadmin.net
--
-- Host: localhost
-- Generation Time: Oct 19, 2006 at 01:13 PM
-- Server version: 4.1.9
-- PHP Version: 4.3.10
--
-- Database: `aston`
--

-- --------------------------------------------------------

--
-- Table structure for table `color`
--

CREATE TABLE `color` (
`colorid` int(11) NOT NULL auto_increment,
`color` varchar(30) NOT NULL default '',
PRIMARY KEY (`colorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=5 ;

--
-- Dumping data for table `color`
--

INSERT INTO `color` VALUES (1, 'red');
INSERT INTO `color` VALUES (2, 'green');
INSERT INTO `color` VALUES (3, 'blue');
INSERT INTO `color` VALUES (4, 'yellow');

-- --------------------------------------------------------

--
-- Table structure for table `material`
--

CREATE TABLE `material` (
`materialid` int(11) NOT NULL auto_increment,
`material` varchar(30) NOT NULL default '',
PRIMARY KEY (`materialid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=3 ;

--
-- Dumping data for table `material`
--

INSERT INTO `material` VALUES (1, 'wood');
INSERT INTO `material` VALUES (2, 'iron');

-- --------------------------------------------------------

--
-- Table structure for table `perscolor`
--

CREATE TABLE `perscolor` (
`perscolorid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`colorid` int(11) NOT NULL default '0',
PRIMARY KEY (`perscolorid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=7 ;

--
-- Dumping data for table `perscolor`
--

INSERT INTO `perscolor` VALUES (1, 1, 1);
INSERT INTO `perscolor` VALUES (2, 1, 2);
INSERT INTO `perscolor` VALUES (3, 2, 1);
INSERT INTO `perscolor` VALUES (5, 3, 3);
INSERT INTO `perscolor` VALUES (6, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `persmaterial`
--

CREATE TABLE `persmaterial` (
`persmatid` int(11) NOT NULL auto_increment,
`persid` int(11) NOT NULL default '0',
`materialid` int(11) NOT NULL default '0',
PRIMARY KEY (`persmatid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=6 ;

--
-- Dumping data for table `persmaterial`
--

INSERT INTO `persmaterial` VALUES (1, 1, 1);
INSERT INTO `persmaterial` VALUES (2, 1, 2);
INSERT INTO `persmaterial` VALUES (3, 2, 1);
INSERT INTO `persmaterial` VALUES (5, 3, 2);

-- --------------------------------------------------------

--
-- Table structure for table `person`
--

CREATE TABLE `person` (
`persid` int(11) NOT NULL auto_increment,
`name` varchar(30) NOT NULL default '',
PRIMARY KEY (`persid`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1 AUTO_INCREMENT=4 ;

--
-- Dumping data for table `person`
--

INSERT INTO `person` VALUES (1, 'john');
INSERT INTO `person` VALUES (2, 'emily');
INSERT INTO `person` VALUES (3, 'liz');