Join based upon LIKE
am 25.04.2011 20:33:46 von Jerry SchwartzI have to match lists of new publications against our database, so that I can
replace the existing publications in our catalog. For example,
"The UK Market for Puppies in February 2011"
would be a replacement for
"The UK Market for Puppies in December 2010"
Unfortunately, the publishers aren't particularly careful with their titles.
One might even say they are perverse. I am likely to get
"UK Market: Puppies - Feb 2011"
as replacement for
"The UK Market for Puppies in December 2010"
You can see that a straight match by title is not going to work.
Here's what I've been doing:
=====
SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');
CREATE TEMPORARY TABLE new_titles (
new_title VARCHAR(255), INDEX (new_title),
new_title_like VARCHAR(255), INDEX (new_title_like)
);
INSERT INTO new_titles
VALUES
('Alternative Energy Monthly Deal Analysis - M&A and Investment Trends, April
2011', 'Alternative Energy Monthly Deal Analysis%M&A%Investment Trends%'),
('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company
Share, Price Trends, Capacity Forecasts of All Active and Planned Plants',
'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price
Trends%Capacity Forecasts of All Active%Planned Plants'),
....
('Underground Gas Storage Industry Outlook in North America, 2011 - Details of
All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage
Industry Outlook%North America%Details of All Operating%Planned Gas Storage
Sites to%'),
('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017',
'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');
SELECT prod.prod_title AS `Title IN Database`,
new_titles.new_title AS `Title IN Feed`,
prod.prod_num AS `ID`
FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like)
AND prod.pub_id = @PUBID AND prod.prod_discont = 0
ORDER BY new_titles.new_title;
======
(I've written code that substitutes "%" for certain strings that I specify,
and there is some trial and error involved.)
Here's how MySQL handles that SELECT:
*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: new_titles
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 47
Extra: Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: prod
type: ref
possible_keys: pub_id
key: pub_id
key_len: 48
ref: const
rows: 19607
Extra: Using where
=====
Here's the important part of the table `prod`:
=====
Table: prod
Create Table: CREATE TABLE `prod` (
`prod_id` varchar(15) NOT NULL DEFAULT '',
`prod_num` mediumint(6) unsigned DEFAULT NULL,
`prod_title` varchar(255) DEFAULT NULL,
`prod_type` varchar(2) DEFAULT NULL,
`prod_vat_pct` decimal(5,2) DEFAULT NULL,
`prod_discont` tinyint(1) DEFAULT NULL,
`prod_replacing` mediumint(6) unsigned DEFAULT NULL,
`prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
`prod_ready` tinyint(1) DEFAULT NULL,
`pub_id` varchar(15) DEFAULT NULL,
....
PRIMARY KEY (`prod_id`),
UNIQUE KEY `prod_num` (`prod_num`),
KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
KEY `pub_id` (`pub_id`),
KEY `prod_title` (`prod_title`),
FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
=====
This works reasonably well for a small number (perhaps 200-300) of new
products; but now I've been handed a list of over 15000 to stuff into the
table `new_titles`! This motivates me to wonder if there is a better way,
since I expect this to take a very long time.
Suggestions?
Regards,
Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032
860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com
--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org