FW: Join based upon LIKE

am 28.04.2011 21:28:52 von Jerry Schwartz

No takers?

-----Original Message-----
From: Jerry Schwartz [mailto:jerry@gii.co.jp]
Sent: Monday, April 25, 2011 2:34 PM
To: 'Mailing-List mysql'
Subject: Join based upon LIKE

I have to match lists of new publications against our database, so that I can
replace the existing publications in our catalog. For example,

"The UK Market for Puppies in February 2011"

would be a replacement for

"The UK Market for Puppies in December 2010"

Unfortunately, the publishers aren't particularly careful with their titles.
One might even say they are perverse. I am likely to get

"UK Market: Puppies - Feb 2011"

as replacement for

"The UK Market for Puppies in December 2010"

You can see that a straight match by title is not going to work.

Here's what I've been doing:

=====

SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');

CREATE TEMPORARY TABLE new_titles (
new_title VARCHAR(255), INDEX (new_title),
new_title_like VARCHAR(255), INDEX (new_title_like)
);

INSERT INTO new_titles
VALUES

('Alternative Energy Monthly Deal Analysis - M&A and Investment Trends, April
2011', 'Alternative Energy Monthly Deal Analysis%M&A%Investment Trends%'),
('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company
Share, Price Trends, Capacity Forecasts of All Active and Planned Plants',
'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price
Trends%Capacity Forecasts of All Active%Planned Plants'),
....
('Underground Gas Storage Industry Outlook in North America, 2011 - Details of
All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas Storage
Industry Outlook%North America%Details of All Operating%Planned Gas Storage
Sites to%'),
('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017',
'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');

SELECT prod.prod_title AS `Title IN Database`,
new_titles.new_title AS `Title IN Feed`,
prod.prod_num AS `ID`
FROM new_titles JOIN prod ON prod.prod_title LIKE (new_titles.new_title_like)
AND prod.pub_id = @PUBID AND prod.prod_discont = 0
ORDER BY new_titles.new_title;
======

(I've written code that substitutes "%" for certain strings that I specify,
and there is some trial and error involved.)

Here's how MySQL handles that SELECT:

*************************** 1. row ***************************
id: 1
select_type: SIMPLE
table: new_titles
type: ALL
possible_keys: NULL
key: NULL
key_len: NULL
ref: NULL
rows: 47
Extra: Using filesort
*************************** 2. row ***************************
id: 1
select_type: SIMPLE
table: prod
type: ref
possible_keys: pub_id
key: pub_id
key_len: 48
ref: const
rows: 19607
Extra: Using where
=====

Here's the important part of the table `prod`:

=====

Table: prod
Create Table: CREATE TABLE `prod` (
`prod_id` varchar(15) NOT NULL DEFAULT '',
`prod_num` mediumint(6) unsigned DEFAULT NULL,
`prod_title` varchar(255) DEFAULT NULL,
`prod_type` varchar(2) DEFAULT NULL,
`prod_vat_pct` decimal(5,2) DEFAULT NULL,
`prod_discont` tinyint(1) DEFAULT NULL,
`prod_replacing` mediumint(6) unsigned DEFAULT NULL,
`prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
`prod_ready` tinyint(1) DEFAULT NULL,
`pub_id` varchar(15) DEFAULT NULL,
....
PRIMARY KEY (`prod_id`),
UNIQUE KEY `prod_num` (`prod_num`),
KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
KEY `pub_id` (`pub_id`),
KEY `prod_title` (`prod_title`),
FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8

=====

This works reasonably well for a small number (perhaps 200-300) of new
products; but now I've been handed a list of over 15000 to stuff into the
table `new_titles`! This motivates me to wonder if there is a better way,
since I expect this to take a very long time.

Suggestions?

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join based upon LIKE

am 28.04.2011 22:18:18 von Johan De Meersman

----- Original Message -----
> From: "Jerry Schwartz"
>
> No takers?

Not willingly, no :-p

This is a pretty complex problem, as SQL itself isn't particularly well-equipped to deal with fuzzy data. One approach that might work is using a fulltext indexing engine (MySQL's built-in ft indices, or an external one like Solr or something) and doing best-fit matches on the keywords of the title you're looking for.

--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Join based upon LIKE

am 29.04.2011 00:07:34 von Jerry Schwartz

>-----Original Message-----
>From: Johan De Meersman [mailto:vegivamp@tuxera.be]
>Sent: Thursday, April 28, 2011 4:18 PM
>To: Jerry Schwartz
>Cc: mysql mailing list
>Subject: Re: Join based upon LIKE
>
>
>----- Original Message -----
>> From: "Jerry Schwartz"
>>
>> No takers?
>
>Not willingly, no :-p
>
>This is a pretty complex problem, as SQL itself isn't particularly well-
>equipped to deal with fuzzy data. One approach that might work is using a
>fulltext indexing engine (MySQL's built-in ft indices, or an external one
>like
>Solr or something) and doing best-fit matches on the keywords of the title
>you're looking for.
>
[JS] This isn't the only place I have to deal with fuzzy data. :-( Discretion
prohibits further comment.

A full-text index would work if I were only looking for one title at a time,
but I don't know if that would be a good idea if I have a list of 10000
titles. That would pretty much require either 10000 separate queries or a
very, very long WHERE clause.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

>
>--
>Bier met grenadyn
>Is als mosterd by den wyn
>Sy die't drinkt, is eene kwezel
>Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join based upon LIKE

am 29.04.2011 11:56:05 von Johan De Meersman

----- Original Message -----
> From: "Jerry Schwartz"
>
> [JS] This isn't the only place I have to deal with fuzzy data. :-(
> Discretion prohibits further comment.

Heh. What you *really* need, is a LART. Preferably one of the spiked variety.

> A full-text index would work if I were only looking for one title at
> a time, but I don't know if that would be a good idea if I have a list of
> 10000 titles. That would pretty much require either 10000 separate queries
> or a very, very long WHERE clause.

Yes, unfortunately. You should see if you can introduce a form of data normalisation - say, shadow fields with corrected entries, or functionality in the application that suggests correct entries based on what the user typed.

Or, if the money's there, you could have a look at Amazon Mechanical Turk (yes, really) for cheap-ish data correction.

--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Join based upon LIKE

am 29.04.2011 19:09:00 von Jerry Schwartz

>-----Original Message-----
>From: Johan De Meersman [mailto:vegivamp@tuxera.be]
>Sent: Friday, April 29, 2011 5:56 AM
>To: Jerry Schwartz
>Cc: mysql mailing list
>Subject: Re: Join based upon LIKE
>
>
>----- Original Message -----
>> From: "Jerry Schwartz"
>>
>> [JS] This isn't the only place I have to deal with fuzzy data. :-(
>> Discretion prohibits further comment.
>
>Heh. What you *really* need, is a LART. Preferably one of the spiked variety.
>
[JS] Unless a LART is a demon of some kind, I don't know what it is.

>> A full-text index would work if I were only looking for one title at
>> a time, but I don't know if that would be a good idea if I have a list of
>> 10000 titles. That would pretty much require either 10000 separate queries
>> or a very, very long WHERE clause.
>
>Yes, unfortunately. You should see if you can introduce a form of data
>normalisation - say, shadow fields with corrected entries, or functionality
>in
>the application that suggests correct entries based on what the user typed.
>
[JS] Except for obvious misspellings and non-ASCII characters, I do not have
the freedom to muck with the text. If the data were created in-house, I could
correct it on the way in; but it comes from myriad other companies.

>Or, if the money's there, you could have a look at Amazon Mechanical Turk
>(yes,
>really) for cheap-ish data correction.
>
[JS] Again, I can't change the data. The titles are assigned by the
publishers. Think what would happen if Amazon decided to "fix" the titles of
books. "Ain't Misbehavin" would, at best, turn into "I am not misbehaving".

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

>--
>Bier met grenadyn
>Is als mosterd by den wyn
>Sy die't drinkt, is eene kwezel
>Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Join based upon LIKE

am 29.04.2011 19:09:00 von Jerry Schwartz

>-----Original Message-----
>From: Jim McNeely [mailto:jim@newcenturydata.com]
>Sent: Thursday, April 28, 2011 6:43 PM
>To: Jerry Schwartz
>Subject: Re: Join based upon LIKE
>
>It just smells wrong, a nicer system would have you joining on ID's of some
>kind so that spelling wouldn't matter. I don't know the full situation for
>you
>though.
>
[JS] That would be nice, wouldn't it.

In a nutshell, we sell publications. Publishers send us lists of publications.
Some are new, some replace previous editions. (Think of books, almanacs, and
newsletters.) Some publishers make do without any product IDs at all, but most
do use product IDs of some kind.

The problem is that the March edition of a publication might or might not have
the same product ID as the February edition. I try to match them both by
product ID and by title. Sometimes the title will "fuzzy match", but the ID
won't; sometimes the ID will match but the title won't; sometimes (if I'm
really lucky) they both match; and sometimes the ID matches one product and
the title matches another.

It's the fuzzy match by title that gives me fits:

- The title might have a date in it ("Rain in Spain in 2010 Q2"), but not
necessarily in a uniform way ("Rain in Spain Q3 2010").
- The title might have differences in wording or punctuation ("Rain in Spain -
2010Q2").
- The title might have simple misspellings ("Rain in Spian - Q2 2010").

I've written code that looks for troublesome constructs and replaces them with
"%": " in ", "-", " to ", "Q2", "2Q", and more and more. So "Rain in Spain -
2010 Q2" becomes "Rain%Spain%".

I shove those modified titles into a table and do a JOIN ON `prod_title` LIKE
`wild_title`.

This will miss actual misspellings ("Spain", "Spian"). It will also produce a
large number of false positives.

On the back end, I have other code that compares the new titles against the
titles retrieved by that query and decides if they are exact matches,
approximate matches (here I do use regular expressions, as well as lists of
known "bad boys"), or false positives. From there on, it's all hand work.

Pretty big nut, eh?

So that's why I need to use LIKE in my JOIN.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

>Jim McNeely
>
>On Apr 28, 2011, at 12:28 PM, Jerry Schwartz wrote:
>
>> No takers?
>>
>> -----Original Message-----
>> From: Jerry Schwartz [mailto:jerry@gii.co.jp]
>> Sent: Monday, April 25, 2011 2:34 PM
>> To: 'Mailing-List mysql'
>> Subject: Join based upon LIKE
>>
>> I have to match lists of new publications against our database, so that I
>> can
>> replace the existing publications in our catalog. For example,
>>
>> "The UK Market for Puppies in February 2011"
>>
>> would be a replacement for
>>
>> "The UK Market for Puppies in December 2010"
>>
>> Unfortunately, the publishers aren't particularly careful with their
>> titles.
>> One might even say they are perverse. I am likely to get
>>
>> "UK Market: Puppies - Feb 2011"
>>
>> as replacement for
>>
>> "The UK Market for Puppies in December 2010"
>>
>> You can see that a straight match by title is not going to work.
>>
>> Here's what I've been doing:
>>
>> =====
>>
>> SET @PUBID = (SELECT pub.pub_id FROM pub WHERE pub.pub_code = 'GD');
>>
>> CREATE TEMPORARY TABLE new_titles (
>> new_title VARCHAR(255), INDEX (new_title),
>> new_title_like VARCHAR(255), INDEX (new_title_like)
>> );
>>
>> INSERT INTO new_titles
>> VALUES
>>
>> ('Alternative Energy Monthly Deal Analysis - M&A and Investment Trends,
>> April
>> 2011', 'Alternative Energy Monthly Deal Analysis%M&A%Investment Trends%'),
>> ('Asia Pacific Propylene Industry Outlook to 2015 - Market Size, Company
>> Share, Price Trends, Capacity Forecasts of All Active and Planned Plants',
>> 'Asia Pacific Propylene Industry Outlook to%Market Size%Company Share%Price
>> Trends%Capacity Forecasts of All Active%Planned Plants'),
>> ...
>> ('Underground Gas Storage Industry Outlook in North America, 2011 - Details
>of
>> All Operating and Planned Gas Storage Sites to 2014', 'Underground Gas
>Storage
>> Industry Outlook%North America%Details of All Operating%Planned Gas Storage
>> Sites to%'),
>> ('Uveitis Therapeutics - Pipeline Assessment and Market Forecasts to 2017',
>> 'Uveitis Therapeutics%Pipeline Assessment%Market Forecasts to%');
>>
>> SELECT prod.prod_title AS `Title IN Database`,
>> new_titles.new_title AS `Title IN Feed`,
>> prod.prod_num AS `ID`
>> FROM new_titles JOIN prod ON prod.prod_title LIKE
>> (new_titles.new_title_like)
>> AND prod.pub_id = @PUBID AND prod.prod_discont = 0
>> ORDER BY new_titles.new_title;
>> ======
>>
>> (I've written code that substitutes "%" for certain strings that I specify,
>> and there is some trial and error involved.)
>>
>> Here's how MySQL handles that SELECT:
>>
>> *************************** 1. row ***************************
>> id: 1
>> select_type: SIMPLE
>> table: new_titles
>> type: ALL
>> possible_keys: NULL
>> key: NULL
>> key_len: NULL
>> ref: NULL
>> rows: 47
>> Extra: Using filesort
>> *************************** 2. row ***************************
>> id: 1
>> select_type: SIMPLE
>> table: prod
>> type: ref
>> possible_keys: pub_id
>> key: pub_id
>> key_len: 48
>> ref: const
>> rows: 19607
>> Extra: Using where
>> =====
>>
>> Here's the important part of the table `prod`:
>>
>> =====
>>
>> Table: prod
>> Create Table: CREATE TABLE `prod` (
>> `prod_id` varchar(15) NOT NULL DEFAULT '',
>> `prod_num` mediumint(6) unsigned DEFAULT NULL,
>> `prod_title` varchar(255) DEFAULT NULL,
>> `prod_type` varchar(2) DEFAULT NULL,
>> `prod_vat_pct` decimal(5,2) DEFAULT NULL,
>> `prod_discont` tinyint(1) DEFAULT NULL,
>> `prod_replacing` mediumint(6) unsigned DEFAULT NULL,
>> `prod_replaced_by` mediumint(6) unsigned DEFAULT NULL,
>> `prod_ready` tinyint(1) DEFAULT NULL,
>> `pub_id` varchar(15) DEFAULT NULL,
>> ...
>> PRIMARY KEY (`prod_id`),
>> UNIQUE KEY `prod_num` (`prod_num`),
>> KEY `prod_pub_prod_id` (`prod_pub_prod_id`),
>> KEY `pub_id` (`pub_id`),
>> KEY `prod_title` (`prod_title`),
>> FULLTEXT KEY `prod_title_fulltext` (`prod_title`)
>> ) ENGINE=MyISAM DEFAULT CHARSET=utf8
>>
>> =====
>>
>> This works reasonably well for a small number (perhaps 200-300) of new
>> products; but now I've been handed a list of over 15000 to stuff into the
>> table `new_titles`! This motivates me to wonder if there is a better way,
>> since I expect this to take a very long time.
>>
>> Suggestions?
>>
>> Regards,
>>
>> Jerry Schwartz
>> Global Information Incorporated
>> 195 Farmington Ave.
>> Farmington, CT 06032
>>
>> 860.674.8796 / FAX: 860.674.8341
>> E-mail: jerry@gii.co.jp
>> Web site: www.the-infoshop.com
>>
>>
>>
>>
>>
>> --
>> MySQL General Mailing List
>> For list archives: http://lists.mysql.com/mysql
>> To unsubscribe:
>> http://lists.mysql.com/mysql?unsub=jim@newcenturydata.com
>>

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: FW: Join based upon LIKE

am 30.04.2011 05:18:42 von (Halász Sándor) hsv

>>>> 2011/04/28 15:28 -0400, Jerry Schwartz >>>>
No takers?
<<<<<<<<
And this is not real taking, because the algorithm of which I am thinking, the edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see the Wikipedia entry). The obvious implementation takes as many steps as the product of the two compared strings s length. On the other hand, a good implementation of "LIKE" costs the pattern s length added to all the strings against which it matches s length, a sum, not product, of lengths.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: FW: Join based upon LIKE

am 30.04.2011 17:01:11 von (Halász Sándor) hsv

>>>> 2011/04/28 15:28 -0400, Jerry Schwartz >>>>
No takers?
<<<<<<<<
And this is not real taking, because the algorithm of which I am thinking, the edit-distance (Levens(h)tein-distance) algorithm costs too much for you (see the Wikipedia entry), but it yields, I believe, much more nearly such answer as you want.

The obvious implementation takes as many steps as the product of the two compared strings s length. On the other hand, a good implementation of "LIKE" costs the pattern s length added to all the strings against which it matches s length, a sum, not product, of lengths.

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join based upon LIKE

am 01.05.2011 10:00:39 von Johan De Meersman

----- Original Message -----
> From: "Jerry Schwartz"
>
> I shove those modified titles into a table and do a JOIN ON
> `prod_title` LIKE
> `wild_title`.

Roughly what I meant with the shadow fields, yes - keep your own set of data around :-)

I have little more to offer, then, I'm afraid. The soundex() algorithm may or may not be of some use to you; it offers comparison based (roughly) on pronounciation instead of spelling.

Apart from that, you have my deepest sympathy. I hope you can wake up from the nightmare soon :-)

--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Join based upon LIKE

am 02.05.2011 16:09:36 von Jerry Schwartz

>-----Original Message-----
>From: Johan De Meersman [mailto:vegivamp@tuxera.be]
>Sent: Sunday, May 01, 2011 4:01 AM
>To: Jerry Schwartz
>Cc: Jim McNeely; mysql mailing list
>Subject: Re: Join based upon LIKE
>
>
>----- Original Message -----
>> From: "Jerry Schwartz"
>>
>> I shove those modified titles into a table and do a JOIN ON
>> `prod_title` LIKE
>> `wild_title`.
>
>Roughly what I meant with the shadow fields, yes - keep your own set of data
>around :-)
>
>I have little more to offer, then, I'm afraid. The soundex() algorithm may or
>may not be of some use to you; it offers comparison based (roughly) on
>pronounciation instead of spelling.
>
[JS] I've thought about using soundex(), but I'm not quite sure how.

I didn't pursue it much because there are so many odd terms such as chemical
names, but perhaps I should give it a try in my infinite free time.

>Apart from that, you have my deepest sympathy. I hope you can wake up from
>the
>nightmare soon :-)
>
[JS] Thanks for your condolences.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

>--
>Bier met grenadyn
>Is als mosterd by den wyn
>Sy die't drinkt, is eene kwezel
>Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join based upon LIKE

am 03.05.2011 11:31:05 von Johan De Meersman

http://www.gedpage.com/soundex.html offers a simple explanation of what it does.

One possibility would be building a referential table with only a recordID and soundex column, unique over both; and filling that with the soundex of individual nonjunk words.

So, from the titles

1 | Rain in Spain
2 | Spain's Rain

you'd get

1 | R500
1 | S150
2 | S150
2 | R500

From thereon, you can see that all the same words have been used - ignoring a lot of spelling errors like Spian. Obviously not a magic solution, but it's a start.

----- Original Message -----
> From: "Jerry Schwartz"
> To: "Johan De Meersman"
> Cc: "Jim McNeely" , "mysql mailing list"
> Sent: Monday, 2 May, 2011 4:09:36 PM
> Subject: RE: Join based upon LIKE
>
> [JS] I've thought about using soundex(), but I'm not quite sure how.
>
> I didn't pursue it much because there are so many odd terms such as
> chemical
> names, but perhaps I should give it a try in my infinite free time.
>
>
> [JS] Thanks for your condolences.
>
> Regards,
>
> Jerry Schwartz
> Global Information Incorporated
> 195 Farmington Ave.
> Farmington, CT 06032
>
> 860.674.8796 / FAX: 860.674.8341
> E-mail: jerry@gii.co.jp
> Web site: www.the-infoshop.com
>

--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Join based upon LIKE

am 03.05.2011 16:52:18 von Jerry Schwartz

>-----Original Message-----
>From: Johan De Meersman [mailto:vegivamp@tuxera.be]
>Sent: Tuesday, May 03, 2011 5:31 AM
>To: Jerry Schwartz
>Cc: Jim McNeely; mysql mailing list; Johan De Meersman
>Subject: Re: Join based upon LIKE
>
>
>http://www.gedpage.com/soundex.html offers a simple explanation of what it
>does.
>
>One possibility would be building a referential table with only a recordID
>and
>soundex column, unique over both; and filling that with the soundex of
>individual nonjunk words.
>
>So, from the titles
>
>1 | Rain in Spain
>2 | Spain's Rain
>
>you'd get
>
>1 | R500
>1 | S150
>2 | S150
>2 | R500
>
>From thereon, you can see that all the same words have been used - ignoring a
>lot of spelling errors like Spian. Obviously not a magic solution, but it's a
>start.
>
[JS] Thanks.

I'm not sure that I could easily build a dictionary of non-junk words, since
some of these reports have titles like "Toluene Diisocyanate Market Outlook
2008", "Toluene Market Outlook 2008", and "Toluene: 2009 World Market Outlook
And Forecast (Special Crisis Edition)".

I shall ponder this when I am caught up, or (more likely) in the afterlife.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

>----- Original Message -----
>> From: "Jerry Schwartz"
>> To: "Johan De Meersman"
>> Cc: "Jim McNeely" , "mysql mailing list"
>
>> Sent: Monday, 2 May, 2011 4:09:36 PM
>> Subject: RE: Join based upon LIKE
>>
>> [JS] I've thought about using soundex(), but I'm not quite sure how.
>>
>> I didn't pursue it much because there are so many odd terms such as
>> chemical
>> names, but perhaps I should give it a try in my infinite free time.
>>
>>
>> [JS] Thanks for your condolences.
>>
>> Regards,
>>
>> Jerry Schwartz
>> Global Information Incorporated
>> 195 Farmington Ave.
>> Farmington, CT 06032
>>
>> 860.674.8796 / FAX: 860.674.8341
>> E-mail: jerry@gii.co.jp
>> Web site: www.the-infoshop.com
>>
>
>--
>Bier met grenadyn
>Is als mosterd by den wyn
>Sy die't drinkt, is eene kwezel
>Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join based upon LIKE

am 03.05.2011 16:59:01 von Johan De Meersman

----- Original Message -----
> From: "Jerry Schwartz"
>
> I'm not sure that I could easily build a dictionary of non-junk
> words, since

The traditional way is to build a database of junk words. The list tends to be shorter :-)

Think and/or/it/the/with/like/...

Percentages of mutual and non-mutual words between two titles should be a reasonable indicator of likeness. You could conceivably even assign value to individual words, so "polypropylbutanate" is more useful than "synergy" for comparison purposes.

All very theoretical, though, I haven't actually done much of it to this level. My experience in data mangling is limited to mostly should-be-fixed-format data like sports results.

--
Bier met grenadyn
Is als mosterd by den wyn
Sy die't drinkt, is eene kwezel
Hy die't drinkt, is ras een ezel

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join based upon LIKE

am 03.05.2011 22:08:19 von Shawn Wilson

--0015174ff3128afdca04a264b308
Content-Type: text/plain; charset=ISO-8859-1

I'm actually enjoying this discussion because I have the same type of issue.
However, I have done away with trying to do a full text search in favor of
making a table with unique fields where all fields should uniquely identify
the group. If I get a dupe, I can clean it up.

However, like you, they don't want me to mess with the original data. So,
what I have is another table with my good data that my table with my unique
data refers to. If a bad record is creased, I don't care I just create my
relationship to the table of data I know (read think - I rarely look at this
stuff) is good.

So, I have 4 fields that should be unique for a group. Two chats and two
ints. If three of these match a record in the 'good data' table - there's my
relationship. If two or less match, I create a new record in my 'good data'
table and log the event. (I haven't gotten to the logging part yet though,
easy enough just to look sense none of the fields in 'good data' should
match)

I'm thinking you might have to dig deeper than me to find 'good data' but I
think its there. Maybe isbn, name, publisher + address, price, average
pages, name of sales person, who you guys pay for the material, etc etc etc.

On May 3, 2011 10:59 AM, "Johan De Meersman" wrote:
>
>
> ----- Original Message -----
> > From: "Jerry Schwartz"
> >
> > I'm not sure that I could easily build a dictionary of non-junk
> > words, since
>
> The traditional way is to build a database of junk words. The list tends
to be shorter :-)
>
> Think and/or/it/the/with/like/...
>
> Percentages of mutual and non-mutual words between two titles should be a
reasonable indicator of likeness. You could conceivably even assign value to
individual words, so "polypropylbutanate" is more useful than "synergy" for
comparison purposes.
>
> All very theoretical, though, I haven't actually done much of it to this
level. My experience in data mangling is limited to mostly
should-be-fixed-format data like sports results.
>
>
> --
> Bier met grenadyn
> Is als mosterd by den wyn
> Sy die't drinkt, is eene kwezel
> Hy die't drinkt, is ras een ezel
>
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe: http://lists.mysql.com/mysql?unsub=ag4ve.us@gmail.com
>

--0015174ff3128afdca04a264b308--

RE: Join based upon LIKE

am 03.05.2011 22:34:59 von Jerry Schwartz

My situation is sounds rather simple. All I am doing is matching a spreadsheet
of products against our database. My job is to find any matches against
existing products and determine which ones are new, which ones are
replacements for older products, and which ones just need to have the
publication date (and page count, price, whatever) refreshed.

Publisher is no problem. What I have for each "feed" is a title and (most of
the time) an ISBN or other identification assigned by the publisher.

Matching by product ID is easy (assuming there aren't any mistakes in the
current or previous feeds); but the publisher might or might not change the
product ID when they update a report. That's why I also run a match by title,
and that's where all the trouble comes from.

The publisher might or might not include a mix of old and new products in a
feed. The publisher might change the title of an existing product, either on
purpose or by accident; they might simply be sloppy about their spelling; or
(and this is where it is critical) the title might include a reference to some
time period such as a year or a quarter.

I think we'd better pull the plug on this discussion. It doesn't seem like
there's a ready solution. Fortunately our database is small, and most feeds
are only a few hundred products.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

>-----Original Message-----
>From: shawn wilson [mailto:ag4ve.us@gmail.com]
>Sent: Tuesday, May 03, 2011 4:08 PM
>Cc: mysql mailing list
>Subject: Re: Join based upon LIKE
>
>I'm actually enjoying this discussion because I have the same type of issue.
>However, I have done away with trying to do a full text search in favor of
>making a table with unique fields where all fields should uniquely identify
>the group. If I get a dupe, I can clean it up.
>
>However, like you, they don't want me to mess with the original data. So,
>what I have is another table with my good data that my table with my unique
>data refers to. If a bad record is creased, I don't care I just create my
>relationship to the table of data I know (read think - I rarely look at this
>stuff) is good.
>
>So, I have 4 fields that should be unique for a group. Two chats and two
>ints. If three of these match a record in the 'good data' table - there's my
>relationship. If two or less match, I create a new record in my 'good data'
>table and log the event. (I haven't gotten to the logging part yet though,
>easy enough just to look sense none of the fields in 'good data' should
>match)
>
>I'm thinking you might have to dig deeper than me to find 'good data' but I
>think its there. Maybe isbn, name, publisher + address, price, average
>pages, name of sales person, who you guys pay for the material, etc etc etc.
>
>
>On May 3, 2011 10:59 AM, "Johan De Meersman" wrote:
>>
>>
>> ----- Original Message -----
>> > From: "Jerry Schwartz"
>> >
>> > I'm not sure that I could easily build a dictionary of non-junk
>> > words, since
>>
>> The traditional way is to build a database of junk words. The list tends
>to be shorter :-)
>>
>> Think and/or/it/the/with/like/...
>>
>> Percentages of mutual and non-mutual words between two titles should be a
>reasonable indicator of likeness. You could conceivably even assign value to
>individual words, so "polypropylbutanate" is more useful than "synergy" for
>comparison purposes.
>>
>> All very theoretical, though, I haven't actually done much of it to this
>level. My experience in data mangling is limited to mostly
>should-be-fixed-format data like sports results.
>>
>>
>> --
>> Bier met grenadyn
>> Is als mosterd by den wyn
>> Sy die't drinkt, is eene kwezel
>> Hy die't drinkt, is ras een ezel
>>
>> --
>> MySQL General Mailing List
>> For list archives: http://lists.mysql.com/mysql
>> To unsubscribe: http://lists.mysql.com/mysql?unsub=ag4ve.us@gmail.com
>>

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

Re: Join based upon LIKE

am 04.05.2011 00:20:58 von nuno.tavares

Dear Jerry,

I've been silently following this discussion because I've missed the
original question.

But from your last explanation, now it really looks you have a "data
quality" kind of issue, which is by far related with MySQL.

Indeed, in Data Quality, there is *never* a ready solution, because the
source is tipically chaotic....

May I suggest you to explore Google Refine? It seems to be able to
address all those issues quite nicely, and the clustering might solve
your problem at once. You shall know, however, how to export the tables
(or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that.

Hope it helps,
-NT

Em 03-05-2011 21:34, Jerry Schwartz escreveu:
> My situation is sounds rather simple. All I am doing is matching a spreadsheet
> of products against our database. My job is to find any matches against
> existing products and determine which ones are new, which ones are
> replacements for older products, and which ones just need to have the
> publication date (and page count, price, whatever) refreshed.
>
> Publisher is no problem. What I have for each "feed" is a title and (most of
> the time) an ISBN or other identification assigned by the publisher.
>
> Matching by product ID is easy (assuming there aren't any mistakes in the
> current or previous feeds); but the publisher might or might not change the
> product ID when they update a report. That's why I also run a match by title,
> and that's where all the trouble comes from.
>
> The publisher might or might not include a mix of old and new products in a
> feed. The publisher might change the title of an existing product, either on
> purpose or by accident; they might simply be sloppy about their spelling; or
> (and this is where it is critical) the title might include a reference to some
> time period such as a year or a quarter.
>
> I think we'd better pull the plug on this discussion. It doesn't seem like
> there's a ready solution. Fortunately our database is small, and most feeds
> are only a few hundred products.
>
> Regards,
>
> Jerry Schwartz
> Global Information Incorporated
> 195 Farmington Ave.
> Farmington, CT 06032
>
> 860.674.8796 / FAX: 860.674.8341
> E-mail: jerry@gii.co.jp
> Web site: www.the-infoshop.com
>
>
>> -----Original Message-----
>> From: shawn wilson [mailto:ag4ve.us@gmail.com]
>> Sent: Tuesday, May 03, 2011 4:08 PM
>> Cc: mysql mailing list
>> Subject: Re: Join based upon LIKE
>>
>> I'm actually enjoying this discussion because I have the same type of issue.
>> However, I have done away with trying to do a full text search in favor of
>> making a table with unique fields where all fields should uniquely identify
>> the group. If I get a dupe, I can clean it up.
>>
>> However, like you, they don't want me to mess with the original data. So,
>> what I have is another table with my good data that my table with my unique
>> data refers to. If a bad record is creased, I don't care I just create my
>> relationship to the table of data I know (read think - I rarely look at this
>> stuff) is good.
>>
>> So, I have 4 fields that should be unique for a group. Two chats and two
>> ints. If three of these match a record in the 'good data' table - there's my
>> relationship. If two or less match, I create a new record in my 'good data'
>> table and log the event. (I haven't gotten to the logging part yet though,
>> easy enough just to look sense none of the fields in 'good data' should
>> match)
>>
>> I'm thinking you might have to dig deeper than me to find 'good data' but I
>> think its there. Maybe isbn, name, publisher + address, price, average
>> pages, name of sales person, who you guys pay for the material, etc etc etc.
>>
>>
>> On May 3, 2011 10:59 AM, "Johan De Meersman" wrote:
>>>
>>>
>>> ----- Original Message -----
>>>> From: "Jerry Schwartz"
>>>>
>>>> I'm not sure that I could easily build a dictionary of non-junk
>>>> words, since
>>>
>>> The traditional way is to build a database of junk words. The list tends
>> to be shorter :-)
>>>
>>> Think and/or/it/the/with/like/...
>>>
>>> Percentages of mutual and non-mutual words between two titles should be a
>> reasonable indicator of likeness. You could conceivably even assign value to
>> individual words, so "polypropylbutanate" is more useful than "synergy" for
>> comparison purposes.
>>>
>>> All very theoretical, though, I haven't actually done much of it to this
>> level. My experience in data mangling is limited to mostly
>> should-be-fixed-format data like sports results.
>>>
>>>
>>> --
>>> Bier met grenadyn
>>> Is als mosterd by den wyn
>>> Sy die't drinkt, is eene kwezel
>>> Hy die't drinkt, is ras een ezel
>>>
>>> --
>>> MySQL General Mailing List
>>> For list archives: http://lists.mysql.com/mysql
>>> To unsubscribe: http://lists.mysql.com/mysql?unsub=ag4ve.us@gmail.com
>>>
>
>
>
>

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Join based upon LIKE

am 05.05.2011 15:54:02 von Jerry Schwartz

>-----Original Message-----
>From: Nuno Tavares [mailto:nuno.tavares@dri.pt]
>Sent: Tuesday, May 03, 2011 6:21 PM
>To: mysql@lists.mysql.com
>Subject: Re: Join based upon LIKE
>
>Dear Jerry,
>
>I've been silently following this discussion because I've missed the
>original question.
>
>But from your last explanation, now it really looks you have a "data
>quality" kind of issue, which is by far related with MySQL.
>
[JS] Definitely -- but I have to work with the tools available. This is only
one part of the process, there is more trouble further on that is not related
to our database at all.

>Indeed, in Data Quality, there is *never* a ready solution, because the
>source is tipically chaotic....
>
>May I suggest you to explore Google Refine? It seems to be able to
>address all those issues quite nicely, and the clustering might solve
>your problem at once. You shall know, however, how to export the tables
>(or a usable JOIN) as a CSV, see SELECT ... INTO OUTFILE for that.
>
[JS] I never heard of Google Refine. Thanks for bringing to my attention.

>Hope it helps,
>-NT
[JS] Thank you.

Regards,

Jerry Schwartz
Global Information Incorporated
195 Farmington Ave.
Farmington, CT 06032

860.674.8796 / FAX: 860.674.8341
E-mail: jerry@gii.co.jp
Web site: www.the-infoshop.com

>
>Em 03-05-2011 21:34, Jerry Schwartz escreveu:
>> My situation is sounds rather simple. All I am doing is matching a
>spreadsheet
>> of products against our database. My job is to find any matches against
>> existing products and determine which ones are new, which ones are
>> replacements for older products, and which ones just need to have the
>> publication date (and page count, price, whatever) refreshed.
>>
>> Publisher is no problem. What I have for each "feed" is a title and (most
>> of
>> the time) an ISBN or other identification assigned by the publisher.
>>
>> Matching by product ID is easy (assuming there aren't any mistakes in the
>> current or previous feeds); but the publisher might or might not change the
>> product ID when they update a report. That's why I also run a match by
>> title,
>> and that's where all the trouble comes from.
>>
>> The publisher might or might not include a mix of old and new products in a
>> feed. The publisher might change the title of an existing product, either
>> on
>> purpose or by accident; they might simply be sloppy about their spelling;
>> or
>> (and this is where it is critical) the title might include a reference to
>some
>> time period such as a year or a quarter.
>>
>> I think we'd better pull the plug on this discussion. It doesn't seem like
>> there's a ready solution. Fortunately our database is small, and most feeds
>> are only a few hundred products.
>>
>> Regards,
>>
>> Jerry Schwartz
>> Global Information Incorporated
>> 195 Farmington Ave.
>> Farmington, CT 06032
>>
>> 860.674.8796 / FAX: 860.674.8341
>> E-mail: jerry@gii.co.jp
>> Web site: www.the-infoshop.com
>>
>>
>>> -----Original Message-----
>>> From: shawn wilson [mailto:ag4ve.us@gmail.com]
>>> Sent: Tuesday, May 03, 2011 4:08 PM
>>> Cc: mysql mailing list
>>> Subject: Re: Join based upon LIKE
>>>
>>> I'm actually enjoying this discussion because I have the same type of
>>> issue.
>>> However, I have done away with trying to do a full text search in favor of
>>> making a table with unique fields where all fields should uniquely
>>> identify
>>> the group. If I get a dupe, I can clean it up.
>>>
>>> However, like you, they don't want me to mess with the original data. So,
>>> what I have is another table with my good data that my table with my
>>> unique
>>> data refers to. If a bad record is creased, I don't care I just create my
>>> relationship to the table of data I know (read think - I rarely look at
>>> this
>>> stuff) is good.
>>>
>>> So, I have 4 fields that should be unique for a group. Two chats and two
>>> ints. If three of these match a record in the 'good data' table - there's
>>> my
>>> relationship. If two or less match, I create a new record in my 'good
>>> data'
>>> table and log the event. (I haven't gotten to the logging part yet though,
>>> easy enough just to look sense none of the fields in 'good data' should
>>> match)
>>>
>>> I'm thinking you might have to dig deeper than me to find 'good data' but
>>> I
>>> think its there. Maybe isbn, name, publisher + address, price, average
>>> pages, name of sales person, who you guys pay for the material, etc etc
>>> etc.
>>>
>>>
>>> On May 3, 2011 10:59 AM, "Johan De Meersman" wrote:
>>>>
>>>>
>>>> ----- Original Message -----
>>>>> From: "Jerry Schwartz"
>>>>>
>>>>> I'm not sure that I could easily build a dictionary of non-junk
>>>>> words, since
>>>>
>>>> The traditional way is to build a database of junk words. The list tends
>>> to be shorter :-)
>>>>
>>>> Think and/or/it/the/with/like/...
>>>>
>>>> Percentages of mutual and non-mutual words between two titles should be a
>>> reasonable indicator of likeness. You could conceivably even assign value
>>> to
>>> individual words, so "polypropylbutanate" is more useful than "synergy"
>>> for
>>> comparison purposes.
>>>>
>>>> All very theoretical, though, I haven't actually done much of it to this
>>> level. My experience in data mangling is limited to mostly
>>> should-be-fixed-format data like sports results.
>>>>
>>>>
>>>> --
>>>> Bier met grenadyn
>>>> Is als mosterd by den wyn
>>>> Sy die't drinkt, is eene kwezel
>>>> Hy die't drinkt, is ras een ezel
>>>>
>>>> --
>>>> MySQL General Mailing List
>>>> For list archives: http://lists.mysql.com/mysql
>>>> To unsubscribe: http://lists.mysql.com/mysql?unsub=ag4ve.us@gmail.com
>>>>
>>
>>
>>
>>
>
>
>--
>MySQL General Mailing List
>For list archives: http://lists.mysql.com/mysql
>To unsubscribe: http://lists.mysql.com/mysql?unsub=jerry@gii.co.jp

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org