[ENG] closer result

[ENG] closer result

am 12.12.2005 09:53:42 von Bob Bedford

I've a query where I've to find the "closest" result.

Let's say I've from a file this name:

Tamiya TA 03F Pro

I've to find in a database which contains such results:

Tamiya TA 04
Tamiya TA 03E
Tamiya TAi 04B
Tamiya TA 03B
Tamiya TAi 03F

In such case, I've to find the closest result, I mean the one that is
closest to the TA 03 F Pro I'm looking for.
In this case is TAi 03F (there is only a i more in this case).

How to do so ? I must have as many equal letters. I mean in this case it
doesn't matter if there is a i in the middle, the TA and the 03F is
corresponding at the model.

Is that possible with a general query ? as sometimes it can be something
else. I want to give more importance to the TA1 03F than the other results.

Thanks for help.

Bob

Re: [ENG] closer result

am 12.12.2005 10:26:58 von Christian Kirsch

Bob Bedford schrieb:
> I've a query where I've to find the "closest" result.
>
> Let's say I've from a file this name:
>
> Tamiya TA 03F Pro
>
> I've to find in a database which contains such results:
>
> Tamiya TA 04
> Tamiya TA 03E
> Tamiya TAi 04B
> Tamiya TA 03B
> Tamiya TAi 03F
>
> In such case, I've to find the closest result, I mean the one that is
> closest to the TA 03 F Pro I'm looking for.
> In this case is TAi 03F (there is only a i more in this case).

A word like 'closest' is meaningless if you don't specify a metric
that measures the distance.

If you're able to specify the metric inambiguously, you're all set.
Otherwise, you may have to think again.
>
> How to do so ? I must have as many equal letters. I mean in this case it
> doesn't matter if there is a i in the middle, the TA and the 03F is
> corresponding at the model.
>
What's "the model" here? The search string?

> Is that possible with a general query ? as sometimes it can be something
> else. I want to give more importance to the TA1 03F than the other results.

I don't see a TA1 03F in your example.

One simple metric might be to count the numbers of insertions and
deletions to get from the search string to the one in the database.
The minimum of these values could be your 'closest match'. Another
possibility would be to split up your data into several fields
('Tamiya', 'TA', '03F', 'Pro') and compare each field separately,
possibly using different weights for each field.

I doubt that there's a general solution to this question and suggest
that you look for possiblities to structure your data in your problem
domain.

Re: [ENG] closer result

am 12.12.2005 10:53:23 von Bob Bedford

Thanks for replying Christian
>> Tamiya TA 04
>> Tamiya TA 03E
>> Tamiya TAi 04B
>> Tamiya TA 03B
>> Tamiya TAi 03F
>>
>> In such case, I've to find the closest result, I mean the one that is
>> closest to the TA 03 F Pro I'm looking for.
>> In this case is TAi 03F (there is only a i more in this case).
>
> A word like 'closest' is meaningless if you don't specify a metric
> that measures the distance.
I've seen once the possibility to put a weight on result on SQL, where the
most "close" result is where more letters are in common.

> If you're able to specify the metric inambiguously, you're all set.
> Otherwise, you may have to think again.
>>
>> How to do so ? I must have as many equal letters. I mean in this case it
>> doesn't matter if there is a i in the middle, the TA and the 03F is
>> corresponding at the model.
>>
> What's "the model" here? The search string?
Yes in my case I've to find the closest string. In my case, TA 03F is
contained in TAi 03F so this record should be selected.
If I look for TA 03 then the closest should be TA 03B as it's the first with
TA 03 (the second is TA 03E)

>
>> Is that possible with a general query ? as sometimes it can be something
>> else. I want to give more importance to the TA1 03F than the other
>> results.
>
> I don't see a TA1 03F in your example.
it's TAi 03F not TA1 03F
>
> One simple metric might be to count the numbers of insertions and
> deletions to get from the search string to the one in the database.
> The minimum of these values could be your 'closest match'. Another
> possibility would be to split up your data into several fields
> ('Tamiya', 'TA', '03F', 'Pro') and compare each field separately,
> possibly using different weights for each field.
>
> I doubt that there's a general solution to this question and suggest
> that you look for possiblities to structure your data in your problem
> domain.
Impossible to change how the datas are stored. I've to try to get the best
matching result using given datas. I can't even change the coming datas to
check for, as they come from an other program wich I've no control on.

Any idea ?

Bob

Re: [ENG] closer result

am 12.12.2005 11:03:17 von Christian Kirsch

Bob Bedford schrieb:
> Thanks for replying Christian
>
>>>Tamiya TA 04
>>>Tamiya TA 03E
>>>Tamiya TAi 04B
>>>Tamiya TA 03B
>>>Tamiya TAi 03F
>>>
>>>In such case, I've to find the closest result, I mean the one that is
>>>closest to the TA 03 F Pro I'm looking for.
>>>In this case is TAi 03F (there is only a i more in this case).
>>
>>A word like 'closest' is meaningless if you don't specify a metric
>>that measures the distance.
>
> I've seen once the possibility to put a weight on result on SQL, where the
> most "close" result is where more letters are in common.
>

I'm not talking about what some SQL implementation might offer but
about *your* requirements. "more letters are in common" could be ok if
you don't care *where* the letters are - then "the USA" is considered
the same as "A theus".

> Yes in my case I've to find the closest string. In my case, TA 03F is
> contained in TAi 03F so this record should be selected.
> If I look for TA 03 then the closest should be TA 03B as it's the first with
> TA 03 (the second is TA 03E)
>

Ah. Now you're not only talking about metric but about ordering too.
You weren't before.

If you're able to write down your requirements in the form of an SQL
procedure or function - fine. Otherwise, you might want to do the
filtering in your application. In anycase, you have to write down
*something* - aka an algorithm.

>>
>>I doubt that there's a general solution to this question and suggest
>>that you look for possiblities to structure your data in your problem
>>domain.
>
> Impossible to change how the datas are stored. I've to try to get the best

Impossible is always a good point to start from.

> matching result using given datas. I can't even change the coming datas to
> check for, as they come from an other program wich I've no control on.

And you can't control how they are written into your tables? What
*can* you control, then?

Re: [ENG] closer result

am 12.12.2005 11:12:58 von Bob Bedford

"Christian Kirsch" a écrit dans le message de news:
dnjht5$nc$1@news.mind.de...
> Bob Bedford schrieb:
>> Thanks for replying Christian
>>
>>>>Tamiya TA 04
>>>>Tamiya TA 03E
>>>>Tamiya TAi 04B
>>>>Tamiya TA 03B
>>>>Tamiya TAi 03F
>>>>
>>>>In such case, I've to find the closest result, I mean the one that is
>>>>closest to the TA 03 F Pro I'm looking for.
>>>>In this case is TAi 03F (there is only a i more in this case).
>>>
>>>A word like 'closest' is meaningless if you don't specify a metric
>>>that measures the distance.
>>
>> I've seen once the possibility to put a weight on result on SQL, where
>> the
>> most "close" result is where more letters are in common.
>>
>
> I'm not talking about what some SQL implementation might offer but
> about *your* requirements. "more letters are in common" could be ok if
> you don't care *where* the letters are - then "the USA" is considered
> the same as "A theus".
>
>> Yes in my case I've to find the closest string. In my case, TA 03F is
>> contained in TAi 03F so this record should be selected.
>> If I look for TA 03 then the closest should be TA 03B as it's the first
>> with
>> TA 03 (the second is TA 03E)
>>
>
> Ah. Now you're not only talking about metric but about ordering too.
> You weren't before.
>
> If you're able to write down your requirements in the form of an SQL
> procedure or function - fine. Otherwise, you might want to do the
> filtering in your application. In anycase, you have to write down
> *something* - aka an algorithm.
>
>>>
>>>I doubt that there's a general solution to this question and suggest
>>>that you look for possiblities to structure your data in your problem
>>>domain.
>>
>> Impossible to change how the datas are stored. I've to try to get the
>> best
>
> Impossible is always a good point to start from.
>
>> matching result using given datas. I can't even change the coming datas
>> to
>> check for, as they come from an other program wich I've no control on.
>
> And you can't control how they are written into your tables? What
> *can* you control, then?
Not so much but the code for matching results.
The program that send datas is done by a third part company that doesn't it
for me but for the need of hundreds of clients.
I use a database coming also from a third part company. I've to try to match
datas coming from the 2 companies that, off course, don't have the same
structure of datas (as they are concurrent).

I've to match datas trying to get the "best result" as possible.
Probably this should be done in a PHP script (as it's for a website).
Filtering all possible datas then trying to get the best match. But I think
it would be possible to do in Mysql directly.

Bob

Re: [ENG] closer result

am 12.12.2005 11:17:50 von Christian Kirsch

Bob Bedford schrieb:
> "Christian Kirsch" a écrit dans le message de news:
> dnjht5$nc$1@news.mind.de...
>
>>Bob Bedford schrieb:
>>
>>>Thanks for replying Christian
>>>
>>>
>>>>>Tamiya TA 04
>>>>>Tamiya TA 03E
>>>>>Tamiya TAi 04B
>>>>>Tamiya TA 03B
>>>>>Tamiya TAi 03F
>>>>>
>>>>>In such case, I've to find the closest result, I mean the one that is
>>>>>closest to the TA 03 F Pro I'm looking for.
>>>>>In this case is TAi 03F (there is only a i more in this case).
>>>>
>>>>A word like 'closest' is meaningless if you don't specify a metric
>>>>that measures the distance.
>>>
>>>I've seen once the possibility to put a weight on result on SQL, where
>>>the
>>>most "close" result is where more letters are in common.
>>>
>>
>>I'm not talking about what some SQL implementation might offer but
>>about *your* requirements. "more letters are in common" could be ok if
>>you don't care *where* the letters are - then "the USA" is considered
>>the same as "A theus".
>>
>>
>>>Yes in my case I've to find the closest string. In my case, TA 03F is
>>>contained in TAi 03F so this record should be selected.
>>>If I look for TA 03 then the closest should be TA 03B as it's the first
>>>with
>>>TA 03 (the second is TA 03E)
>>>
>>
>>Ah. Now you're not only talking about metric but about ordering too.
>>You weren't before.
>>
>>If you're able to write down your requirements in the form of an SQL
>>procedure or function - fine. Otherwise, you might want to do the
>>filtering in your application. In anycase, you have to write down
>>*something* - aka an algorithm.
>>
>>
>>>>I doubt that there's a general solution to this question and suggest
>>>>that you look for possiblities to structure your data in your problem
>>>>domain.
>>>
>>>Impossible to change how the datas are stored. I've to try to get the
>>>best
>>
>>Impossible is always a good point to start from.
>>
>>
>>>matching result using given datas. I can't even change the coming datas
>>>to
>>>check for, as they come from an other program wich I've no control on.
>>
>>And you can't control how they are written into your tables? What
>>*can* you control, then?
>
> Not so much but the code for matching results.
> The program that send datas is done by a third part company that doesn't it
> for me but for the need of hundreds of clients.
> I use a database coming also from a third part company. I've to try to match
> datas coming from the 2 companies that, off course, don't have the same
> structure of datas (as they are concurrent).
>

What about using tables of your *own*, where you can just insert the
stuff coming from the other companies in some structure that suits
your needs?

> I've to match datas trying to get the "best result" as possible.
> Probably this should be done in a PHP script (as it's for a website).
> Filtering all possible datas then trying to get the best match. But I think
> it would be possible to do in Mysql directly.
>
As I said before: If you can define your metric in an SQL procedure or
function - yes. You could even write a C/++ function (see the chapter
on user-defined functions in the MySQL documentation). But I don't
think that MySQL does what you want out of the box.

Re: [ENG] closer result

am 12.12.2005 12:27:02 von Axel Schwenke

"Bob Bedford" wrote:
>
> The program that send datas is done by a third part company that doesn't it
> for me but for the need of hundreds of clients.
> I use a database coming also from a third part company. I've to try to match
> datas coming from the 2 companies that, off course, don't have the same
> structure of datas (as they are concurrent).

So I see no need to do this *in* the database. Typical "string
distance" functions like Levenshtein can't use indexes and are
therefore seldom implemented *in* the database. Other solutions
like bi- or trigram counting may benefit from storing preprocessed
data, however I don't know of any database that supports it.

So you have two choices:

1. (let somebody) write a string distance function for i.e. the
Levenshtein algorithm. Preferrably this function would be
implemented in C/C++ and loaded into MySQL as -> UDF

Then you could use this function as follows:

SELECT ... , Levenshtein(a.string, b.string) AS distance
FROM table1 AS a
JOIN table2 AS b
WHERE distance < $threshold
ORDER BY distance


2. Do the same outside the database. In both cases you have to test
the full product (= every combination of entries) of both data sets.
Some distance approaches are faster if you compare a single string
against a list of candidates. So if you task is "find best match
from list and take it if distance is small enough" this would be
for you.


But FIRST you should LEARN; google the following keywords

string distance
approximate string matching


XL

Re: [ENG] closer result

am 12.12.2005 12:36:38 von Christian Kirsch

Axel Schwenke schrieb:
> "Bob Bedford" wrote:
>
>>The program that send datas is done by a third part company that doesn't it
>>for me but for the need of hundreds of clients.
>>I use a database coming also from a third part company. I've to try to match
>>datas coming from the 2 companies that, off course, don't have the same
>>structure of datas (as they are concurrent).
>
>
> So I see no need to do this *in* the database. Typical "string
> distance" functions like Levenshtein can't use indexes and are
> therefore seldom implemented *in* the database. Other solutions
> like bi- or trigram counting may benefit from storing preprocessed
> data, however I don't know of any database that supports it.
>
> So you have two choices:
>
> 1. (let somebody) write a string distance function for i.e. the
> Levenshtein algorithm. Preferrably this function would be
> implemented in C/C++ and loaded into MySQL as -> UDF
>
> Then you could use this function as follows:
>
> SELECT ... , Levenshtein(a.string, b.string) AS distance
> FROM table1 AS a
> JOIN table2 AS b
> WHERE distance < $threshold
> ORDER BY distance
>
>
> 2. Do the same outside the database. In both cases you have to test
> the full product (= every combination of entries) of both data sets.
> Some distance approaches are faster if you compare a single string
> against a list of candidates. So if you task is "find best match
> from list and take it if distance is small enough" this would be
> for you.
>
>

Not quite (unless Levenshtein is not a real metric). Since metric(a,b)
= metric(b,a) and metric(a,a) = 0, it's sufficient to check for
(N*(N-1))/2 combinations. Which can still be a lot, but fortunately
less than N*N.

> But FIRST you should LEARN; google the following keywords
>
> string distance
> approximate string matching

I agree.

Re: [ENG] closer result

am 12.12.2005 14:05:40 von Axel Schwenke

Christian Kirsch wrote:
> Axel Schwenke schrieb:

>> SELECT ... , Levenshtein(a.string, b.string) AS distance
>> FROM table1 AS a
>> JOIN table2 AS b
>> WHERE distance < $threshold
>> ORDER BY distance
>>
>> 2. Do the same outside the database. In both cases you have to test
>> the full product (= every combination of entries) of both data sets.
>
> Not quite (unless Levenshtein is not a real metric). Since metric(a,b)
> = metric(b,a) and metric(a,a) = 0, it's sufficient to check for
> (N*(N-1))/2 combinations. Which can still be a lot, but fortunately
> less than N*N.

Nope. It's two independent datasets (at least to my understanding, so
my writing applies). In that case you cannot bypass half of the tests.
OTOH saving half of the work does not change the order of magnitude.
Its still O(n^2).


XL

Re: [ENG] closer result

am 12.12.2005 14:38:30 von Christian Kirsch

Axel Schwenke schrieb:
> Christian Kirsch wrote:
>
>>Axel Schwenke schrieb:
>
>
>>> SELECT ... , Levenshtein(a.string, b.string) AS distance
>>> FROM table1 AS a
>>> JOIN table2 AS b
>>> WHERE distance < $threshold
>>> ORDER BY distance
>>>
>>>2. Do the same outside the database. In both cases you have to test
>>> the full product (= every combination of entries) of both data sets.
>>
>>Not quite (unless Levenshtein is not a real metric). Since metric(a,b)
>> = metric(b,a) and metric(a,a) = 0, it's sufficient to check for
>>(N*(N-1))/2 combinations. Which can still be a lot, but fortunately
>>less than N*N.
>
>
> Nope. It's two independent datasets (at least to my understanding, so
> my writing applies). In that case you cannot bypass half of the tests.
> OTOH saving half of the work does not change the order of magnitude.
> Its still O(n^2).
>

Right. I didn't think properly.