Text similarity

on 28.09.2009 12:27:43 by Merlin Morgenstern

Hi there,

I am trying to find out similarity between 2 strings. Somehow the
similar_text function returns 33% similarity on strings that are not
even close and on the other hand it returns 21% on strings that have a
matching word.

E.g., for the search string:

'gemütliche sofas'

Wohngemeinschaften - similarity: 33.333333333333
Sofas & Sessel - similarity: 31.25

I am using this code:
similar_text($data['txt'], $categories[$i], $similarity);

Does anybody have an idea why it gives back 33% similarity on the first
string?

Thank you for any help,

Merlin

--
PHP General Mailing List (http://www.php.net/)
To unsubscribe, visit: http://www.php.net/unsub.php

Re: Text similarity

on 28.09.2009 12:37:28 by Ashley Sheridan

If you think about it, it makes sense.

Taking your three strings above, 'Wohngemeinschaften' has more
characters similar towards the start of the string (you only have to go
4 characters in to start a match) whereas 'sofas' won't match the source
string until the 12th character in. Also, both test strings have the same
number of characters that match in order, although the ones that match
in 'Wohngemeinschaften' are separated by characters that do not match,
so I'm not sure what bearing this will have.

As noted on the manual page for this function, the similar_text()
function compares without regard to string length, and tends to only
really be accurate enough for larger excerpts of text.
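
For reference, a small sketch of what similar_text() reports. It returns the number of matching bytes and, through the optional third argument, a percentage. As far as I can tell from the PHP source, that percentage is twice the match count divided by the combined length of both strings. The comparison is also case-sensitive, which matters for a pair like 'Sofas' and 'sofas'. The variable names here are only for illustration.

```php
<?php
// Number of matching characters, plus the percentage via the third argument.
$matched = similar_text('World', 'Word', $percent);

// Assumption (from reading the PHP source): percent = 2 * matched / (len1 + len2) * 100.
$expected = 2 * $matched / (strlen('World') + strlen('Word')) * 100;

// The comparison is case-sensitive: 'S' and 's' do not count as a match.
$caseSensitive   = similar_text('Sofas', 'sofas');
$caseInsensitive = similar_text(strtolower('Sofas'), strtolower('sofas'));

printf("matched=%d percent=%.2f expected=%.2f\n", $matched, $percent, $expected);
printf("case-sensitive=%d case-insensitive=%d\n", $caseSensitive, $caseInsensitive);
```

Lowercasing both strings before comparing is a cheap first step, although strtolower() only handles ASCII letters reliably.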

Thanks,
Ash
http://www.ashleysheridan.co.uk

Re: Text similarity

on 28.09.2009 13:07:38 by Merlin Morgenstern

Sounds logical. Is there another function you suggest? I guess this is a
standard problem I am having here. I tried it with levenshtein(), but got
similar results.

E.g. levenshtein (smaller = better):
Search for: Stellplatz für Wohnwagen gesucht
Stereoanlagen : 23
Wohnwagen, -mobile : 24
Sonstiges für Baby & Kind - : 25
Steuer & Finanzen - : 25

How come Stereoanlagen and the others show up here?

Any idea how I could make this more accurate?

Thank you for any help, Merlin

Re: Text similarity

on 28.09.2009 13:18:33 by Ashley Sheridan

I'm guessing it's to do with the position of characters within the
string. You could roll your own function that does what you
specifically need.
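
One factor that can be checked directly: levenshtein() counts single-byte insertions, deletions and substitutions, so the distance can never be smaller than the difference in length between the two strings. With a long search phrase and short category names, a large part of each score is just that length gap. A minimal sketch, using an ASCII stand-in for the search phrase ("fuer" instead of "für", an assumption made so that byte length and character length agree):

```php
<?php
// ASCII stand-in for the search phrase from the thread ("fuer" for "für").
$search = 'Stellplatz fuer Wohnwagen gesucht';
$candidates = ['Stereoanlagen', 'Wohnwagen, -mobile', 'Steuer & Finanzen'];

$rows = [];
foreach ($candidates as $candidate) {
    $distance = levenshtein($search, $candidate);
    // Lower bound: the length difference alone forces at least this many edits.
    $floor = abs(strlen($search) - strlen($candidate));
    $rows[] = ['candidate' => $candidate, 'distance' => $distance, 'floor' => $floor];
    printf("%-20s distance=%d, forced by length=%d\n", $candidate, $distance, $floor);
}
```

Comparing word against word instead of phrase against phrase avoids most of this effect.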

Break down the lines into individual words.

Loop through the match string and see if the words exist within the
phrases you're searching in, and keep a count of all 'hits'. As you
loop, create a metaphone key (soundex might also help, but I think
metaphone will work fairly well for German) and check this against a
metaphone version of the phrases you're searching in. Keep a separate
count of metaphone matches.

At the end, any of the search phrases that has a non-zero count of
either type is a match. Collate them, and order by solid matches (whole
words) and then by metaphone matches.

This is very simplified, and will need a lot of tweaking to get it
right, but it might be somewhere to start?
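
The steps above could be sketched roughly as follows. This is only one reading of the idea: the function names (words, scorePhrase, rankCandidates) are made up for illustration, a word is counted as a metaphone hit only when it is not already an exact hit, and the source file is assumed to be UTF-8. metaphone() is built around English pronunciation, so how well it behaves for German would need testing.

```php
<?php
// Hypothetical helper: split a phrase into lowercase words.
function words(string $phrase): array
{
    $lower = function_exists('mb_strtolower')
        ? mb_strtolower($phrase, 'UTF-8')
        : strtolower($phrase);
    // Split on anything that is not a letter or digit (Unicode-aware).
    $parts = preg_split('/[^\p{L}\p{N}]+/u', $lower, -1, PREG_SPLIT_NO_EMPTY);
    return $parts === false ? [] : $parts;
}

// Count exact word hits and, separately, metaphone-only hits.
function scorePhrase(string $search, string $candidate): array
{
    $candidateWords = words($candidate);
    $candidateKeys  = array_map('metaphone', $candidateWords);
    $exact = 0;
    $phonetic = 0;
    foreach (words($search) as $word) {
        if (in_array($word, $candidateWords, true)) {
            $exact++;
        } elseif (($key = metaphone($word)) !== '' && in_array($key, $candidateKeys, true)) {
            $phonetic++;
        }
    }
    return ['exact' => $exact, 'phonetic' => $phonetic];
}

// Keep candidates with at least one hit; order by exact hits, then metaphone hits.
function rankCandidates(string $search, array $candidates): array
{
    $ranked = [];
    foreach ($candidates as $candidate) {
        $score = scorePhrase($search, $candidate);
        if ($score['exact'] > 0 || $score['phonetic'] > 0) {
            $ranked[] = ['phrase' => $candidate] + $score;
        }
    }
    usort($ranked, function (array $a, array $b): int {
        return [$b['exact'], $b['phonetic']] <=> [$a['exact'], $a['phonetic']];
    });
    return array_column($ranked, 'phrase');
}

print_r(rankCandidates('Stellplatz für Wohnwagen gesucht', [
    'Stereoanlagen',
    'Wohnwagen, -mobile',
    'Sonstiges für Baby & Kind',
    'Steuer & Finanzen',
]));
```

With this scoring, 'Sonstiges für Baby & Kind' still counts as a hit because it shares the word 'für' with the search phrase, so filtering out filler words would be part of the tweaking.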

Thanks,
Ash
http://www.ashleysheridan.co.uk

Re: Text similarity

on 28.09.2009 19:25:52 by Tom Worster

as ashley pointed out, it's not a trivial problem.

if you are performing the tests against strings in a db table then a full
text index might help. see, e.g.:
http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

you could also check out the php sphinx client
http://us3.php.net/manual/en/book.sphinx.php

if you are writing your own solutions and using utf8, take care with
similar_text() or levenshtein(). i don't think they are designed for
multibyte strings. so if you are using utf8 they will probably report bigger
differences than you might expect. i wrote my own limited
damerau-levenshtein function for utf8.
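
to illustrate the multibyte point, here is a minimal sketch of a character-based levenshtein for utf8. it is not the function mentioned above (that one is not shown in this thread), and it is plain levenshtein rather than damerau-levenshtein, so transpositions are not treated specially. the names utf8Chars and utf8Levenshtein are made up for this example.

```php
<?php
// Split a UTF-8 string into an array of characters (code points).
function utf8Chars(string $s): array
{
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    // Fall back to bytes if the input is not valid UTF-8.
    return $chars === false ? str_split($s) : $chars;
}

// Levenshtein distance counted in characters instead of bytes.
function utf8Levenshtein(string $a, string $b): int
{
    $a = utf8Chars($a);
    $b = utf8Chars($b);
    $lenA = count($a);
    $lenB = count($b);

    // Row for the empty prefix of $a: j insertions reach the first j chars of $b.
    $prev = range(0, $lenB);
    for ($i = 1; $i <= $lenA; $i++) {
        $curr = [$i];
        for ($j = 1; $j <= $lenB; $j++) {
            $cost = ($a[$i - 1] === $b[$j - 1]) ? 0 : 1;
            $curr[$j] = min(
                $prev[$j] + 1,         // deletion
                $curr[$j - 1] + 1,     // insertion
                $prev[$j - 1] + $cost  // substitution
            );
        }
        $prev = $curr;
    }
    return $prev[$lenB];
}

// "für" written with explicit UTF-8 bytes so the file encoding does not matter.
$fuer = "f\xC3\xBCr";
printf("bytes: %d, characters: %d\n", levenshtein($fuer, 'fur'), utf8Levenshtein($fuer, 'fur'));
```

the built-in function sees the two bytes of ü as two separate symbols, which is where the extra distance comes from.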

even if you're using a single byte encoding, i would guess they ignore a
locale's collation. so say you set a german locale, ü will be regarded as
different from both u and ue. again, if you are searching against
strings in a db table, the dbms may understand collations properly.
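
if the comparison has to stay in php, one workaround is to fold the strings to a common form before comparing them. a minimal sketch, assuming utf8 input and assuming that the ue/ae/oe/ss spelling is the form you want to compare on; foldGerman is a made-up name and the mapping is deliberately tiny:

```php
<?php
// Hypothetical helper: fold German umlauts and sharp s to ASCII, then lowercase.
function foldGerman(string $s): string
{
    $map = [
        "\u{00E4}" => 'ae', "\u{00F6}" => 'oe', "\u{00FC}" => 'ue', // ä ö ü
        "\u{00C4}" => 'ae', "\u{00D6}" => 'oe', "\u{00DC}" => 'ue', // Ä Ö Ü
        "\u{00DF}" => 'ss',                                         // ß
    ];
    // strtr() with an array replaces whole byte sequences, so UTF-8 stays intact.
    return strtolower(strtr($s, $map));
}

$a = foldGerman("gem\u{00FC}tliche Sofas");
$b = foldGerman('gemuetliche sofas');
printf("%s / %s / distance=%d\n", $a, $b, levenshtein($a, $b));
```

this only covers the characters listed in the map, so a proper collation in the dbms is still the more complete answer.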


