finding duplicate records

am 19.03.2007 04:49:23 von rgurganus

I'm trying to setup a way to find possible duplicate or near-duplicate
records of contact people. I have a table of 50k records doing an
inner join with itself, something like this:

SELECT p1.fields, p2.fields
FROM table AS p1 INNER JOIN table AS p2 ON p1.lastname = p2.lastname
WHERE p1.id <> p2.id AND
(other criteria)

The id and name and other criteria fields are indexed as well as they
can be. It works fine as long as I add a LIMIT statement to keep it
down to just 10 or 20 records, but if I take this off, it runs on
forever -- so I can't even tell how many potentail records there will
be. If the table grows to 100k or 200k, it'll only get worse.

So my question is -- are there any better ways to go about this?

Re: finding duplicate records

am 19.03.2007 09:29:35 von zac.carey

On Mar 19, 3:49 am, rgurga...@citidc.com wrote:
> I'm trying to setup a way to find possible duplicate or near-duplicate
> records of contact people. I have a table of 50k records doing an
> inner join with itself, something like this:
>
> SELECT p1.fields, p2.fields
> FROM table AS p1 INNER JOIN table AS p2 ON p1.lastname = p2.lastname
> WHERE p1.id <> p2.id AND
> (other criteria)
>
> The id and name and other criteria fields are indexed as well as they
> can be. It works fine as long as I add a LIMIT statement to keep it
> down to just 10 or 20 records, but if I take this off, it runs on
> forever -- so I can't even tell how many potentail records there will
> be. If the table grows to 100k or 200k, it'll only get worse.
>
> So my question is -- are there any better ways to go about this?

I think '<' would be better than '<>'.