Removing Duplicate Records

on 14.07.2009 15:57:30 by Matt Neimeyer

In our database we have an Organizations table and a Contacts table,
and a linking table that associates Contacts with Organizations.
Occasionally we manually add to the linking table with information
gleaned from outside data sources. This is common enough to be
annoying, since it ends up with duplicate linkages, but it's FAR from
an everyday occurrence.

I have three options for dealing with the resulting duplicates and I
would appreciate some advice on which option might be best.

1. Attack the horrific spaghetti code that determines the Org and
Contact ids and then does the manual add, creating Orgs and Contacts
as needed.

Calling this code horrific is a kindness... we're talking evil...
We've pretty much ruled this out due to the horror... but I mention
that I considered it.

2. Do a create table and populate that new table with the results of a
select distinct from the old table then swap the tables.

3. Do... "SELECT count(*) AS tCount,OrgID,ContID FROM OrgContLink
GROUP BY OrgID,ContID HAVING tCount > 1". Then for every record in the
result "DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT 1".
Then repeat until no results are found.
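
Concretely, option 2 would look something like this. The sketch below
runs against SQLite from Python purely for illustration (the real
table lives in MySQL, where the swap would be a single atomic
RENAME TABLE old TO backup, new TO old); the sample rows are made up:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Toy version of the linking table, with some duplicate linkages.
cur.execute("CREATE TABLE OrgContLink (OrgID INTEGER, ContID INTEGER)")
cur.executemany("INSERT INTO OrgContLink VALUES (?, ?)",
                [(1, 1), (1, 1), (1, 2), (2, 1), (2, 1), (2, 1)])

# Option 2: build a de-duplicated copy, then swap it into place.
cur.execute("CREATE TABLE OrgContLink_new AS "
            "SELECT DISTINCT OrgID, ContID FROM OrgContLink")
cur.execute("DROP TABLE OrgContLink")
cur.execute("ALTER TABLE OrgContLink_new RENAME TO OrgContLink")

rows = cur.execute("SELECT OrgID, ContID FROM OrgContLink "
                   "ORDER BY OrgID, ContID").fetchall()
print(rows)  # [(1, 1), (1, 2), (2, 1)]
```

(One caveat with CREATE TABLE ... AS SELECT: it copies the data but
not indexes or constraints, so those would need re-adding on the new
table.)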

I like option 2 insofar as it's more... atomic? One create...select,
one swap, and it's done. But even though it feels more "pure" I worry
that completely creating and destroying entire tables is throwing the
baby out with the bathwater. IOW: is rebuilding a whole table for a
few (hundred at most) offending duplicate records overkill?

I like option 3 in that it leaves everything as is but does require a
lot of looping and feels inefficient. However, since we'd be running
this only after we do our imports it's not like this looping
inefficient code would be running all the time.

I know I could probably also put a unique key across orgid and contid,
but unless I'm reading the docs wrong I can't add a key in such a way
that the duplicate key insertion would silently fail without requiring
the offending application to do "INSERT ... ON DUPLICATE KEY ...",
which gets back to modifying the spaghetti code from option 1.
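
The closest thing I've found is IGNORE on the INSERT itself, which
downgrades the duplicate-key error to a warning, but that still means
editing the application's INSERT, so it doesn't buy much over option
1. A toy sketch (SQLite via Python just for illustration; SQLite
spells it INSERT OR IGNORE where MySQL has plain INSERT IGNORE, and
the table here is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Unique key across the (OrgID, ContID) pair, as discussed above.
cur.execute("CREATE TABLE OrgContLink ("
            "OrgID INTEGER, ContID INTEGER, "
            "UNIQUE (OrgID, ContID))")

# With IGNORE, a duplicate insert is silently skipped instead of
# raising an error. (MySQL: INSERT IGNORE; SQLite: INSERT OR IGNORE.)
for _ in range(3):
    cur.execute("INSERT OR IGNORE INTO OrgContLink VALUES (1, 1)")

count = cur.execute("SELECT COUNT(*) FROM OrgContLink").fetchone()[0]
print(count)  # 1
```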

Thanks in advance for your advice.

Matt

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe: http://lists.mysql.com/mysql?unsub=gcdmg-mysql-2@m.gmane.org

RE: Removing Duplicate Records

on 14.07.2009 16:16:30 by Nathan Sullivan

Matt,

If you went with option #3, you could avoid the looping by using
(tCount - 1) as the LIMIT in the delete statement instead of always
using 1.
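
Something like this sketch (SQLite via Python just to show the idea;
SQLite won't accept a LIMIT directly on DELETE the way MySQL does, so
it goes through a rowid subquery here, but the (tCount - 1) part is
the same):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE OrgContLink (OrgID INTEGER, ContID INTEGER)")
cur.executemany("INSERT INTO OrgContLink VALUES (?, ?)",
                [(1, 1), (1, 1), (1, 1), (1, 1), (2, 3), (2, 3)])

dups = cur.execute(
    "SELECT COUNT(*) AS tCount, OrgID, ContID FROM OrgContLink "
    "GROUP BY OrgID, ContID HAVING tCount > 1").fetchall()

# One DELETE per duplicated pair, removing tCount - 1 copies,
# so no repeat-until-empty loop is needed.
for tcount, org, cont in dups:
    cur.execute(
        "DELETE FROM OrgContLink WHERE rowid IN ("
        "  SELECT rowid FROM OrgContLink"
        "  WHERE OrgID = ? AND ContID = ? LIMIT ?)",
        (org, cont, tcount - 1))

rows = cur.execute("SELECT OrgID, ContID FROM OrgContLink "
                   "ORDER BY OrgID, ContID").fetchall()
print(rows)  # [(1, 1), (2, 3)]
```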


Regards,
Nathan


Re: Removing Duplicate Records

on 14.07.2009 17:52:04 by Marcus Bointon


You can combine the two queries you have in option 3 (you'll need to
change field names, but you should get the idea), something like this:

DELETE table1 FROM table1,
  (SELECT MAX(id) AS dupid, COUNT(id) AS dupcnt
     FROM table1
    WHERE field1 IS NOT NULL
    GROUP BY link_id
   HAVING dupcnt > 1) AS dups
WHERE table1.id = dups.dupid;
Marcus
--
Marcus Bointon
Synchromedia Limited: Creators of http://www.smartmessages.net/
UK resellers of info@hand CRM solutions
marcus@synchromedia.co.uk | http://www.synchromedia.co.uk/




Re: Removing Duplicate Records

on 14.07.2009 19:42:57 by Brent Baisley

That's assuming that there is a unique identifier field, like an auto
increment field, although that could be added after the fact. Also,
you need to run the query multiple times until it affects no records:
the subquery picks only one MAX(id) per duplicated group, so if there
are 4 copies of a record it would need to be run 3 times to get rid of
all the dups.
But I agree, that is the best way to remove duplicates in place,
provided the table is not too large.
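
Here's that repeat-until-clean loop sketched out (SQLite via Python
purely for illustration, with the auto-increment id added as
discussed; table and column names are invented to match the earlier
posts):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# A unique id (auto-increment in MySQL) is what lets MAX(id) pick one
# victim row per duplicated group.
cur.execute("CREATE TABLE OrgContLink ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "OrgID INTEGER, ContID INTEGER)")
cur.executemany("INSERT INTO OrgContLink (OrgID, ContID) VALUES (?, ?)",
                [(1, 1)] * 4 + [(2, 3)])

# Each pass deletes at most one copy per duplicated pair, so 4 copies
# take 3 passes; loop until a pass affects no rows.
passes = 0
while True:
    cur.execute(
        "DELETE FROM OrgContLink WHERE id IN ("
        "  SELECT MAX(id) FROM OrgContLink"
        "  GROUP BY OrgID, ContID HAVING COUNT(id) > 1)")
    if cur.rowcount == 0:
        break
    passes += 1

rows = cur.execute("SELECT OrgID, ContID FROM OrgContLink "
                   "ORDER BY OrgID, ContID").fetchall()
print(passes, rows)  # 3 [(1, 1), (2, 3)]
```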

Brent Baisley


Re: Removing Duplicate Records

on 14.07.2009 22:00:42 by Matt Neimeyer

Ah... Yes. Good point. I like this because I was planning on keeping
the output somewhere for a while (in case we need an "accounting" at
some point), so it will be easy enough to dump what's being deleted to
the screen while we loop over our candidates.

Thanks!

On Tue, Jul 14, 2009 at 10:16 AM, Nathan Sullivan wrote:
> If you went with option #3, you could avoid the looping by using (tCount - 1) as the LIMIT in
> the delete statement instead of always using 1.

>> 3. Do... "SELECT count(*) AS tCount,OrgID,ContID FROM OrgContLink
>> GROUP BY OrgID,ContID HAVING tCount > 1". Then for every record in the
>> result "DELETE FROM OrgContLink WHERE OrgID=X AND ContID=Y LIMIT 1".
>> Then repeat until no results are found.
