Cleaning data - getting rid of duplicate rows

Cleaning data - getting rid of duplicate rows

am 20.08.2007 13:47:50 von teddysnips

Data Cleansing:

In the example (SQL Server DDL below) there are two tables -
ExampleCustomer, and ExampleCar.

ExampleCar is a lookup table. The ExampleCustomer table has a foreign
key to the ExampleCar table. There should be two rows in there, "Ford
Focus", and "Ford Galaxy". This table populates a drop-down list in
the application, ordered by CarID. So a user is adding a new Customer
record, selects a car from the list, and hey Presto!

Someone came along and messed with the data, so now there are two rows
for each car. Yes, I know I should have set the CarType column
unique, but I didn't.

The user has done what users always do - work around the problem. So
she has added some customers using the first occurrence of the car in
the drop-down list, except for the last record when she unaccountably
selected the second occurrence.

I have been tasked with cleaning up the data. The object is to delete
all duplicate rows, but without leaving any orphaned references. At
the end of the exercise the rows in the tables should be:

CarID CarType
1 Ford Focus
3 Ford Galaxy

PersonName CarID
Adam Smith 1
Ben Smith 1
Colin Smith 3
Dave Smith 3
Edward Smith 3
Fred Smith 3

Note that the CarID for Fred Smith has been updated to the
predominating row for the Ford Galaxy type.

Is this clear? Can anyone see a way of doing this in a small number
of queries? I could think of a way of doing it using cursors etc.,
but the prevailing wisdom is that there's ALWAYS a way to do it
without using cursors.

Thanks

Edward


DDL
=========================
if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCar]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCar]
GO

if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
[ExampleCustomer]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
drop table [dbo].[ExampleCustomer]
GO

CREATE TABLE [dbo].[ExampleCar] (
[CarID] [int] IDENTITY (1, 1) NOT NULL ,
[CarType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
NULL
) ON [PRIMARY]
GO

CREATE TABLE [dbo].[ExampleCustomer] (
[PersonName] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS
NULL ,
[CarID] [int] NULL
) ON [PRIMARY]
GO

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Focus')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCar (CarType)
VALUES ('Ford Galaxy')

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Adam Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Ben Smith', 1)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Colin Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Dave Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Edward Smith', 3)

INSERT INTO ExampleCustomer (PersonName, CarID)
VALUES ('Fred Smith', 4)

Re: Cleaning data - getting rid of duplicate rows

am 20.08.2007 14:07:03 von jhofmeyr

On 20 Aug, 12:47, teddysn...@hotmail.com wrote:
> Data Cleansing:
>
> In the example (SQL Server DDL below) there are two tables -
> ExampleCustomer, and ExampleCar.
>
> ExampleCar is a lookup table. The ExampleCustomer table has a foreign
> key to the ExampleCar table. There should be two rows in there, "Ford
> Focus", and "Ford Galaxy". This table populates a drop-down list in
> the application, ordered by CarID. So a user is adding a new Customer
> record, selects a car from the list, and hey Presto!
>
> Someone came along and messed with the data, so now there are two rows
> for each car. Yes, I know I should have set the CarType column
> unique, but I didn't.
>
> The user has done what users always do - work around the problem. So
> she has added some customers using the first occurrence of the car in
> the drop-down list, except for the last record when she unaccountably
> selected the second occurrence.
>
> I have been tasked with cleaning up the data. The object is to delete
> all duplicate rows, but without leaving any orphaned references. At
> the end of the exercise the rows in the tables should be:
>
> CarID CarType
> 1 Ford Focus
> 3 Ford Galaxy
>
> PersonName CarID
> Adam Smith 1
> Ben Smith 1
> Colin Smith 3
> Dave Smith 3
> Edward Smith 3
> Fred Smith 3
>
> Note that the CarID for Fred Smith has been updated to the
> predominating row for the Ford Galaxy type.
>
> Is this clear? Can anyone see a way of doing this in a small number
> of queries? I could think of a way of doing it using cursors etc.,
> but the prevailing wisdom is that there's ALWAYS a way to do it
> without using cursors.
>
> Thanks
>
> Edward
>
> DDL
> =========================
> if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
> [ExampleCar]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
> drop table [dbo].[ExampleCar]
> GO
>
> if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
> [ExampleCustomer]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
> drop table [dbo].[ExampleCustomer]
> GO
>
> CREATE TABLE [dbo].[ExampleCar] (
> [CarID] [int] IDENTITY (1, 1) NOT NULL ,
> [CarType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
> NULL
> ) ON [PRIMARY]
> GO
>
> CREATE TABLE [dbo].[ExampleCustomer] (
> [PersonName] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS
> NULL ,
> [CarID] [int] NULL
> ) ON [PRIMARY]
> GO
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Focus')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Focus')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Galaxy')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Galaxy')
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Adam Smith', 1)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Ben Smith', 1)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Colin Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Dave Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Edward Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Fred Smith', 4)


SELECT
ecust.PersonName
, (
SELECT MIN(CarID)
FROM [ExampleCar]
WHERE [CarType] IN (
SELECT [CarType]
FROM [ExampleCar]
WHERE CarID = ecust.CarID
)
) AS CarID
FROM [ExampleCustomer] ecust

Not pretty but that should get you started.

Good Luck
J

Re: Cleaning data - getting rid of duplicate rows

am 20.08.2007 14:07:03 von jhofmeyr

On 20 Aug, 12:47, teddysn...@hotmail.com wrote:
> Data Cleansing:
>
> In the example (SQL Server DDL below) there are two tables -
> ExampleCustomer, and ExampleCar.
>
> ExampleCar is a lookup table. The ExampleCustomer table has a foreign
> key to the ExampleCar table. There should be two rows in there, "Ford
> Focus", and "Ford Galaxy". This table populates a drop-down list in
> the application, ordered by CarID. So a user is adding a new Customer
> record, selects a car from the list, and hey Presto!
>
> Someone came along and messed with the data, so now there are two rows
> for each car. Yes, I know I should have set the CarType column
> unique, but I didn't.
>
> The user has done what users always do - work around the problem. So
> she has added some customers using the first occurrence of the car in
> the drop-down list, except for the last record when she unaccountably
> selected the second occurrence.
>
> I have been tasked with cleaning up the data. The object is to delete
> all duplicate rows, but without leaving any orphaned references. At
> the end of the exercise the rows in the tables should be:
>
> CarID CarType
> 1 Ford Focus
> 3 Ford Galaxy
>
> PersonName CarID
> Adam Smith 1
> Ben Smith 1
> Colin Smith 3
> Dave Smith 3
> Edward Smith 3
> Fred Smith 3
>
> Note that the CarID for Fred Smith has been updated to the
> predominating row for the Ford Galaxy type.
>
> Is this clear? Can anyone see a way of doing this in a small number
> of queries? I could think of a way of doing it using cursors etc.,
> but the prevailing wisdom is that there's ALWAYS a way to do it
> without using cursors.
>
> Thanks
>
> Edward
>
> DDL
> =========================
> if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
> [ExampleCar]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
> drop table [dbo].[ExampleCar]
> GO
>
> if exists (select * from dbo.sysobjects where id = object_id(N'[dbo].
> [ExampleCustomer]') and OBJECTPROPERTY(id, N'IsUserTable') = 1)
> drop table [dbo].[ExampleCustomer]
> GO
>
> CREATE TABLE [dbo].[ExampleCar] (
> [CarID] [int] IDENTITY (1, 1) NOT NULL ,
> [CarType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NOT
> NULL
> ) ON [PRIMARY]
> GO
>
> CREATE TABLE [dbo].[ExampleCustomer] (
> [PersonName] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS
> NULL ,
> [CarID] [int] NULL
> ) ON [PRIMARY]
> GO
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Focus')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Focus')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Galaxy')
>
> INSERT INTO ExampleCar (CarType)
> VALUES ('Ford Galaxy')
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Adam Smith', 1)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Ben Smith', 1)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Colin Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Dave Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Edward Smith', 3)
>
> INSERT INTO ExampleCustomer (PersonName, CarID)
> VALUES ('Fred Smith', 4)


SELECT
ecust.PersonName
, (
SELECT MIN(CarID)
FROM [ExampleCar]
WHERE [CarType] IN (
SELECT [CarType]
FROM [ExampleCar]
WHERE CarID = ecust.CarID
)
) AS CarID
FROM [ExampleCustomer] ecust

Not pretty but that should get you started.

Good Luck
J

Re: Cleaning data - getting rid of duplicate rows

am 20.08.2007 23:52:37 von Erland Sommarskog

(teddysnips@hotmail.com) writes:
> I have been tasked with cleaning up the data. The object is to delete
> all duplicate rows, but without leaving any orphaned references. At
> the end of the exercise the rows in the tables should be:
>
> CarID CarType
> 1 Ford Focus
> 3 Ford Galaxy
>
> PersonName CarID
> Adam Smith 1
> Ben Smith 1
> Colin Smith 3
> Dave Smith 3
> Edward Smith 3
> Fred Smith 3
>
> Note that the CarID for Fred Smith has been updated to the
> predominating row for the Ford Galaxy type.

UPDATE ExampleCustomer
SET CarID = Mincar.CarID
FROM ExampleCustomer C
JOIN ExampleCar Car ON C.CarID = Car.CarID
JOIN (SELECT CarType, CarID = MIN(CarID)
FROM ExampleCar
GROUP BY CarType) AS Mincar ON Mincar.CarType = Car.CarType

DELETE ExampleCar
FROM ExampleCar C
WHERE EXISTS (SELECT *
FROM ExampleCar C1
WHERE C1.CarType = C.CarType
AND C.CarID > C1.CarID)


There is no foreign declared in the example, but I hope that there is
in the real case.

Of course, risk is that your uzer have spelt the duplicates differently.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downlo ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books .mspx

Re: Cleaning data - getting rid of duplicate rows

am 20.08.2007 23:52:37 von Erland Sommarskog

(teddysnips@hotmail.com) writes:
> I have been tasked with cleaning up the data. The object is to delete
> all duplicate rows, but without leaving any orphaned references. At
> the end of the exercise the rows in the tables should be:
>
> CarID CarType
> 1 Ford Focus
> 3 Ford Galaxy
>
> PersonName CarID
> Adam Smith 1
> Ben Smith 1
> Colin Smith 3
> Dave Smith 3
> Edward Smith 3
> Fred Smith 3
>
> Note that the CarID for Fred Smith has been updated to the
> predominating row for the Ford Galaxy type.

UPDATE ExampleCustomer
SET CarID = Mincar.CarID
FROM ExampleCustomer C
JOIN ExampleCar Car ON C.CarID = Car.CarID
JOIN (SELECT CarType, CarID = MIN(CarID)
FROM ExampleCar
GROUP BY CarType) AS Mincar ON Mincar.CarType = Car.CarType

DELETE ExampleCar
FROM ExampleCar C
WHERE EXISTS (SELECT *
FROM ExampleCar C1
WHERE C1.CarType = C.CarType
AND C.CarID > C1.CarID)


There is no foreign declared in the example, but I hope that there is
in the real case.

Of course, risk is that your uzer have spelt the duplicates differently.

--
Erland Sommarskog, SQL Server MVP, esquel@sommarskog.se

Books Online for SQL Server 2005 at
http://www.microsoft.com/technet/prodtechnol/sql/2005/downlo ads/books.mspx
Books Online for SQL Server 2000 at
http://www.microsoft.com/sql/prodinfo/previousversions/books .mspx