Need help building query

Need help building query

am 05.11.2007 19:19:29 von Adrienne Boswell

Using MS SQL, ASP Classic. Here's the situation. We have a very
large database, with indexed views, separated by states. There are a
lot of "duplicate" records, eg:

Vendor table
1 Acme Widget 123 This Street SomeTown CA 91206
2 Acme Widget Co 123 This Street SomeTown CA 91206
3 Acme Widget Corp 123 This Street SomeTown CA 91206
4 Acme Widget 234 Some Street SomeTown CA 90212
5 Acme Widget Co 567 Other Street SomeTown CA 90028
6 Acme Widget Co 234 Some Street SomeTown CA 90212

By browsing this, I know that 2 is a dup and 6 are dupes. 3 looks
like it could be a dupe, but is not. The city information is actually
a zip_id which is tied to the city, state and zipcode tables.

So, I need to run a query to select the duplicates, and then delete
them. I know I am going to have to drop and create views, indexes,
etc.

Any help appreciated....

--
Adrienne Boswell at work
Administrator nextBlock.com
http://atlas.nextblock.com/files/
Please respond to the group so others can share

Re: Need help building query

am 05.11.2007 19:41:32 von exjxw.hannivoort

Adrienne Boswell wrote on 05 nov 2007 in
microsoft.public.inetserver.asp.db:

> Using MS SQL, ASP Classic. Here's the situation. We have a very
> large database, with indexed views, separated by states. There are a
> lot of "duplicate" records, eg:
>
> Vendor table
> 1 Acme Widget 123 This Street SomeTown CA 91206
> 2 Acme Widget Co 123 This Street SomeTown CA 91206
> 3 Acme Widget Corp 123 This Street SomeTown CA 91206
> 4 Acme Widget 234 Some Street SomeTown CA 90212
> 5 Acme Widget Co 567 Other Street SomeTown CA 90028
> 6 Acme Widget Co 234 Some Street SomeTown CA 90212
>
> By browsing this, I know that 2 is a dup and 6 are dupes. 3 looks
> like it could be a dupe, but is not. The city information is actually
> a zip_id which is tied to the city, state and zipcode tables.
>
> So, I need to run a query to select the duplicates, and then delete
> them. I know I am going to have to drop and create views, indexes,
> etc.

Adrienne,

Either you have to come up with a sound description of what of the near-
duplicates you accept as duplicates, or you will have to do it by hand.

It is just as all programming [with the exception of neural networks]:
you have to abstract the handwork before you can start programming.

If these firms were in a Arabian soucq, where al carpenters are brothers,
share the same name and work next to each other in the same street, there
would probably be no duplicates.

So only your local knowledge can guide you.



--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Re: Need help building query

am 05.11.2007 20:36:47 von Adrienne Boswell

On Nov 5, 11:41 am, "Evertjan." wrote:
> Adrienne Boswell wrote on 05 nov 2007 in
> microsoft.public.inetserver.asp.db:
>
>
>
> > Using MS SQL, ASP Classic. Here's the situation. We have a very
> > large database, with indexed views, separated by states. There are a
> > lot of "duplicate" records, eg:
>
> > Vendor table
> > 1 Acme Widget 123 This Street SomeTown CA 91206
> > 2 Acme Widget Co 123 This Street SomeTown CA 91206
> > 3 Acme Widget Corp 123 This Street SomeTown CA 91206
> > 4 Acme Widget 234 Some Street SomeTown CA 90212
> > 5 Acme Widget Co 567 Other Street SomeTown CA 90028
> > 6 Acme Widget Co 234 Some Street SomeTown CA 90212
>
> > By browsing this, I know that 2 is a dup and 6 are dupes. 3 looks
> > like it could be a dupe, but is not. The city information is actually
> > a zip_id which is tied to the city, state and zipcode tables.
>
> > So, I need to run a query to select the duplicates, and then delete
> > them. I know I am going to have to drop and create views, indexes,
> > etc.
>
> Adrienne,
>
> Either you have to come up with a sound description of what of the near-
> duplicates you accept as duplicates, or you will have to do it by hand.
>
> It is just as all programming [with the exception of neural networks]:
> you have to abstract the handwork before you can start programming.
>
> If these firms were in a Arabian soucq, where al carpenters are brothers,
> share the same name and work next to each other in the same street, there
> would probably be no duplicates.
>
> So only your local knowledge can guide you.
>

Thank you, Evertjan, I was afraid that was the answer... you are
always such a big help. Thank you again.

I can always tell my boss she has to pay for a new pair of glasses for
me... I'm going to need them going through 100,000+ records. Wish me
luck.

--
Adrienne Boswell at work
Administrator nextBlock.com
http://atlas.nextblock.com/files/
Please respond to the group so others can share

Re: Need help building query

am 05.11.2007 21:07:44 von exjxw.hannivoort

Adrienne Boswell wrote on 05 nov 2007 in
microsoft.public.inetserver.asp.db:

>> Adrienne,
>>
>> Either you have to come up with a sound description of what of the
>> near- duplicates you accept as duplicates, or you will have to do it
>> by hand.
>>
>> It is just as all programming [with the exception of neural
>> networks]: you have to abstract the handwork before you can start
>> programming.
>>
>> If these firms were in a Arabian soucq, where al carpenters are
>> brothers, share the same name and work next to each other in the same
>> street, there would probably be no duplicates.
>>
>> So only your local knowledge can guide you.
>>
>
> Thank you, Evertjan, I was afraid that was the answer... you are
> always such a big help. Thank you again.
>
> I can always tell my boss she has to pay for a new pair of glasses for
> me... I'm going to need them going through 100,000+ records. Wish me
> luck.
>

But:

You could build a programme
that selects probable duplicates for you to see.

count = 0
(start where I was yesterday or set counting = true)
do
dup = 0
if (first 3 zip-chars are same) then dup = dup + .5
if (first 4 zip-chars are same) then dup = dup + .5
if (first 5 zip-chars are same) then dup = dup + 1
if (house number is same then dup = dup + 1
if (first word of name is same) then dup = dup + 1
if (town name is same) then dup = dup + 1
if (telephonenumber is same) then dup = dup + 2
.... etc.
if dup>3 then // probability linit
if counting then
count = count+1
else
(show me for possible deletion)
if (done for today) then (safe where I was)
end if
end if
loop until eof or (done for today)


First do a counting run to see how many you catch.
If it is more than a 1000 make the probability more specific,
and les than a 100 lesss specific.

If you get a reasonable number,
buy your new spectacles and start deleting by hand.




--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Re: Need help building query

am 05.11.2007 21:18:09 von reb01501

Adrienne Boswell wrote:
> On Nov 5, 11:41 am, "Evertjan." wrote:
>> Adrienne Boswell wrote on 05 nov 2007 in
>> microsoft.public.inetserver.asp.db:
>>
>>
>>
>>> Using MS SQL, ASP Classic. Here's the situation. We have a very
>>> large database, with indexed views, separated by states. There are
>>> a lot of "duplicate" records, eg:
>>
>>> Vendor table
>>> 1 Acme Widget 123 This Street SomeTown CA 91206
>>> 2 Acme Widget Co 123 This Street SomeTown CA 91206
>>> 3 Acme Widget Corp 123 This Street SomeTown CA 91206
>>> 4 Acme Widget 234 Some Street SomeTown CA 90212
>>> 5 Acme Widget Co 567 Other Street SomeTown CA 90028
>>> 6 Acme Widget Co 234 Some Street SomeTown CA 90212
>>
>>> By browsing this, I know that 2 is a dup and 6 are dupes. 3 looks
>>> like it could be a dupe, but is not. The city information is
>>> actually a zip_id which is tied to the city, state and zipcode
>>> tables.
>>
>>> So, I need to run a query to select the duplicates, and then delete
>>> them. I know I am going to have to drop and create views, indexes,
>>> etc.
>>
>> Adrienne,
>>
>> Either you have to come up with a sound description of what of the
>> near- duplicates you accept as duplicates, or you will have to do it
>> by hand.
>>
>> It is just as all programming [with the exception of neural
>> networks]: you have to abstract the handwork before you can start
>> programming.
>>
>> If these firms were in a Arabian soucq, where al carpenters are
>> brothers, share the same name and work next to each other in the
>> same street, there would probably be no duplicates.
>>
>> So only your local knowledge can guide you.
>>
>
> Thank you, Evertjan, I was afraid that was the answer... you are
> always such a big help. Thank you again.
>
> I can always tell my boss she has to pay for a new pair of glasses for
> me... I'm going to need them going through 100,000+ records. Wish me
> luck.
>
You might be able to reduce the number of records you have to peer at by
running a query like:

select vendorid,vendorname from vendors where
left(vendorname,charindex(' ', vendorname)-1) in (
select left(vendorname,charindex(' ', vendorname)-1) firstword
from vendors
group by left(vendorname,charindex(' ', vendorname)-1)
having count(*) > 1)



--
Microsoft MVP -- ASP/ASP.NET
Please reply to the newsgroup. The email account listed in my From
header is my spam trap, so I don't check it very often. You will get a
quicker response by posting to the newsgroup.

Re: Need help building query

am 06.11.2007 00:45:01 von exjxw.hannivoort

Bob Barrows [MVP] wrote on 05 nov 2007 in
microsoft.public.inetserver.asp.db:

> You might be able to reduce the number of records you have to peer at by
> running a query like:
>
> select vendorid,vendorname from vendors where
> left(vendorname,charindex(' ', vendorname)-1) in (
> select left(vendorname,charindex(' ', vendorname)-1) firstword
> from vendors
> group by left(vendorname,charindex(' ', vendorname)-1)
> having count(*) > 1)
>

You are right, Bob,

but testing for couple of records with slight mismatches between fields in
this huge amount of records won't be that easy.

Perhaps only testing in groups with the same zip-code will do for starters?

--
Evertjan.
The Netherlands.
(Please change the x'es to dots in my emailaddress)

Re: Need help building query

am 06.11.2007 01:15:16 von reb01501

Evertjan. wrote:
> Bob Barrows [MVP] wrote on 05 nov 2007 in
> microsoft.public.inetserver.asp.db:
>
>> You might be able to reduce the number of records you have to peer
>> at by running a query like:
>>
>> select vendorid,vendorname from vendors where
>> left(vendorname,charindex(' ', vendorname)-1) in (
>> select left(vendorname,charindex(' ', vendorname)-1) firstword
>> from vendors
>> group by left(vendorname,charindex(' ', vendorname)-1)
>> having count(*) > 1)
>>
>
> You are right, Bob,
>
> but testing for couple of records with slight mismatches between
> fields in this huge amount of records won't be that easy.
>
> Perhaps only testing in groups with the same zip-code will do for
> starters?

That's a good starting point. It's going to take a massive effort involving
more than one person, I expect.

--
Microsoft MVP - ASP/ASP.NET
Please reply to the newsgroup. This email account is my spam trap so I
don't check it very often. If you must reply off-line, then remove the
"NO SPAM"

Re: Need help building query

am 06.11.2007 16:43:23 von Adrienne Boswell

Gazing into my crystal ball I observed "Bob Barrows [MVP]"
writing in
news:#xzfIoAIIHA.2064@TK2MSFTNGP06.phx.gbl:

> Evertjan. wrote:
>> Bob Barrows [MVP] wrote on 05 nov 2007 in
>> microsoft.public.inetserver.asp.db:
>>
>>> You might be able to reduce the number of records you have to peer
>>> at by running a query like:
>>>
>>> select vendorid,vendorname from vendors where
>>> left(vendorname,charindex(' ', vendorname)-1) in (
>>> select left(vendorname,charindex(' ', vendorname)-1) firstword
>>> from vendors
>>> group by left(vendorname,charindex(' ', vendorname)-1)
>>> having count(*) > 1)
>>>
>>
>> You are right, Bob,
>>
>> but testing for couple of records with slight mismatches between
>> fields in this huge amount of records won't be that easy.
>>
>> Perhaps only testing in groups with the same zip-code will do for
>> starters?
>
> That's a good starting point. It's going to take a massive effort
> involving more than one person, I expect.
>

Actually, I have found that phone numbers are a good place to start. If
one phone number has more than one vendor, it's probably a dupe.

Yes, I have to go through them manually, and started yesterday. I got
through about 5000 and found 10 duplicates. One of the reasons we have
this duplicate problem is that when records were imported several years
ago, the creteria was shady (I thought). For example, I would not have
imported a record that had only a phone number, and no address, simply
because if you are checking for VENDOR+ADDRESS+CITY+ZIPCODE one vendor
with no address, and one with will not look like a dupe (even though it
is).

Here's the fun part - I have a little under 3,000,000 records to comb
through.


--
Adrienne Boswell at Home
Arbpen Web Site Design Services
http://www.cavalcade-of-coding.info
Please respond to the group so others can share