Tuesday, March 27, 2012

Duplicate records - difference method?

Hi,
My scenario is that i have 2 system with name and adress (100.000 names),
that have to be merged into 1 system without any duplicates.
The problem is that the spelling is not 100% between the system.
One way to find duplicate is to group name,adress and count > 1.
My dream is to use the sound index "Difference" so can i get around the
spelling problem.
DIFFERENCE
Returns the difference between the SOUNDEX values of two character
expressions as an integer.
Syntax
DIFFERENCE ( character_expression , character_expression )
Is that possible to use DIFFERENCE to find duplicates?
And how should the t-sql look like?
Example
name adress city
charles way1 state1
charle waj1 stat1
charlez vay1 stat1
I want to find this example, that this 3 is duplicates.
Should i use ordinary way with group and count >1, this would not be
duplicates.
Help
Thanx
TwSo you want these to be considered duplicates:
FirstName LastName
Jon Smyth
John Smith
Jonathan Smythe
John Smyth
'|||Hi,
Yes, i want these to be considered duplicates.
But i also want to validate adress and city to see if these is duplicates.
FirstName LastName Adress City
Jon Smyth Street 1 Palace1
John Smith Stret1 Palac1
Jonathan Smythe Stret 1 Palac 1
John Smyth Street1 Palace 1
Who can i do that in a t-sql?
"Aaron Bertrand [SQL Server MVP]" <ten.xoc@.dnartreb.noraa> skrev i
meddelandet news:uey1ZpUkFHA.2852@.TK2MSFTNGP15.phx.gbl...
> So you want these to be considered duplicates:
> FirstName LastName
> Jon Smyth
> John Smith
> Jonathan Smythe
> John Smyth
> '
>|||>> have 2 system with name and adress (100.000 names), that have to be mer
ged into 1 system without any duplicates. The problem is that the spelling i
s not 100% between the system. <<
Look up Melissa Data and get their software. Life is too short to
re-invent the wheel.|||But it cost a lot of money.
It should not be so difficult to do it in sql server.
I looking for a little help and after that i fix it, i hope :)
For the soundindex and difference function is in SQL Server.
Can i only run a t-sql and get out the difference integer, after that i fix
a algorithm.
But how should i run the t-sql to validate these values?
// tw
"--CELKO--" <jcelko212@.earthlink.net> skrev i meddelandet
news:1122318950.891968.136790@.g14g2000cwa.googlegroups.com...
> Look up Melissa Data and get their software. Life is too short to
> re-invent the wheel.
>|||>> But it cost a lot of money.<<
How much does doing it wrong cost you? How much is your time worth?|||> But how should i run the t-sql to validate these values?
T-SQL is supplying you the difference and soundex values, no?
So, you must do one of three things:
(a) determine beforehand what soundex/difference level means duplicate;
(b) inspect the results manually and make decisions; or,
(c) get software that does it right, and you might get some sleep at night.

No comments:

Post a Comment