Tuesday, March 27, 2012

duplicate rows

What is the best way to remove duplicate rows from a table?|||Wow, talk about cross posts...

SELECT DISTINCT Col1, Col2 INTO myTable FROM myOtherTable?|||SELECT
Seibel_Extract.Account,
Seibel_Extract.Description,
Seibel_Extract.[Alliance Bookings],
Seibel_Extract.[Concatenated Unique ID],
Seibel_Extract.[Sub Region],
Seibel_Extract.Region,
Seibel_Extract.[Update Method],
Seibel_Extract.[Solution Offering],
Seibel_Extract.Type,
Seibel_Extract.[Operate Deal],
Seibel_Extract.[% of Deal],
Seibel_Extract.[Total Bookings],
Seibel_Extract.[Converted Bookings (EURO)],
Seibel_Extract.Alliance
FROM
Seibel_Extract
WHERE
Seibel_Extract.Account IN
(
SELECT Tmp.[Account]
FROM [Seibel_Extract] AS Tmp
WHERE Tmp.[Description] = Seibel_Extract.[Description]
AND Tmp.[Alliance Bookings] = Seibel_Extract.[Alliance Bookings]
GROUP BY Tmp.[Account], Tmp.[Description], Tmp.[Alliance Bookings]
HAVING COUNT(*) > 1
)

This is the query that returns the duplicate records. I need to keep only one row out of each set of duplicates.|||What do you mean by duplicate? Are all of the columns the same?

Then add SELECT DISTINCT

But they're not, are they.

So what do you mean by duplicate?|||What is the logic you want to apply to decide which one of each set of duplicate records to keep? If the underlying rows of data are truly exact duplicates, you will need to run select distinct into a new table.
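A minimal sketch of the "select distinct into a new table" approach, assuming the dupes really are identical in every column (Seibel_Extract_Clean is a made-up name; SELECT ... INTO creates it):

-- copy one row per distinct combination of all columns into a new table
SELECT DISTINCT *
INTO Seibel_Extract_Clean
FROM Seibel_Extract

-- then drop (or rename) Seibel_Extract and rename the clean table into its place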

blindman|||Originally posted by blindman
What is the logic you want to apply to decide which one of each set of duplicate records to keep? If the underlying rows of data are truly exact duplicates, you will need to run select distinct into a new table.

blindman

You know it's going to take more than one account number being equal to make a dup.

You have to make a decision..

Rows are rows, and the totality of the columns is probably representative of an account.

The columns "relate" to each other just like they are brothers and sisters...the other rows are like cousins..

You can't just mix and match the family together...

Why do you have so many rows? Sounds like you're addressing the problematic aftermath and not the cause...|||The PK is Acc Name, Description and amount only
Any two rows with these 3 exactly equal will be treated as dupes|||Why do you have so many rows? Sounds like you're addressing the problematic aftermath and not the cause...

Yup, you are somewhat right. I get some data every week from Siebel and I have to clean the data and upload it into SQL Server. I can't do anything about the data. All I can do is a patch job|||So the data is historical by nature...is there a last updated date?

Can you add a column to the base table that has a default GetDate()...
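Something like this sketch (LoadDate is a made-up column name, and the dupe key is assumed to be the Account / Description / Alliance Bookings combination used in the query above):

-- existing rows all get the same timestamp; only future loads are differentiated
ALTER TABLE Seibel_Extract ADD LoadDate datetime NOT NULL DEFAULT GETDATE()
GO

-- keep the most recently loaded row per key
-- (rows loaded in the same batch share a GETDATE() value, so ties are still possible)
SELECT s.*
FROM Seibel_Extract AS s
WHERE s.LoadDate = (SELECT MAX(t.LoadDate)
                    FROM Seibel_Extract AS t
                    WHERE t.Account = s.Account
                      AND t.[Description] = s.[Description]
                      AND t.[Alliance Bookings] = s.[Alliance Bookings])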

Then just select the row where the date = Max(date), as in the sketch above|||No... I don't have a last updated date...
Had there been something like that, my work would be a lot easier... I have only the fact that any two rows with Acc Name, Description and amount exactly equal will be treated as dupes|||What's the DDL for Seibel_Extract?

No date... an identity column perhaps?

How do you populate the table?

(you can tell I'm running out of ideas...)|||I get an excel file which i upload through a dts into a table. No date column to go on . Identity column perhaps i can create and then drop after removing the dupes.

(you can tell I'm running out of ideas...)

I already have!!!!!!!|||So you're not appending the data, you do a full refresh every time?

Does it matter to you which row you use?

If it doesn't...

[reduced to using a cursor...oh the humanity]

How about SELECT TOP 1 in a cursor
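Roughly like this (a sketch only; #Work is a made-up work table with the same layout as Seibel_Extract, and the key columns are assumed to be non-NULL):

DECLARE @Account nvarchar(255), @Desc nvarchar(255), @Bookings float

-- one fetch per distinct key
DECLARE dup_keys CURSOR FOR
SELECT Account, [Description], [Alliance Bookings]
FROM Seibel_Extract
GROUP BY Account, [Description], [Alliance Bookings]

OPEN dup_keys
FETCH NEXT FROM dup_keys INTO @Account, @Desc, @Bookings
WHILE @@FETCH_STATUS = 0
BEGIN
    -- TOP 1 without an ORDER BY keeps an arbitrary row from each set
    INSERT INTO #Work
    SELECT TOP 1 *
    FROM Seibel_Extract
    WHERE Account = @Account
      AND [Description] = @Desc
      AND [Alliance Bookings] = @Bookings

    FETCH NEXT FROM dup_keys INTO @Account, @Desc, @Bookings
END
CLOSE dup_keys
DEALLOCATE dup_keys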

[/reduced to using a cursor...oh the humanity]

And insert the rows to a work table...|||To get rid of the duplicate rows I would add a couple of temporary extra columns to the table: an identity column to uniquely identify each row and a column to indicate which rows you want to keep.

You should then be able to run a query to flag the ones you want to keep and then another to delete those not flagged.

After that remove the temp columns.
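A rough sketch of the idea (RowId and KeepRow are made-up names for the throwaway columns; the dupe key is assumed to be Account, Description and Alliance Bookings as stated earlier):

-- add the scaffolding (GO so the new columns are visible to the next batch)
ALTER TABLE Seibel_Extract ADD RowId int IDENTITY(1,1), KeepRow bit NULL
GO

-- flag one survivor per key (here: the row with the highest RowId)
UPDATE Seibel_Extract
SET KeepRow = 1
WHERE RowId IN (SELECT MAX(RowId)
                FROM Seibel_Extract
                GROUP BY Account, [Description], [Alliance Bookings])

-- delete everything that was not flagged, then drop the scaffolding
DELETE Seibel_Extract WHERE KeepRow IS NULL
ALTER TABLE Seibel_Extract DROP COLUMN RowId, KeepRow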

As for stopping future data from corrupting your new fresh table...

Change your DTS package so that when it inserts a record it checks to see if the record already exists; if it exists then skip the row, otherwise insert it.
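The same check expressed in T-SQL, as a sketch: assume the weekly file is first loaded into a made-up Staging_Extract table with the same layout as Seibel_Extract (and already de-duped within itself):

-- append only rows whose key is not already in the main table
INSERT INTO Seibel_Extract
SELECT s.*
FROM Staging_Extract AS s
WHERE NOT EXISTS (SELECT *
                  FROM Seibel_Extract AS e
                  WHERE e.Account = s.Account
                    AND e.[Description] = s.[Description]
                    AND e.[Alliance Bookings] = s.[Alliance Bookings])

-- once the table is clean, a unique index stops the dupes coming back
-- (note: two nvarchar(255) columns can push the key past the 900-byte index limit for long values)
CREATE UNIQUE INDEX IX_Seibel_Extract_Key
ON Seibel_Extract (Account, [Description], [Alliance Bookings])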

If you need more help with this let me know and I will help some more.|||Originally posted by rokslide
To get rid of the duplicate rows I would add a couple of temporary extra columns to the table... [snip]

Just curious...what algorithm would you suggest he use to pick a row?

Max(Identity)?|||Had to use a cursor finally ...
I hate doing this ... the proc is now taking 15 seconds to run .. but i guess will do for the time being
Will think of some other way in the meantime.|||If they are duplicate rows then it doesn't really matter which one he keeps right? If not then they are hardly duplicates...

You could use MAX or you could use TOP 1; either way would work. In the case I was dealing with recently I used TOP 1, and both of the rows were as valid as each other.

Personally, if the data is coming in every day I would change the DTS package so that it checks for duplicates before doing any inserts. Much better doing it that way than trying to deal with it after you have shoved the data into the table. You can then put a unique index on the unique values and ensure that you don't get duplicates again...|||There can't be any need to use a cursor for this. What about rokslide's suggestion? It should be possible to handle it all in a stored proc, with perhaps the use of a staging table.

blindman|||I'd agree that a cursor is over kill,... I guess it depends what you feel comfortable with though... remember you have to maintain/correct this when it is needed...

I don't know what DTS method you are using to transfer the data across, but if you use an ActiveX script in your data pump/data driven query you can utilise the lookup facility to check for existing records before inserting.|||Originally posted by rokslide
I'd agree that a cursor is overkill... [snip]

To delete duplicate records where it does not matter which of the duplicates is deleted, you can use one SQL statement. For example:

DELETE
FROM MyTable
WHERE MyTableID NOT IN
-- keep the highest ID in each group; note there is no HAVING clause,
-- otherwise rows that are not duplicated at all would be deleted too
(SELECT MAX(MyTableID)
FROM MyTable
GROUP BY MyField_1, MyField_2)

The subquery identifies the highest ID number for each combination of the GROUP BY fields; that row is kept, and all other rows with that combination are deleted.|||The Acc Name, Description and amount are the PK...

There is no identity

There is no Add datetime column

It's a full replacement file from a 3rd party...

Max of what?

He'd have to add an identity column (or datetime) and then pick arbitrarily (which is what the cursor does) anyway...

And they are NOT duplicates, in that sense...he has more than one row per PK...the other values are different...

And yes, if he added an identity, then the query listed above would work nicely (as nicely as can be for the happenstance of where the rows get loaded)

And the results may not be repeatable, depending on where SQL Server decides to place the data on the page...

My own opinion (MOO)

I would ask the operational owner what they want to do, since it's not a technical issue.|||Originally posted by Brett Kaiser
The Acc Name, Description and amount are the PK... [snip]

If the source of the duplicate data cannot be changed then load the data into a staging table that has an identity (auto number) PK. Then apply this SQL:

DELETE
FROM MyStagingTable
WHERE MyAutoID NOT IN
(SELECT MAX(MyAutoID)
FROM MyStagingTable
GROUP BY [Acc Name], [Description], [Amount])

You will then be left with a table of unique records that can then be copied to your destination table.|||Thanks for all your help...

Here's what did it:

ALTER PROCEDURE sp_remove_duplicates AS

CREATE TABLE #Seibel_Duplicates (
[Id] int identity (1,1),
[Concatenated Unique ID] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Account] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Description] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Sub Region] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Region] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Update Method] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Solution Offering] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Type] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Operate Deal] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[% of Deal] [float] NULL ,
[Alliance Bookings] [float] NULL ,
[Converted Bookings (EURO)] [float] NULL ,
[Total Bookings] [float] NULL ,
[Alliance Weighted Bookings] [float] NULL ,
[Weighted] [float] NULL ,
[Opty Currency] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Pipeline Stage] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Sales Stage] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Contract Sign/Sales Stage Date] [datetime] NULL ,
[Calendar Year] [float] NULL ,
[Calendar Quarter] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Service Line] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[2002 Primary BU] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Selling BU] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Parent Sector] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Sector] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Work Location] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Alliance] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Alliance Count] [float] NULL ,
[Does This Exist in Pipeline] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Service Offer] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Segment] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Industry] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Sector*] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
[Country] [nvarchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL
) ON [PRIMARY]

insert into #Seibel_Duplicates
-- the column list is needed so that the identity column [Id] is skipped
([Concatenated Unique ID], [Account], [Description], [Sub Region], [Region],
[Update Method], [Solution Offering], [Type], [Operate Deal], [% of Deal],
[Alliance Bookings], [Converted Bookings (EURO)], [Total Bookings],
[Alliance Weighted Bookings], [Weighted], [Opty Currency], [Pipeline Stage],
[Sales Stage], [Contract Sign/Sales Stage Date], [Calendar Year],
[Calendar Quarter], [Service Line], [2002 Primary BU], [Selling BU],
[Parent Sector], [Sector], [Work Location], [Alliance], [Alliance Count],
[Does This Exist in Pipeline], [Service Offer], [Segment], [Industry],
[Sector*], [Country])
select
[Concatenated Unique ID],
[Account],
[Description],
[Sub Region],
[Region],
[Update Method],
[Solution Offering],
[Type],
[Operate Deal],
[% of Deal],
[Alliance Bookings],
[Converted Bookings (EURO)],
[Total Bookings],
[Alliance Weighted Bookings],
[Weighted],
[Opty Currency],
[Pipeline Stage],
[Sales Stage],
[Contract Sign/Sales Stage Date],
[Calendar Year],
[Calendar Quarter],
[Service Line],
[2002 Primary BU],
[Selling BU],
[Parent Sector],
[Sector],
[Work Location],
[Alliance],
[Alliance Count],
[Does This Exist in Pipeline],
[Service Offer],
[Segment],
[Industry],
[Sector*],
[Country]
from
Seibel_Extract
WHERE
Seibel_Extract.Account IN
(
SELECT Tmp.[Account]
FROM [Seibel_Extract] AS Tmp
WHERE Tmp.[Description] = Seibel_Extract.[Description]
AND Tmp.[Alliance Bookings] = Seibel_Extract.[Alliance Bookings]
GROUP BY Tmp.[Account], Tmp.[Description], Tmp.[Alliance Bookings]
HAVING COUNT(*) > 1
)

DELETE
#Seibel_Duplicates
-- keep the highest Id per (Account, Description, Alliance Bookings) group;
-- every group in this table has at least two rows, so no HAVING is needed
WHERE [Id] NOT IN
(SELECT MAX([Id])
FROM #Seibel_Duplicates
GROUP BY [Account], [Description], [Alliance Bookings])

delete
Seibel_Extract
WHERE
Seibel_Extract.Account IN
(
SELECT Tmp.[Account]
FROM [Seibel_Extract] AS Tmp
WHERE Tmp.[Description] = Seibel_Extract.[Description]
AND Tmp.[Alliance Bookings] = Seibel_Extract.[Alliance Bookings]
GROUP BY Tmp.[Account], Tmp.[Description], Tmp.[Alliance Bookings]
HAVING COUNT(*) > 1
)

insert into Seibel_Extract
-- explicit column list guards against a column-order mismatch with the select below
([Concatenated Unique ID], [Account], [Description], [Sub Region], [Region],
[Update Method], [Solution Offering], [Type], [Operate Deal], [% of Deal],
[Alliance Bookings], [Converted Bookings (EURO)], [Total Bookings],
[Alliance Weighted Bookings], [Weighted], [Opty Currency], [Pipeline Stage],
[Sales Stage], [Contract Sign/Sales Stage Date], [Calendar Year],
[Calendar Quarter], [Service Line], [2002 Primary BU], [Selling BU],
[Parent Sector], [Sector], [Work Location], [Alliance], [Alliance Count],
[Does This Exist in Pipeline], [Service Offer], [Segment], [Industry],
[Sector*], [Country])
select
[Concatenated Unique ID],
[Account],
[Description],
[Sub Region],
[Region],
[Update Method],
[Solution Offering],
[Type],
[Operate Deal],
[% of Deal],
[Alliance Bookings],
[Converted Bookings (EURO)],
[Total Bookings],
[Alliance Weighted Bookings],
[Weighted],
[Opty Currency],
[Pipeline Stage],
[Sales Stage],
[Contract Sign/Sales Stage Date],
[Calendar Year],
[Calendar Quarter],
[Service Line],
[2002 Primary BU],
[Selling BU],
[Parent Sector],
[Sector],
[Work Location],
[Alliance],
[Alliance Count],
[Does This Exist in Pipeline],
[Service Offer],
[Segment],
[Industry],
[Sector*],
[Country]
from
#seibel_duplicates
