Tuesday, March 27, 2012

Duplicate records

How do we check for duplicate records without using the Sort transformation (remove duplicates)?

I need to remove duplicates based on four columns.

Please let me know.

Are the duplicates coming from a SQL query?

The sort transformation is really the best option, in my opinion, unless you can issue a "SELECT DISTINCT" in your source pull.|||

The source is a text file. Is it possible to use Sort, as I am deciding distinctness based on four columns?

For each incoming row, I will be looking at four columns to decide if the records are distinct.

|||

sureshv wrote:

The source is a text file. Is it possible to use Sort, as I am deciding distinctness based on four columns?

For each incoming row, I will be looking at four columns to decide if the records are distinct.

Yes, pick your four columns and sort by them, then check the option to remove rows with duplicate sort values. It will remove duplicates based on the chosen sort columns.

However, do you need to pick one row over another, or can you discard any row without knowing which row you are keeping?|||You can do this with a synchronous, non-blocking script component pretty easily, especially if the data is already sorted by the columns you want to check.

The following code concatenates the columns you want to test for uniqueness into one string key. If the key of the current row differs from the key of the previous row, the row is redirected to the output; otherwise it gets dropped. This is so fast and simple that it's worth sorting your data on these columns as it comes from the source.

Code Snippet

' Holds the key of the previous row. This assumes the input is
' sorted on the key columns.
Private PreviousKey As String = ""

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    ' The column names (FirstName, MiddleName, LastName, EmailAddress) are
    ' examples only; substitute the four columns that define uniqueness.
    Dim ThisKey As String = Row.FirstName + "_" + Row.MiddleName + "_" + Row.LastName + "_" + Row.EmailAddress
    If ThisKey <> PreviousKey Then
        Row.DirectRowToOutput0()  ' first occurrence: send the row on
        PreviousKey = ThisKey     ' later rows with this key are dropped
    End If
End Sub


If the data cannot be sorted, then instead of comparing ThisKey to PreviousKey, you would check for the existence of the current key in a hash table. If the key doesn't exist, you redirect the row and add the key to the table. This will get slow if you have many unique values, and you must have enough memory to hold all the keys.
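
Here is a minimal sketch of that unsorted-input variant, using a Dictionary as the hash table (the column names are the same hypothetical examples as above):

Code Snippet

' Tracks every key seen so far; memory use grows with the number of unique keys.
Private SeenKeys As New System.Collections.Generic.Dictionary(Of String, Object)

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Dim ThisKey As String = Row.FirstName + "_" + Row.MiddleName + "_" + Row.LastName + "_" + Row.EmailAddress
    If Not SeenKeys.ContainsKey(ThisKey) Then
        Row.DirectRowToOutput0()        ' first occurrence: pass it through
        SeenKeys.Add(ThisKey, Nothing)  ' remember the key for later rows
    End If
End Sub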

|||Yet another option is to stage your data file into a SQL Server table and then SELECT DISTINCT from there.
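
A minimal sketch of that approach, assuming the file has already been staged into a hypothetical table dbo.StagedFile, with Col1 through Col4 as the four columns that define uniqueness:

Code Snippet

SELECT DISTINCT Col1, Col2, Col3, Col4
FROM dbo.StagedFile;

|||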

JayH wrote:

You can do this with a synchronous, non-blocking script component pretty easily, especially if the data is already sorted by the columns you want to check.

Never mind. I was too slow with my post. If the source is a text file, then you should just use the Sort component.
|||

If I use Sort, it does remove duplicates. Can I store the removed duplicates in a table? Is it possible?

|||

sureshv wrote:

If I use Sort, it does remove duplicates. Can I store the removed duplicates in a table? Is it possible?

The Sort REMOVES duplicates; you can't redirect them. Instead, if you want to do what you've just stated, use the Sort transformation but leave the remove-duplicates box UNCHECKED. Then use JayH's code above to check for duplicates and redirect the rows that are duplicates.
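
A sketch of what that could look like, assuming the script component is given a second output ("Output 1") in the same ExclusionGroup as Output 0, with that output wired to a destination table (column names are hypothetical, input sorted on them):

Code Snippet

Private PreviousKey As String = ""

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
    Dim ThisKey As String = Row.FirstName + "_" + Row.MiddleName + "_" + Row.LastName + "_" + Row.EmailAddress
    If ThisKey <> PreviousKey Then
        Row.DirectRowToOutput0()  ' first occurrence: main output
        PreviousKey = ThisKey
    Else
        Row.DirectRowToOutput1()  ' duplicate: output bound to your table
    End If
End Sub

|||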

sureshv wrote:

If I use Sort, it does remove duplicates. Can I store the removed duplicates in a table? Is it possible?

Nope. That is the bad thing about the Sort transform technique: you don't have control over which rows get discarded. I always stage data in tables; then I can use SQL to handle that. In SQL Server 2005 and Oracle, for example, I use the RANK() function to detect and enumerate the duplicates. Then in the data flow I use a conditional split to let all the myRank = 1 rows pass and redirect the rest to an error table if that is required.
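
A minimal sketch of that technique, assuming a hypothetical staging table dbo.StagedFile with an identity column StageID and Col1 through Col4 as the columns that define a duplicate:

Code Snippet

SELECT Col1, Col2, Col3, Col4,
       RANK() OVER (PARTITION BY Col1, Col2, Col3, Col4
                    ORDER BY StageID) AS myRank
FROM dbo.StagedFile;
-- myRank = 1 marks the row to keep; the conditional split redirects the rest.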

If you don't use staging tables, you can still accomplish it using a script component.

|||

I have one problem. I use the Sort component and it tries to get all the data from the source. I connected my Sort output to the script as you said, but I don't see rows coming out of the Sort output; the Sort stays yellow and it doesn't go to the next step. What does that mean? I unchecked remove duplicates and selected the 4 columns I want, with sort order 1, 2, 3, 4. Is something wrong with the text file, or have I not set it up right?

I really appreciate all the help from you guys. Thank you very much.

|||

sureshv wrote:

I have one problem. I use the Sort component and it tries to get all the data from the source. I connected my Sort output to the script as you said, but I don't see rows coming out of the Sort output; the Sort stays yellow and it doesn't go to the next step. What does that mean? I unchecked remove duplicates and selected the 4 columns I want, with sort order 1, 2, 3, 4. Is something wrong with the text file, or have I not set it up right?

I really appreciate all the help from you guys. Thank you very much.

The Sort is a resource-intensive, fully blocking transformation. It has to consume all rows before it can sort/dedupe them, so you won't see any rows in the output until every row has been consumed by the Sort. If the data set is big and/or the memory available to SSIS is small, the package can perform poorly.

|||

The incoming data seems to have 2,190,234 rows. I know that the Sort will wait for all incoming rows and then sort them. So what do I need to do to increase its performance? Is there a more efficient method to make it run faster?

Please let me know.

Thanks again.

|||

sureshv wrote:

The incoming data seems to have 2,190,234 rows. I know that the Sort will wait for all incoming rows and then sort them. So what do I need to do to increase its performance? Is there a more efficient method to make it run faster?

Please let me know.

Thanks again.

Stage the data first in SQL Server; it is another option that I gave you earlier today. Then SELECT from the staging table using an ORDER BY clause on your four columns, and use Jay's approach to determine whether each incoming record is a duplicate.
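
For example (again assuming a hypothetical staging table dbo.StagedFile with key columns Col1 through Col4):

Code Snippet

SELECT *
FROM dbo.StagedFile
ORDER BY Col1, Col2, Col3, Col4;  -- sorted input for the script component

|||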

Phil Brammer wrote:

sureshv wrote:

The incoming data seems to have 2,190,234 rows. I know that the Sort will wait for all incoming rows and then sort them. So what do I need to do to increase its performance? Is there a more efficient method to make it run faster?

Please let me know.

Thanks again.

Stage the data first in SQL Server; it is another option that I gave you earlier today. Then SELECT from the staging table using an ORDER BY clause on your four columns, and use Jay's approach to determine whether each incoming record is a duplicate.

If you are going to stage the data in a SQL Server 2005 table, I would use the T-SQL RANK() function and a conditional split in the data flow to deduplicate and redirect the unwanted rows, as I explained in my other post above.
