Sunday 31 August 2014

Delete duplicate records in SQL for specific columns

Deleting duplicate records from a table is a very common task in day-to-day work. This post shows how easily we can delete duplicate records with ROW_NUMBER() and partition by, supported by SQL Server 2005 and later versions, and then with a self-join and a subquery.
Say the table name is Address and it has the following records:
AddressID   CustomerID  Address  City    Zip        Country
1           100         A        B       1          USA
2           100         A        B       1          USA
3           111         A        B       1          USA
4           101         B        C       2          USA
5           101         B        C       2          USA
6           112         B        C       2          USA
7           103         C        D       3          USA
8           103         C        D       3          USA
9           113         C        D       3          USA
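To follow along, the sample table can be recreated with a short script. This is only a sketch: the post shows column names and values but not data types, so the types, sizes and the primary key below are assumptions.

```sql
-- Assumed schema: data types and the primary key are not given in the post.
CREATE TABLE Address (
    AddressID  INT         NOT NULL PRIMARY KEY,
    CustomerID INT         NOT NULL,
    Address    VARCHAR(50) NOT NULL,
    City       VARCHAR(50) NOT NULL,
    Zip        VARCHAR(10) NOT NULL,
    Country    VARCHAR(50) NOT NULL
);

-- One INSERT per row so the script also runs on SQL Server 2005,
-- which does not support multi-row VALUES lists.
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (1, 100, 'A', 'B', '1', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (2, 100, 'A', 'B', '1', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (3, 111, 'A', 'B', '1', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (4, 101, 'B', 'C', '2', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (5, 101, 'B', 'C', '2', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (6, 112, 'B', 'C', '2', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (7, 103, 'C', 'D', '3', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (8, 103, 'C', 'D', '3', 'USA');
INSERT INTO Address (AddressID, CustomerID, Address, City, Zip, Country) VALUES (9, 113, 'C', 'D', '3', 'USA');
```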
If we check for duplicates on the basis of the CustomerID, Address, City, Zip and Country columns, records 1 and 2 are the same, 4 and 5 are the same, and 7 and 8 are the same. To check this in SQL Server, execute the following query:
select *, 
ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country 
  Order By addressId) [ranked] from address
Records with “ranked” > 1 are the duplicates. So how do we delete them? Simple: we delete all the records where ranked > 1. In this case the records with AddressID 2, 5 and 8 should be deleted. Let’s write a query to check:
With ranked_records AS(
   select *, 
    ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country 
    Order By addressId) [ranked] 
    from address)
select * from ranked_records
where ranked > 1
And here is the result:
AddressID   CustomerID  Address  City    Zip        Country   ranked
2           100         A        B       1          USA       2
5           101         B        C       2          USA       2
8           103         C        D       3          USA       2
So we are getting the correct records. Now change the last line to delete the records rather than select them, from
select * from ranked_records where ranked > 1
to
delete ranked_records where ranked > 1
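Put together, the delete might look like the sketch below. Wrapping it in a transaction is an addition, not part of the original query; it lets you inspect the result before making it permanent.

```sql
BEGIN TRANSACTION;

-- Rank the rows inside each group of identical values; the lowest AddressID gets ranked = 1.
WITH ranked_records AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, Address, City, Zip, Country
                           ORDER BY AddressID) AS ranked
    FROM Address
)
-- Deleting through the CTE removes the underlying rows from Address.
DELETE FROM ranked_records
WHERE ranked > 1;

-- Inspect what is left; with the sample data, AddressID 2, 5 and 8 should be gone.
SELECT * FROM Address ORDER BY AddressID;

-- Undo the delete. Replace this with COMMIT TRANSACTION once the result looks right.
ROLLBACK TRANSACTION;
```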
Up to now we checked for duplicates on the columns CustomerID, Address, City, Zip and Country, but what if we want to check on the basis of only three columns (Address, City and Zip)? In that case we need only a small change in our query: remove the other columns from the partition by. Let's see this:
select *, 
ROW_NUMBER() OVER(Partition By  address, city, zip 
  Order By addressId) [ranked] from address
And here is the result:
AddressID   CustomerID  Address  City    Zip        Country   Ranked
1           100         A        B       1          USA        1
2           100         A        B       1          USA        2
3           111         A        B       1          USA        3
4           101         B        C       2          USA        1
5           101         B        C       2          USA        2
6           112         B        C       2          USA        3
7           103         C        D       3          USA        1
8           103         C        D       3          USA        2
9           113         C        D       3          USA        3
It shows that records 1, 2 and 3 are duplicates of each other, as are 4, 5 and 6, and 7, 8 and 9. To delete the duplicates we use the ranked column with ranked > 1, as we did in the previous example.
In all the above examples we delete the records that were added later and keep the first one. But since this is an address table, the last address is likely to be the correct one, so we need to delete all duplicate records except the last one. To achieve this we make a small change in our ROW_NUMBER() query and use Order By AddressID DESC:
 ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country 
  Order By addressId DESC) [ranked] 
What if there is no ID column? An ID column is not needed to delete duplicate records, but without one we cannot guarantee which record in each group will be deleted.
If there is a column like CreationDate or DateUpdate, we can order by it to choose which records to delete.
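As a sketch of that idea, the query below keeps the newest row in each group by ordering on a date column. Note that CreationDate is hypothetical here: the sample Address table has no such column, so this will not run against it as is.

```sql
-- CreationDate is a hypothetical column, used for illustration only.
WITH ranked_records AS (
    SELECT *,
        ROW_NUMBER() OVER (PARTITION BY CustomerID, Address, City, Zip, Country
                           ORDER BY CreationDate DESC) AS ranked
    FROM Address
)
-- ranked = 1 is the most recently created row in each group; everything else is removed.
DELETE FROM ranked_records
WHERE ranked > 1;

-- With no usable ordering column at all, ORDER BY (SELECT NULL) in place of
-- ORDER BY CreationDate DESC still satisfies the syntax, but which row in each
-- group receives ranked = 1 is then arbitrary.
```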
Another way:
Let's say we want to delete all the records where address, city and zip are the same in the address table. We can use an inner join of the table to itself on the required columns, like this:
DELETE A1
From Address A1
Inner Join Address A2 ON A2.City = A1.City
    AND A2.Address = A1.Address
    AND A2.Zip = A1.Zip
Where A1.AddressID > A2.AddressID
We can also use a subquery like this:
DELETE A1
From Address A1
Where Exists (Select 1 From Address A2 
    Where A2.City = A1.City
    AND A2.Address = A1.Address
    AND A2.Zip = A1.Zip
    AND A1.AddressID > A2.AddressID)
But both of these techniques can be slow. If you change the first line of the join query from Delete A1 to Select A1.*, you will see why. Suppose you have 3 records with matching address, city and zip. The join on those columns pairs each of the 3 records in A1 with each of the 3 records in A2, so 9 combinations are formed before the AddressID comparison filters them. With 100,000 matching records that is 100,000 × 100,000 = 10,000,000,000 combinations, which is huge, while the previous approach with partition by only ever returns as many records as there are in the table.
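Before choosing the join approach, it may help to see how large the duplicate groups actually are. The query below is a rough check, not part of the original post; the aliases group_size and join_pairs are made up for illustration.

```sql
-- For each group of matching Address, City and Zip, show how many rows it has
-- and how many row pairs a self-join on those columns forms before any
-- AddressID comparison is applied.
SELECT Address, City, Zip,
       COUNT(*) AS group_size,
       CAST(COUNT(*) AS BIGINT) * COUNT(*) AS join_pairs  -- BIGINT avoids overflow on large groups
FROM Address
GROUP BY Address, City, Zip
HAVING COUNT(*) > 1
ORDER BY group_size DESC;
```

With the sample data each of the three groups has a group_size of 3 and a join_pairs of 9.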
It's up to you which one is best for your requirements.
You may ask: if there is no column that tells us which record is older or newer, how can we order the records and delete the older or newer ones? There is no way. The question itself says there is no way to know which record is older and which one is newer.
