Delete duplicate records from table is the very common question and task we need in our day to day life, so we will see how easily we can delete duplicate records from a table by using subquery and using partition by function supported by Sql Server 2005 and later version.
Say the table name is Address, having following records
AddressID CustomerID Address City Zip Country
1 100 A B 1 USA2 100 A B 1 USA3 111 A B 1 USA4 101 B C 2 USA5 101 B C 2 USA6 112 B C 2 USA7 103 C D 3 USA8 103 C D 3 USA9 113 C D 3 USA
So If we will check duplicate records on the basis of CustomerID, Address, City, Zip and Country columns then record 1 and 2 is same, 4 and 5 is same, 7 and 8 is same. How to check this in Sql server, execute fullotin query
select *,
ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country
Order By addressId) [ranked] from address
Records with “ranked” > 1 is the duplicate record. So how to delete them, simple, we will delete all those records where ranked > 1, in this case record with address Id 2, 5 and 8 should be deleted, right? So let’s write the query to check
With ranked_records AS(
select *,
ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country
Order By addressId) [ranked]
from address)
select * from ranked_recordswhere ranked > 1
And here is the result:
AddressID CustomerID Address City Zip Country ranked2 100 A B 1 USA 2
5 101 B C 2 USA 2
8 103 C D 3 USA 2
Means we are getting correct records, so change our last line to delete the record rather than select
select * from ranked_records where ranked > 1 To
delete ranked_records where ranked > 1
Up to now we checked duplicate records on columns Customer ID, Address, City, Zip and Country, but what if we want to check duplicate records on the basis of only three columns (Address, City and Zip)? Don't worry, in this case we need a small change in our query, only remove columns from partition by, let's see this:
select *,
ROW_NUMBER() OVER(Partition By address, city, zip
Order By addressId) [ranked] from address
and here is the result:
AddressID CustomerID Address City Zip Country Ranked
1 100 A B 1 USA 1
2 100 A B 1 USA 2
3 111 A B 1 USA 3
4 101 B C 2 USA 1
5 101 B C 2 USA 2
6 112 B C 2 USA 3
7 103 C D 3 USA 1
8 103 C D 3 USA 2
9 113 C D 3 USA 3
It shows record number 1, 2 and 3 are duplicate, 4, 5, and 6 are duplicate and 7, 8 and 9 are also duplicate. To delete duplicate records we will use ranked column where ranked > 1 as we done in our previous example.
In all the above examples we delete those records which were added latter and keep the first records but as we are using the address table so last address will be the correct one so we need to delete all duplicate records except the last one. To achieve this we need to make a small change in our Row_number() query, so let’s change it, we will use Order By AddressID DESC:
ROW_NUMBER() OVER(Partition By customerId, address, city, zip, country
Order By addressId DESC) [ranked]
If there is not Id column then how we can delete duplicate records? ID column is not needed to delete duplicate records but in that case we cannot guarantee which record will be deleted.
But if there is any column like CreationDate or DateUpdate etc. we can order our record to delete.
Another Way;
Lets say we want to delete all those records where address, city and zip is same in address table, then we will use the inner join on same table on the basis of required columns, see this
DELETE A1From Address A1Inner Join Address A2 ON A2.City = A1.City
AND A2.Address = A1.Address
AND A2.Zip = A1.Zip
Where A1.AddressID > A2.AddressID
We can also use subquery like this
DELETE A1From Address A1Where Exists (Select 1 From Address A2
Where A2.City = A1.City
AND A2.Address = A1.Address
AND A2.Zip = A1.Zip
AND A1.AddressID > A2.AddressID),
But these both technique is slow, if you will change this query's first line Delete A1 to Select A1.*, you will know why. Suppose you have 3 records with matching address, city and zip, when you will do the join all the 3 records will join From A1 to all the 3 records of A2 so 9 row will come in result, you can think if there is 100,000 matching records then it will create 100,000,000,000 records which is quite huge, while in our previous example where we used partition by, will always return the only those much records which we have in our table.
it's up to you which one will be best for your requirements.
You may ask if there is no column which can help to get the older or latest record then how we can order and delete older or newer records? No way, question itself says there is no way to know which record is older and which one is newer.
No comments:
Post a Comment