SQL Server Duplicates Part 1


This entry is part 1 of 2 in the series SQL Server Duplicates

When working with databases you will often need to check if there are any duplicates in your tables. To start with this we will use a simple example. Here we focus on a single column that has duplicates. Below is a table with some duplicate values for the Id column.

Shown below are two screenshots from SSMS. The first one shows the design of our table called [DuplicateCountExample] and the second one shows some data in it.

Here is our table.

SELECT Count(*) as TotalRows
FROM dbo.DuplicateCountExample

In the above query you get just one number: 5.

SELECT Id
FROM dbo.DuplicateCountExample
GROUP BY Id
HAVING (Count(*) > 1)

The result of the above query is that we get two results under the Id column: 23 and 44.

SELECT Id, FirstName
FROM dbo.DuplicateCountExample
WHERE Id IN 
    (SELECT Id
    FROM dbo.DuplicateCountExample
    GROUP BY Id
    HAVING (Count(*) > 1))
ORDER BY Id

The above query gives us the following.

Below is another query.

SELECT Id, FirstName, Count(*) as NumDup
FROM dbo.DuplicateCountExample
GROUP BY Id, FirstName
HAVING Id IN 
(SELECT Id
FROM dbo.DuplicateCountExample
GROUP BY Id
HAVING (Count(*) > 1))
ORDER BY Id

In this query we don’t see two rows for Bob. We may not need to see the record for Bob twice. Instead we get a count.

Counting the Number of Duplicates

Suppose you just need a single number that represents the number of duplicates in a single column. Why? Perhaps you have a stored procedure the needs to know if there are any duplicates and then display those duplicates to the user, present a descriptive message and raise an error. We also know that we could put a primary key on the column that does not allow for duplicates to be added, but we want to handle this differently by showing the duplicates and presenting the user with a custom error message. Also, in our case, the duplicates have already been added to the table. The approach we take here is to first count all of the rows. Next we count the number of unique rows. Then we subtract the number of unique rows from the total number of rows to get the number of duplicates. After that we show ALL of the rows that are part of the duplicates.

DECLARE @NumDups as INT;
DECLARE @TotalRows as INT;
DECLARE @TotalDistinct as INT;
SET @NumDups = 0;
SET @TotalRows = 0;
SET @TotalDistinct = 0;

SELECT @TotalRows = Count(*) FROM [dbo].[DuplicateCountExample];
SELECT @TotalDistinct = Count(*) FROM (SELECT DISTINCT Id FROM DuplicateCountExample) as abcd;
SET @NumDups = @TotalRows - @TotalDistinct;
PRINT 'Number of duplicates: ' + CAST(@NumDups AS VARCHAR(8));

IF @NumDups > 0
	BEGIN
		SELECT Id, FirstName
		FROM DuplicateCountExample
		WHERE Id
		IN ( 
			SELECT Id FROM DuplicateCountExample
			GROUP BY Id
			HAVING (Count(*) > 1)
		);
	END

Below is the results from SSMS.

Number of duplicates: 2
Id          FirstName
----------- ----------
23          Bob
44          John
23          Bob
44          Henry

(4 row(s) affected)

More Data and Order By

Let’s add a little bit more data and order the records. Here we find that there are four duplicates:

  • the second Bob
  • Henry
  • Jill
  • Mark

Below is just the changes to our T-SQL code and a listing of the results (Results to Text) in SSMS.

SELECT Id, FirstName
FROM DuplicateCountExample
WHERE Id
IN ( 
	SELECT Id FROM DuplicateCountExample
	GROUP BY Id
	HAVING (Count(*) > 1)
)
ORDER BY Id;

The results are shown below.

Number of duplicates: 4
Id          FirstName
----------- ----------
23          Bob
23          Bob
44          John
44          Henry
44          Jill
98          Jack
98          Mark

(7 row(s) affected)

Below is the changed data that we used in the above query.

Id          FirstName
----------- ----------
23          Bob
44          John
12          Sally
23          Bob
44          Henry
44          Jill
98          Jack
98          Mark

(8 row(s) affected)

Duplicates Defined with Multiple Columns

If we change our definition of what a duplicate actually is, we could say that a duplicate is a row where both the Id and the FirstName are the same. In this case, there would only be one duplicate. Our query would chage to GROUP BY Id, FirstName. The duplicate would be the row where the Id is 23 and the FirstName is Bob.

Series NavigationSQL Server Duplicates Part 2 >>