SQL Server Data Validation Part 2


This entry is part 2 of 3 in the series SQL Server Data Validation

2. Fruit Name and Fruit Code – Case 1

This post is a continuation of SQL Server Data Validation Introduction..

We need to ensure that there is a one-to-one relationship between the fruit name and the fruit code. By one-to-one we mean that if you were to query the fruit name Apples you must return only one result, namely 101 in our example. Below is a screenshot from Case One of the previous post. In this table all of the data is good. It is okay that the Fruit Variety Code as more than on 09 because it exists in two different Fruit Types.

In the screenshot below we can see that we have a problem with the data. The table below is called [FruitsErrorFruitCode]. If we query for Apples we can get two different results: 101 and 938. What query can we use to identify rows that fail to pass this test? Of course when we look at the following table we can see that the fruit code 938 for the third row of Apples is probably a mistake. It should be 001. We need to identify a query that will show us all of these types of errors assuming we had thousands of rows. This query uses a common table expression.

Here is the SQL code which will show duplicates (errors) of the Fruit Code. Below the code listing there is a screenshot of the results as they appear in SSMS.

with cte as(
-- unique combinations of...
select FruitName,FruitCode
from [dbo].[FruitsErrorFruitCode]
group by FruitName,FruitCode
)  
select FruitName,FruitCode
from cte
WHERE FruitName IN -- show duplicates of Fruit Name
    -- to show errors in Fruit Code with a correct Fruit Name
    (SELECT FruitName
    FROM cte
    GROUP BY FruitName
    HAVING (Count(*) > 1))
order by FruitName;

Let’s break this down. First of all we could just look at unique combinations of the Fruit Name and the Fruit Code with a GROUP BY query, as shown below. This is fine. The only thing is that we don’t need to see the data that is correct. We only want to see the data that is a problem so that we can fix it. This is why we use the SQL code above. It takes all of the unique combinations and then only shows us row where we have more that one row for the Fruit Name column. It shows duplicates.

-- unique combinations of...
select FruitName,FruitCode
from [dbo].[FruitsErrorFruitCode]
group by FruitName,FruitCode

We also should run the query on duplicates of Fruit Codes. Again, we start with all of the unique combinations of Fruit Name and Fruit Code and then show only duplicates of the Fruit Code.

with cte as(
-- unique combinations of...
select FruitName,FruitCode
from [dbo].[FruitsErrorFruitCode]
group by FruitName,FruitCode
)  
select FruitName,FruitCode
from cte
WHERE FruitCode IN -- show duplicates
    (SELECT FruitCode
    FROM cte
    GROUP BY FruitCode
    HAVING (Count(*) > 1))
order by FruitCode;

The above query does not give us any result rows! Why not? This is because we are looking for duplicates of the FruitCode. To fully test the one-to-one relationship between Fruit Name and Fruit Code we need to run two sets of queries.

Misspelled Fruit Name

Consider the following table called FruitsErrorName. We can see that one of the Apples rows has a misspelling of Apples as Apple. How do we find all potential misspellings of the Fruit Name? We look for duplicates of Fruit Code after we get all of the unique combinations of Fruit Name and Fruit Code as we do in the following query.

with cte as(
-- unique combinations of...
select FruitName,FruitCode
from [dbo].[FruitsErrorName]
group by FruitName,FruitCode
)  
select FruitName,FruitCode
from cte
WHERE FruitCode IN -- show duplicates of Fruit Code
    -- to show errors in Fruit Name
    (SELECT FruitCode
    FROM cte
    GROUP BY FruitCode
    HAVING (Count(*) > 1))
order by FruitCode;

The results of the query are shown below in the screen shot.

Conculsion – Run Two Queries for Each One-To-One Relationship

We need to run at least 2 queries for each one-to-one relationship we have in our table to ensure that we can identify all of the errors. Only the errors will be shown. If we have an error in the Fruit Name and the Fruit Code in the same row, we may need to use our eyeballs on a simple GROUP BY query and scan down the list (after sorting it to make it easier). You can use this code in a non-production server, after you change the Table1, Col_1 and Col_2 to your actual names.

-- a general form of checking for errors in a 
-- one-to-one relationship within a table.
-- change Table1, Col_1 and Col_2 to the actual names.
with cte as(
select Col_1,Col_2
from [dbo].[Table1]
group by Col_1,Col_2
)  
select Col_1,Col_2
from cte
WHERE Col_2 IN -- show duplicates of Col_2
    -- to show errors in Col_1
    (SELECT Col_2
    FROM cte
    GROUP BY Col_2
    HAVING (Count(*) > 1))
order by Col_2;

with cte as(
select Col_1,Col_2
from [dbo].[Table1]
group by Col_1,Col_2
)  
select Col_1,Col_2
from cte
WHERE Col_1 IN -- show duplicates of Col_1
    -- to show errors in Col_2
    (SELECT Col_1
    FROM cte
    GROUP BY Col_1
    HAVING (Count(*) > 1))
order by Col_1;
Series Navigation<< SQL Server Data Validation IntroductionSQL Server Data Validation Part 3 >>