Data Cleaning Introduction


This entry is part 1 of 5 in the series Data Cleaning

As a data analyst or data scientist, you want to analyze data that is clean, not dirty. In the data analytics life cycle, the third step is Model Planning, or Process, depending on which model you follow. After you have decided on the data you need to meet your objectives and have gathered the data, you need to clean it.

Clean data is incredibly important for effective analysis. Clean data is complete, correct, and relevant to the problem you’re trying to solve. Dirty data is incomplete, incorrect, or irrelevant to the problem you’re trying to solve.

Data engineers transform data into a useful format for analysis and give it a reliable infrastructure. This means they develop, maintain, and test databases, data processors, and related systems. Data warehousing specialists develop processes and procedures to effectively store and organize data. They make sure that data is available, secure, and backed up to prevent loss.

For the data analyst, cleaning data and the pre-cleaning steps are critically important. Internal data that’s been verified and cared for by your company’s data engineers and data warehouse team is more likely to be clean, but you should still check it before you analyze it.

Dirty Data

What is dirty data? Answering this will help us understand how to clean it.
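
To make the definition concrete, here is a minimal sketch in Python that flags records as incomplete or incorrect. The function name classify_row, the sample records, and the rule that an age must be a whole number from 0 to 120 are all assumptions made up for illustration; they are not taken from any tool mentioned in this series.

```python
def classify_row(row, required_fields, validators):
    """Return a list of problems found in one record (empty list means clean)."""
    problems = []

    # Completeness: every required field must be present and non-blank.
    for field in required_fields:
        value = row.get(field)
        if value is None or str(value).strip() == "":
            problems.append(f"incomplete: {field} is missing")

    # Correctness: each validator is a function that returns True for a valid value.
    for field, is_valid in validators.items():
        value = row.get(field)
        if value is None or str(value).strip() == "":
            continue  # blank values are covered by the completeness check, or the field is optional
        if not is_valid(value):
            problems.append(f"incorrect: {field}={value!r}")

    return problems


# Hypothetical sample records, invented for this example.
rows = [
    {"customer": "Ana", "age": 34, "country": "PT"},
    {"customer": "", "age": 41, "country": "US"},
    {"customer": "Raj", "age": -5, "country": "IN"},
]
required = ["customer", "age"]
validators = {"age": lambda v: isinstance(v, int) and 0 <= v <= 120}

for number, row in enumerate(rows, start=1):
    print(number, classify_row(row, required, validators))
```

Relevance is harder to check mechanically because it depends on the question you are trying to answer. In practice it usually means keeping only the fields and records that your analysis needs.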

Other Information

Microsoft has an article called “Top ten ways to clean your data”. It’s written from an Excel perspective.

Google also has tips, in an article called “Top 10 tips to clean up data”.

Here is an Excel video called “Top Excel Functions for Data Analysts & What NOT to Waste Time Learning”. Data analysts using Excel should be using Power Query, and there are some Excel functions that you don’t need to know if you use Power Query instead. According to MyOnlineTrainingHub in this video, they are TRIM, LEN, ISBLANK, CONCAT, and DAYS.

Here is an article on data validation called “What is Data Validation?: It’s Working and Importance Simplified 101”. It comes from Hevo, a company that provides an end-to-end data pipeline platform.

R language

There are several tools you can use in the R programming language to clean your data. Have a look at the post called “Cleaning Data in R”.

