Cleaning Data for Analysis


This entry is part 2 of 5 in the series Data Cleaning

Pre-Cleaning Steps

Before you clean your data, there are three pre-cleaning steps you should take. Data analysts perform these pre-cleaning activities to determine and maintain data integrity.

  • Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data.
  • Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.
  • Know when to stop collecting data.
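
The first step above can be made concrete with a small check. The following is a minimal sketch, not a standard method: it assumes the data is a list of dictionaries, and the function name `integrity_report` and the fields `id`, `city`, and `sales` are hypothetical. It measures completeness per field and counts exact duplicate rows as a rough signal of consistency.

```python
def integrity_report(rows, required_fields):
    """Summarize completeness and duplication for a list of dict records."""
    total = len(rows)
    # Completeness: share of rows with a non-empty value for each field.
    completeness = {}
    for field in required_fields:
        filled = sum(1 for row in rows if row.get(field) not in (None, ""))
        completeness[field] = filled / total if total else 0.0
    # Consistency signal: count rows that exactly repeat an earlier row.
    seen = set()
    duplicates = 0
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return {"rows": total, "completeness": completeness, "duplicates": duplicates}


# Hypothetical sample data, for illustration only.
sample = [
    {"id": 1, "city": "Austin", "sales": 120.0},
    {"id": 2, "city": "", "sales": 95.5},
    {"id": 1, "city": "Austin", "sales": 120.0},
    {"id": 3, "city": "Dallas", "sales": None},
]
report = integrity_report(sample, ["id", "city", "sales"])
print(report)
```

A report like this does not measure accuracy, which usually requires comparing the data against a trusted reference.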

ChatGPT

I asked ChatGPT this question: how does a data analyst clean data? This was the response:

A data analyst typically cleans data by following these steps (I split ChatGPT’s first step into two steps):

  1. Identifying and handling missing data.
  2. Handling incorrect data.
  3. Removing duplicates.
  4. Correcting inconsistent data (e.g. format, type, etc.).
  5. Handling outliers.
  6. Removing irrelevant or unnecessary information.
  7. Transforming data into a suitable format for analysis.
  8. Verifying the accuracy and completeness of the cleaned data.
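
Several of the steps above can be sketched in plain Python. This is one possible illustration under stated assumptions, not the method ChatGPT or any particular tool uses: the record fields (`id`, `name`, `amount`, `note`) and the function names are hypothetical, rows with a missing or non-numeric amount are simply dropped, and outliers are flagged with the common 1.5 × IQR rule rather than removed.

```python
from statistics import quantiles


def clean_records(rows):
    """Apply steps 1-4, 6, and 7 to a list of dict records."""
    cleaned, seen_ids = [], set()
    for row in rows:
        row_id = row.get("id")
        # Step 3: skip rows whose id is absent or was already kept.
        if row_id is None or row_id in seen_ids:
            continue
        # Steps 1, 2, 7: skip missing or non-numeric amounts,
        # and convert the rest to float for analysis.
        try:
            amount = float(row.get("amount"))
        except (TypeError, ValueError):
            continue
        # Step 4: normalize whitespace and capitalization in text.
        name = str(row.get("name") or "").strip().title()
        # Step 6: keep only the fields the analysis needs.
        cleaned.append({"id": row_id, "name": name, "amount": amount})
        seen_ids.add(row_id)
    return cleaned


def flag_outliers(values, k=1.5):
    """Step 5: return the values outside the k * IQR fences."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


# Hypothetical raw data, for illustration only.
raw = [
    {"id": 1, "name": "  alice ", "amount": "10", "note": "x"},
    {"id": 2, "name": "BOB", "amount": 12},
    {"id": 2, "name": "BOB", "amount": 12},       # duplicate
    {"id": 3, "name": "carol", "amount": None},   # missing
    {"id": 4, "name": "dave", "amount": "n/a"},   # incorrect
    {"id": 5, "name": "erin", "amount": 11.0},
    {"id": 6, "name": "frank", "amount": 13},
    {"id": 7, "name": "grace", "amount": 500},
]
cleaned = clean_records(raw)
outliers = flag_outliers([r["amount"] for r in cleaned])

# Step 8: verify the cleaned result before analysis.
assert len({r["id"] for r in cleaned}) == len(cleaned)
print(cleaned)
print(outliers)
```

Dropping rows is only one way to handle missing or incorrect values. Depending on the objective, an analyst might instead fill them in or go back to the source, and whether a flagged outlier is an error or a real observation is a judgment call.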

What are the types of dirty data? Duplicate, outdated, incomplete, incorrect (inaccurate), and inconsistent data. These are the five types reported by Google’s Data Analytics Certificate course.

Insufficient Data

Insufficient data has one or more of the following problems:

  • Comes from only one source
  • Continuously updates and is incomplete
  • Is outdated
  • Is geographically limited
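
Some of the problems listed above can be checked mechanically. The sketch below is an illustration only: the fields `source`, `region`, and `updated`, the function name `insufficiency_flags`, and the one-year age threshold are all assumptions, and "geographically limited" is approximated as "only one region present". It does not check whether the data is still being updated.

```python
from datetime import date


def insufficiency_flags(rows, today, max_age_days=365):
    """Return which of the listed problems a dataset appears to have."""
    sources = {row["source"] for row in rows}
    regions = {row["region"] for row in rows}
    newest = max(row["updated"] for row in rows)
    return {
        # Comes from only one source.
        "single_source": len(sources) == 1,
        # Is outdated: the newest record is older than the threshold.
        "outdated": (today - newest).days > max_age_days,
        # Is geographically limited: only one region is represented.
        "geographically_limited": len(regions) == 1,
    }


# Hypothetical records, for illustration only.
survey = [
    {"source": "crm", "region": "TX", "updated": date(2021, 3, 1)},
    {"source": "crm", "region": "TX", "updated": date(2021, 6, 15)},
]
flags = insufficiency_flags(survey, today=date(2023, 1, 1))
print(flags)
```

What counts as "outdated" or "limited" depends on the business objective, so the threshold and the region rule should be set per project.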

A data analyst facing insufficient data could request more time to find more data. Alternatively, they could perform the analysis with proxy data (similar data that stands in for what is missing) and tell the stakeholder that a proxy was used. Insights might still be gleaned from that exercise. An algorithm (data model) could also be developed.
