- Data Cleaning Introduction
- Cleaning Data for Analysis
- Cleaning Data with Alex
- Data Structuring & Cleaning with Mike
- Loop Through pandas DataFrame
Pre-Cleaning Steps
Before cleaning data, a data analyst should take three pre-cleaning steps. These steps help you determine and maintain data integrity.
- Determine data integrity by assessing the overall accuracy, consistency, and completeness of the data.
- Connect objectives to data by understanding how your business objectives can be served by an investigation into the data.
- Know when to stop collecting data.
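As a rough illustration of the first step, the sketch below summarizes completeness, consistency, and accuracy signals for a small table using pandas. The sample data, the column names, and the `integrity_report` helper are all hypothetical, and the three measures are only one possible way to approximate data integrity.

```python
import pandas as pd

# Hypothetical sample data; the column names are assumptions for illustration.
df = pd.DataFrame({
    "customer_id": [1, 2, 2, 4],
    "country": ["US", "us", "us", None],
    "signup_date": ["2023-01-05", "2023-02-10", "2023-02-10", "not a date"],
})


def integrity_report(frame: pd.DataFrame, date_col: str) -> dict:
    """Summarize simple completeness, consistency, and accuracy signals."""
    # Values that cannot be parsed as dates become NaT instead of raising.
    parsed = pd.to_datetime(frame[date_col], errors="coerce")
    return {
        # Completeness: average share of non-missing cells across columns.
        "completeness": float(frame.notna().mean().mean()),
        # Consistency: rows that exactly repeat an earlier row.
        "duplicate_rows": int(frame.duplicated().sum()),
        # Accuracy: present values that are not valid dates.
        "unparseable_dates": int((parsed.isna() & frame[date_col].notna()).sum()),
    }


report = integrity_report(df, "signup_date")
print(report)
```

A low completeness score or a nonzero count in either of the other two fields suggests the data needs attention before analysis begins.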
ChatGPT
I asked ChatGPT this question: how does a data analyst clean data? This was the response:
A data analyst typically cleans data by following these steps (I split ChatGPT’s first step into two steps):
- Identifying and handling missing data.
- Handling incorrect data.
- Removing duplicates.
- Correcting inconsistent data (e.g. format, type, etc.).
- Handling outliers.
- Removing irrelevant or unnecessary information.
- Transforming data into a suitable format for analysis.
- Verifying the accuracy and completeness of the cleaned data.
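Several of the steps above can be sketched in pandas. This is a minimal example under stated assumptions, not a definitive workflow: the data, the column names, the 0 to 120 age range, and the choice to fill missing ages with the median are all hypothetical. The steps run in a slightly different order than the list (formats are normalized first so that duplicates can be detected reliably), and statistical outlier handling is left out.

```python
import pandas as pd

# Hypothetical raw data; every column name and value is an assumption.
raw = pd.DataFrame({
    "name": [" Ann ", "Bob", "Bob", "Cara", "Dan"],
    "age": ["34", "29", "29", "-5", None],
    "city": ["boston", "Boston", "Boston", "BOSTON", "Denver"],
    "notes": ["a", "b", "b", "c", "d"],
})


def clean(frame: pd.DataFrame) -> pd.DataFrame:
    out = frame.copy()
    # Correct inconsistent formats and types.
    out["name"] = out["name"].str.strip()
    out["city"] = out["city"].str.title()
    out["age"] = pd.to_numeric(out["age"], errors="coerce")
    # Remove duplicates (exact repeats after normalization).
    out = out.drop_duplicates()
    # Handle incorrect data: treat impossible ages as missing.
    out.loc[(out["age"] < 0) | (out["age"] > 120), "age"] = float("nan")
    # Handle missing data: fill with the median of the remaining ages.
    out["age"] = out["age"].fillna(out["age"].median())
    # Remove irrelevant information.
    out = out.drop(columns=["notes"])
    # Verify the cleaned data before returning it.
    assert out["age"].notna().all()
    assert not out.duplicated().any()
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Whether to fill, flag, or drop missing and incorrect values depends on the analysis; the median fill here is only one option.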
What are the types of dirty data? Google’s Data Analytics Certificate course reports five: duplicate, outdated, incomplete, incorrect (inaccurate), and inconsistent data.
Insufficient Data
Insufficient data has one or more of the following problems:
- Comes from only one source
- Continuously updates and is incomplete
- Is outdated
- Is geographically limited
A data analyst could request more time to find more data. Alternatively, they could perform the analysis with proxy data (similar data) and tell the stakeholder that they did so; the exercise might still yield insights. Another option is to develop an algorithm (data model).
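Some of the insufficient-data problems listed above can be flagged automatically. The sketch below is a hypothetical example: the `source`, `region`, and `recorded_at` columns, the `insufficiency_flags` helper, and the one-year staleness threshold are assumptions, not rules from the course. Whether a continuously updating dataset is incomplete cannot be judged from a single snapshot, so that problem is not checked here.

```python
import pandas as pd

# Hypothetical dataset; the column names and values are assumptions.
data = pd.DataFrame({
    "source": ["crm", "crm", "crm"],
    "region": ["Northeast", "Northeast", "Northeast"],
    "recorded_at": pd.to_datetime(["2019-03-01", "2019-06-15", "2019-09-30"]),
})


def insufficiency_flags(frame: pd.DataFrame, as_of: pd.Timestamp,
                        max_age_days: int = 365) -> dict:
    """Flag simple warning signs that the data may be insufficient."""
    newest = frame["recorded_at"].max()
    return {
        # Comes from only one source.
        "single_source": bool(frame["source"].nunique() <= 1),
        # Is geographically limited (here: only one region present).
        "geographically_limited": bool(frame["region"].nunique() <= 1),
        # Is outdated: newest record is older than the chosen threshold.
        "outdated": bool((as_of - newest).days > max_age_days),
    }


flags = insufficiency_flags(data, as_of=pd.Timestamp("2024-01-01"))
print(flags)
```

Any flag set to `True` is a prompt to look for more data or to tell the stakeholder about the limitation.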