Data Structuring & Cleaning with Mike
What are the steps for cleaning your data in a data analysis project? This article combines a few sources of information, so I called it “with Mike”. I have another post called EDA Cleaning with Pandas. In that article, I break EDA down into six parts. The six main practices of EDA are discovering, structuring, cleaning, joining, validating, and presenting.
Before you clean your data, you’ll do a few things first. You want to know the purpose of your project and who the stakeholders are. This article assumes you are using Python as your programming language; you might be using Jupyter Notebook as your programming environment. If you can get hold of a data dictionary, do so. It will be very helpful.
The items in the list below all link to code examples in Python, usually code that uses the pandas library. A short sketch of the first few steps follows the list.
- Import the data (reading files)
- Initial Exploratory Data Analysis (EDA)
- Drop any Columns we Don’t Need
- Rename Columns as necessary (reorder if needed)
- Check the Data Types of the columns
- Check the numerical data ranges (describe) of the columns
- Uniqueness constraints (are there any duplicates?)
- Check Outliers (statistics and boxplots)
- Remove Bad Characters in Text Columns (remove leading and trailing spaces, remove non-alphanumeric characters)
- Explore the Dependent Variable
- Are the categorical columns consistent? (correct categories, correct spelling)
- Text length is within limits – consider empty strings and nulls
- Text data has consistent formatting (phone numbers, postal codes, etc.)
- Numeric Unit Uniformity (numbers are in same units – money, temperature etc.)
- Datetime Uniformity (mm-dd-yyyy or dd-mm-yyyy or yyyy-mm-dd)
- Crossfield Validation (check calculations in calculated columns) and/or create calculated columns
- Missing Data
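As a rough illustration, here is a minimal sketch of the first few steps in pandas. The file name sales.csv and every column name below are made up for the example:

```python
import pandas as pd

# Read the data (file name and columns are hypothetical)
df = pd.read_csv("sales.csv")

# Initial EDA: shape, data types, a few rows, numeric ranges
print(df.shape)
print(df.dtypes)
print(df.head())
print(df.describe())

# Drop a column we don't need
df = df.drop(columns=["internal_id"])

# Rename columns as necessary, then reorder them
df = df.rename(columns={"cust_nm": "customer_name"})
df = df[["order_date", "customer_name", "amount"]]
```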
From a Columns Perspective
Let’s reorganize the above list in a different way. Looking at data cleaning from a dataset (rows and columns) perspective can be a more systematic approach. Consider a single dataset, “table”, or Excel worksheet. Most of the items above work with columns, but a few work with rows.
- Drop any Columns we don’t need
- Rename Columns as necessary
- Reorder the columns as needed
- Check the Data Types of the columns
- Check the numerical data ranges (describe) of the columns
- Check Outliers (statistics and boxplots)
- Remove Bad Characters in Text Columns (remove leading and trailing spaces, remove non-alphanumeric characters)
- Are the categorical columns consistent? (correct categories, correct spelling)
- Text length is within limits – consider empty strings and nulls
- Text data has consistent formatting (phone numbers, postal codes, etc.)
- Numeric Unit Uniformity (numbers are in same units – money, temperature etc.)
- Datetime Uniformity (mm-dd-yyyy or dd-mm-yyyy or yyyy-mm-dd)
- Crossfield Validation (check calculations in calculated columns) and/or create calculated columns
- Missing Data
Remove unnecessary columns from the dataset. Locate spelling errors in words and categories, and potentially use string manipulation to fix them. Locate outliers. The sketch below illustrates a few of these column-level checks.
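Below is a minimal sketch of some of these column-level checks; the DataFrame and all of its column names are invented for illustration:

```python
import pandas as pd

# A small hypothetical frame; every column name here is made up
df = pd.DataFrame({
    "customer_name": ["  Ann Lee ", "Bo* Chan"],
    "region": ["N. East", "northeast"],
    "order_date": ["2024-01-05", "2024-01-06"],
    "quantity": [2, 3],
    "unit_price": [10.0, 20.0],
    "total": [20.0, 60.0],
})

# Remove leading/trailing spaces and non-alphanumeric characters
df["customer_name"] = df["customer_name"].str.strip()
df["customer_name"] = df["customer_name"].str.replace(r"[^A-Za-z0-9 ]", "", regex=True)

# Categorical consistency: inspect the distinct values, then standardize
print(df["region"].value_counts())
df["region"] = df["region"].str.lower().replace({"n. east": "northeast"})

# Datetime uniformity: parse to a single datetime dtype (bad values become NaT)
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# Outliers: flag values outside 1.5 * IQR
q1, q3 = df["total"].quantile([0.25, 0.75])
iqr = q3 - q1
print(df[(df["total"] < q1 - 1.5 * iqr) | (df["total"] > q3 + 1.5 * iqr)])

# Cross-field validation: recompute a calculated column and compare
print((df["total"] == df["quantity"] * df["unit_price"]).all())
```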
From a Rows Perspective
From a rows perspective, we look for missing data in one or more columns of one or more rows, and we check for duplicate rows. A sketch of both checks follows the list below.
- Uniqueness constraints (are there any duplicate rows?)
- Are there any missing rows? Do we have all the data?
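Here is a small sketch of those two row-level checks; the tiny DataFrame is fabricated for the example:

```python
import pandas as pd

# A hypothetical frame with one duplicate row and some missing data
df = pd.DataFrame({
    "order_date": ["2024-01-05", "2024-01-05", None],
    "amount": [100.0, 100.0, None],
})

# Uniqueness: count duplicate rows, then drop them
print(df.duplicated().sum())
df = df.drop_duplicates()

# Missing data: count nulls per column, then decide row by row
print(df.isna().sum())
df = df.dropna(subset=["order_date"])  # drop rows missing a critical field
df["amount"] = df["amount"].fillna(0)  # or fill where a default makes sense
```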
Data wrangling. What does this mean? Here, it means sorting and re-ordering rows, as in the sketch below.
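A quick sketch of sorting with made-up data:

```python
import pandas as pd

df = pd.DataFrame({"order_date": ["2024-02-01", "2024-01-15"], "amount": [50.0, 75.0]})

# Sort by date ascending, then amount descending, and rebuild the row index
df = df.sort_values(["order_date", "amount"], ascending=[True, False])
df = df.reset_index(drop=True)
print(df)
```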
From a Multiple-Dataset (multiple table) Perspective
Here we’ll look at multiple ‘tables’, ‘datasets’, or ‘worksheets’. We can split or combine them. There are two general ways to combine two datasets. We can join them based on a key column; those familiar with SQL will recognize this as joining. We can also concatenate (stack) datasets one on top of the other, as long as the columns have the same data types; SQL users will recognize this as UNION and UNION ALL. One use case for concatenating is transactional data where each month is stored in its own file and we need to combine them. A sketch of both approaches follows.
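Here is a minimal sketch of both approaches, using invented monthly data. pd.concat behaves like UNION ALL, and merge behaves like a SQL join:

```python
import pandas as pd

# Hypothetical monthly transaction files with identical columns: stack them
jan = pd.DataFrame({"order_id": [1, 2], "amount": [100.0, 250.0]})
feb = pd.DataFrame({"order_id": [3], "amount": [80.0]})
transactions = pd.concat([jan, feb], ignore_index=True)  # like SQL UNION ALL

# Joining on a key column, as in SQL: bring in customer details by order_id
customers = pd.DataFrame({"order_id": [1, 2, 3], "customer": ["Ann", "Bo", "Cy"]})
combined = transactions.merge(customers, on="order_id", how="left")
print(combined)
```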