Data Structuring & Cleaning with Mike


This entry is part 4 of 5 in the series Data Cleaning

What are the steps to cleaning your data in a data analysis project? This article combines a few sources of information, so I called it “with Mike”. I have another post called EDA Cleaning with Pandas. In that article, I break EDA down into six main practices: discovering, structuring, cleaning, joining, validating, and presenting.

Before you clean your data, you’ll do a few things first. You want to know the purpose of your project and who the stakeholders are. This article assumes you are using Python as your programming language; you might be using Jupyter Notebook as your programming environment. If you can get hold of a data dictionary, do so. It will be very helpful.

The links in the list below all point to code examples in Python, usually code that uses the pandas library. A short sketch covering the first few steps appears after the list.

  1. Import the data (reading files)
  2. Initial Exploratory Data Analysis (EDA)
  3. Drop any Columns we Don’t Need
  4. Rename Columns as necessary (and reorder if needed)
  5. Check the Data Types of the columns
  6. Check the numerical data ranges (describe) of the columns
  7. Uniqueness constraints (are there any duplicates?)
  8. Check Outliers (statistics and boxplots)
  9. Remove Bad Characters in Text Columns (remove leading and trailing spaces, remove non-alphanumeric characters)
  10. Explore the Dependent Variable
  11. Are the categorical columns consistent? (correct categories, correct spelling)
  12. Text length is within limits – consider empty strings and nulls
  13. Text data has consistent formatting (phone numbers, postal codes, etc.)
  14. Numeric Unit Uniformity (numbers are in the same units – money, temperature, etc.)
  15. Datetime Uniformity (mm-dd-yyyy or dd-mm-yyyy or yyyy-mm-dd)
  16. Crossfield Validation (check calculations in calculated columns) and/or create calculated columns
  17. Missing Data
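
Here is a minimal sketch of the first several steps, assuming a hypothetical CSV file named sales.csv; the column names used below are also assumptions for illustration.

```python
import pandas as pd

# 1-2. Import the data and do some initial EDA.
df = pd.read_csv("sales.csv")
print(df.shape)
df.info()
print(df.head())

# 3. Drop any columns we don't need (hypothetical names).
df = df.drop(columns=["internal_id", "notes"])

# 4. Rename columns as necessary.
df = df.rename(columns={"Cust Name": "customer_name", "Amt": "amount"})

# 5. Check/fix the data types of the columns.
df["amount"] = pd.to_numeric(df["amount"], errors="coerce")
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")

# 6. Check the numerical data ranges.
print(df.describe())

# 7. Uniqueness constraints: how many duplicate rows are there?
print(df.duplicated().sum())

# 8. Check outliers with simple statistics (e.g., the IQR rule).
q1, q3 = df["amount"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(len(outliers), "potential outliers")
```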

From a Columns Perspective

Let’s reorganize the above list. Looking at data cleaning from a dataset (rows and columns) perspective may be a more systematic approach. Consider a single dataset, “table”, or Excel worksheet. Most of the items above work with columns, but a few work with rows. A sketch of several of these column-level fixes appears after the list.

  1. Drop any Columns we don’t need
  2. Rename Columns as necessary
  3. Reorder the columns as needed
  4. Check the Data Types of the columns
  5. Check the numerical data ranges (describe) of the columns
  6. Check Outliers (statistics and boxplots)
  7. Remove Bad Characters in Text Columns (remove leading and trailing spaces, remove non-alphanumeric characters)
  8. Are the categorical columns consistent? (correct categories, correct spelling)
  9. Text length is within limits – consider empty strings and nulls
  10. Text data has consistent formatting (phone numbers, postal codes, etc.)
  11. Numeric Unit Uniformity (numbers are in the same units – money, temperature, etc.)
  12. Datetime Uniformity (mm-dd-yyyy or dd-mm-yyyy or yyyy-mm-dd)
  13. Crossfield Validation (check calculations in calculated columns) and/or create calculated columns
  14. Missing Data

Remove unnecessary columns from the dataset. Locate spelling errors in words and categories, and potentially use string manipulation to fix them. Locate outliers.
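
Below is a hedged sketch of several of these column-level fixes, continuing the hypothetical sales DataFrame from the earlier example; the column names and the mapping of misspellings are assumptions for illustration.

```python
# Remove bad characters: strip leading/trailing spaces, then drop
# non-alphanumeric characters (spaces are kept here).
df["customer_name"] = (
    df["customer_name"]
    .str.strip()
    .str.replace(r"[^A-Za-z0-9 ]", "", regex=True)
)

# Categorical consistency: normalize case and map known misspellings.
df["region"] = df["region"].str.strip().str.lower()
df["region"] = df["region"].replace({"n. east": "northeast", "n east": "northeast"})
print(df["region"].unique())

# Numeric unit uniformity: convert amounts recorded in cents to dollars
# (the hypothetical "unit" column flags the original unit).
cents = df["unit"] == "cents"
df.loc[cents, "amount"] = df.loc[cents, "amount"] / 100

# Datetime uniformity: parsing with pd.to_datetime (done earlier) stores one
# canonical datetime type; the display format only matters again on output.
df["order_date_str"] = df["order_date"].dt.strftime("%Y-%m-%d")

# Crossfield validation: a calculated column should equal quantity * unit_price.
bad = df[df["amount"] != df["quantity"] * df["unit_price"]]
print(len(bad), "rows fail the crossfield check")

# Missing data: count nulls per column, then decide whether to fill or drop.
print(df.isna().sum())
df["region"] = df["region"].fillna("unknown")
```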

From a Rows Perspective

From a row perspective, find missing data in one or more columns of one or more rows, and check for duplicate rows. A short sketch follows the list below.

  1. Uniqueness constraints (are there any duplicate rows?)
  2. Are there any missing rows? Do we have all the data?
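
A minimal sketch of both row-level checks, again assuming the hypothetical sales DataFrame with an order_date column (the date range is also an assumption):

```python
# 1. Uniqueness constraints: inspect duplicate rows, then remove them.
dupes = df[df.duplicated(keep=False)]
print(len(dupes), "rows involved in duplicates")
df = df.drop_duplicates()

# 2. Missing rows: for example, verify that every month in the expected
# range is actually present in the data.
expected = pd.period_range("2023-01", "2023-12", freq="M")
present = pd.PeriodIndex(df["order_date"].dt.to_period("M").unique())
print("missing months:", list(expected.difference(present)))
```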

What about data wrangling? In this context, it means sorting the rows and re-ordering the columns.
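
For example, with the same hypothetical DataFrame:

```python
# Sort rows by date, largest amounts first within each date.
df = df.sort_values(["order_date", "amount"], ascending=[True, False])

# Re-order the columns into a preferred sequence (assumed column names).
df = df[["order_date", "customer_name", "region", "quantity", "unit_price", "amount"]]
```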

From a Multiple-Dataset (multiple table) Perspective

Here we’ll look at multiple ‘tables’ or ‘datasets’ or ‘worksheets’. We can split or combine them. There are two general ways to combine two datasets. We can join them based on a key column; those familiar with SQL will recognize this as joining. We can also concatenate (stack) datasets one on top of the other, as long as the columns have the same data types; SQL users will recognize this as UNION and UNION ALL. One use case for concatenating is when we have transactional data stored in files where each month is in its own file and we need to combine those.
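
A minimal sketch of both approaches in pandas; the file names and the customer_id key are assumptions for illustration.

```python
import pandas as pd

# Join two datasets on a key column, like a SQL LEFT JOIN.
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
joined = orders.merge(customers, on="customer_id", how="left")

# Concatenate monthly files one on top of the other, like SQL UNION ALL;
# the columns must line up in name and data type.
frames = [pd.read_csv(f) for f in ["jan.csv", "feb.csv", "mar.csv"]]
combined = pd.concat(frames, ignore_index=True)

# Dropping duplicates after concatenating approximates SQL UNION.
combined = combined.drop_duplicates()
```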
