Data Structuring & Cleaning with Mike


This entry is part 4 of 5 in the series Data Cleaning

What are the steps for cleaning your data in a data analysis project? This article combines a few sources of information, so I called it “with Mike”. I have another post called EDA Cleaning with Pandas. In that article, I break EDA down into six parts. The six main practices of EDA are discovering, structuring, cleaning, joining, validating and presenting.

Before you clean your data, you’ll do a few things first. You want to know the purpose of your project and who the stakeholders are. This article assumes you are using Python as your programming language; you might be using Jupyter Notebook as your programming environment. If you can get hold of a data dictionary, do so. It will be very helpful.

The items in the list below link to Python code examples, most of which use the pandas library. A few minimal sketches of these steps also follow the list.

  1. Import the data (reading files)
  2. Initial Exploratory Data Analysis (EDA)
  3. Drop any Columns we Don’t Need
  4. Rename Columns as necessary (reorder if necessary)
  5. Data Types
  6. Check the numerical data ranges (describe)
  7. Uniqueness constraints (are there any duplicates?)
  8. Check Outliers (statistics and boxplots)
  9. Remove Bad Characters in Text Columns (remove leading and trailing spaces, remove non-alphanumeric characters)
  10. Explore the Dependent Variable
  11. Are the categorical columns consistent? (correct categories, correct spelling)
  12. Text length is within limits
  13. Text data has consistent formatting (phone numbers, postal codes, etc.)
  14. Numeric Unit Uniformity (numbers are in same units – money, temperature etc.)
  15. Datetime Uniformity (mm-dd-yyyy or dd-mm-yyyy etc.)
  16. Crossfield Validation (check calculations in calculated columns)
  17. Missing Data
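
Below are a few minimal sketches of the steps above, roughly in order. The file names and column names are made up for illustration; swap in your own. First, importing the data and taking a quick first look (steps 1 and 2):

```python
import pandas as pd

# Read the raw file into a DataFrame ("sales.csv" is a made-up file name)
df = pd.read_csv("sales.csv")

# Initial EDA: size, dtypes and non-null counts, first rows, summary stats
print(df.shape)       # (rows, columns)
df.info()             # column names, dtypes, non-null counts
print(df.head())      # first five rows
print(df.describe())  # basic statistics for the numeric columns
```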
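
Dropping columns you don’t need, renaming and reordering the rest, and fixing data types (steps 3 to 5) might look like this; the tiny DataFrame is only there so the snippet runs on its own:

```python
import pandas as pd

# Made-up data so the snippet is self-contained
df = pd.DataFrame({
    "internal_id": [1, 2],
    "Cust Name": ["Ann", "Bob"],
    "Amt": ["10.50", "20.00"],
    "Order Date": ["2024-01-05", "2024-02-10"],
})

# 3. Drop a column we don't need
df = df.drop(columns=["internal_id"])

# 4. Rename columns, then reorder by selecting them in the order we want
df = df.rename(columns={"Cust Name": "customer_name",
                        "Amt": "amount",
                        "Order Date": "order_date"})
df = df[["customer_name", "order_date", "amount"]]

# 5. Fix data types: amount as float, order_date as datetime
df["amount"] = df["amount"].astype(float)
df["order_date"] = pd.to_datetime(df["order_date"])
print(df.dtypes)
```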
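
Checking the numeric ranges with describe and looking for duplicate rows (steps 6 and 7):

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["Ann", "Bob", "Bob"],
                   "amount": [10.5, 20.0, 20.0]})

# 6. Numeric ranges: the min and max can reveal impossible values
print(df["amount"].describe())

# 7. Uniqueness: count exact duplicate rows, then drop them
print(df.duplicated().sum())
df = df.drop_duplicates()
```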
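
For outliers (step 8), one common approach is the 1.5 * IQR rule plus a boxplot; the cut-off you choose will depend on your data:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"amount": [10, 12, 11, 13, 250]})  # 250 looks suspicious

# Flag values more than 1.5 * IQR outside the quartiles
q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(outliers)

# A boxplot shows the same thing visually
df.boxplot(column="amount")
plt.show()
```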
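
Removing bad characters from text columns (step 9) usually means stripping leading and trailing spaces and dropping anything non-alphanumeric:

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["  Ann ", "Bob!!", "C@rol  "]})

# Strip leading/trailing spaces, then remove non-alphanumeric characters
df["customer_name"] = df["customer_name"].str.strip()
df["customer_name"] = df["customer_name"].str.replace(r"[^A-Za-z0-9 ]", "",
                                                      regex=True)
print(df)
```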
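
Exploring the dependent variable and checking that categorical columns are consistent (steps 10 and 11); churned and region are placeholder column names:

```python
import pandas as pd

df = pd.DataFrame({"churned": ["yes", "no", "no", "yes"],
                   "region": ["North", "north", "Nrth", "South"]})

# 10. Dependent variable: class balance matters for any later modelling
print(df["churned"].value_counts(normalize=True))

# 11. Categorical consistency: look for wrong categories and misspellings
print(df["region"].unique())
df["region"] = df["region"].str.title().replace({"Nrth": "North"})
print(df["region"].unique())
```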
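
Checking text length limits and enforcing consistent formatting (steps 12 and 13), here with made-up postal codes:

```python
import pandas as pd

df = pd.DataFrame({"postal_code": ["K1A0B1", "K1A 0B1", "12345678901"]})

# 12. Text length within limits: flag anything longer than expected
print(df[df["postal_code"].str.len() > 7])

# 13. Consistent formatting: remove spaces and upper-case the codes
df["postal_code"] = df["postal_code"].str.replace(" ", "").str.upper()
print(df)
```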
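
Numeric unit uniformity and datetime uniformity (steps 14 and 15), assuming a temperature column recorded in a mix of Fahrenheit and Celsius and dates stored as mm-dd-yyyy:

```python
import pandas as pd

df = pd.DataFrame({"temp": [72.0, 21.0, 68.0],
                   "temp_unit": ["F", "C", "F"],
                   "order_date": ["01-05-2024", "02-10-2024", "03-15-2024"]})

# 14. Unit uniformity: convert the Fahrenheit rows to Celsius
is_f = df["temp_unit"] == "F"
df.loc[is_f, "temp"] = (df.loc[is_f, "temp"] - 32) * 5 / 9
df["temp_unit"] = "C"

# 15. Datetime uniformity: parse everything with one explicit format
df["order_date"] = pd.to_datetime(df["order_date"], format="%m-%d-%Y")
print(df)
```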
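
Finally, cross-field validation and missing data (steps 16 and 17); in this made-up example the calculated column is total = quantity * unit_price:

```python
import pandas as pd

df = pd.DataFrame({"quantity": [2, 3, 1],
                   "unit_price": [5.0, 4.0, None],
                   "total": [10.0, 11.0, 7.0]})

# 16. Cross-field validation: does quantity * unit_price match total?
print(df[df["quantity"] * df["unit_price"] != df["total"]])

# 17. Missing data: count nulls per column, then decide to fill or drop
print(df.isna().sum())
df["unit_price"] = df["unit_price"].fillna(df["total"] / df["quantity"])
print(df)
```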