Data Structuring & Cleaning with Mike


This entry is part 4 of 5 in the series Data Cleaning

What are the steps for cleaning your data in a data analysis project? This article combines a few sources of information, so I called it “with Mike”. I have another post called EDA Cleaning with Pandas. In that article, I break EDA down into six parts. The six main practices of EDA are discovering, structuring, cleaning, joining, validating and presenting.

Before you clean your data, you’ll do a few things first. You want to know the purpose of your project and who the stakeholders are. This article assumes you are using Python as your programming language; you might be using Jupyter Notebook as your programming environment. If you can get hold of a data dictionary, do so. It will be very helpful.

The items in the list below link to Python code examples, most of which use the pandas library. A few minimal sketches of these steps also follow the list.

  1. Import the data (reading files)
  2. Initial Exploratory Data Analysis (EDA)
  3. Drop any Columns we Don’t Need
  4. Rename Columns as necessary (reorder if necessary)
  5. Data Types
  6. Check the numerical data ranges (describe)
  7. Uniqueness constraints (are there any duplicates?)
  8. Check Outliers (statistics and boxplots)
  9. Remove Bad Characters in Text Columns (remove leading and trailing spaces, remove non-alphanumeric characters)
  10. Explore the Dependent Variable
  11. Are the categorical columns consistent? (correct categories, correct spelling)
  12. Text length is within limits
  13. Text data has consistent formatting (phone numbers, postal codes, etc.)
  14. Numeric Unit Uniformity (numbers are in same units – money, temperature etc.)
  15. Datetime Uniformity (mm-dd-yyyy or dd-mm-yyyy etc.)
  16. Crossfield Validation (check calculations in calculated columns)
  17. Missing Data
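
Below are a few minimal sketches of the steps above, roughly in order. The file names and column names are made up for illustration; swap in your own. First, importing the data and taking a quick first look (steps 1 and 2):

```python
import pandas as pd

# Read the raw file into a DataFrame ("sales.csv" is a made-up file name)
df = pd.read_csv("sales.csv")

# Initial EDA: size, dtypes and non-null counts, first rows, summary stats
print(df.shape)       # (rows, columns)
df.info()             # column names, dtypes, non-null counts
print(df.head())      # first five rows
print(df.describe())  # basic statistics for the numeric columns
```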
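
Dropping columns you don’t need, renaming and reordering the rest, and fixing data types (steps 3 to 5) might look like this; the tiny DataFrame is only there so the snippet runs on its own:

```python
import pandas as pd

# Made-up data so the snippet is self-contained
df = pd.DataFrame({
    "internal_id": [1, 2],
    "Cust Name": ["Ann", "Bob"],
    "Amt": ["10.50", "20.00"],
    "Order Date": ["2024-01-05", "2024-02-10"],
})

# 3. Drop a column we don't need
df = df.drop(columns=["internal_id"])

# 4. Rename columns, then reorder by selecting them in the order we want
df = df.rename(columns={"Cust Name": "customer_name",
                        "Amt": "amount",
                        "Order Date": "order_date"})
df = df[["customer_name", "order_date", "amount"]]

# 5. Fix data types: amount as float, order_date as datetime
df["amount"] = df["amount"].astype(float)
df["order_date"] = pd.to_datetime(df["order_date"])
print(df.dtypes)
```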
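
Checking the numeric ranges with describe and looking for duplicate rows (steps 6 and 7):

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["Ann", "Bob", "Bob"],
                   "amount": [10.5, 20.0, 20.0]})

# 6. Numeric ranges: the min and max can reveal impossible values
print(df["amount"].describe())

# 7. Uniqueness: count exact duplicate rows, then drop them
print(df.duplicated().sum())
df = df.drop_duplicates()
```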
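
For outliers (step 8), one common approach is the 1.5 * IQR rule plus a boxplot; the cut-off you choose will depend on your data:

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({"amount": [10, 12, 11, 13, 250]})  # 250 looks suspicious

# Flag values more than 1.5 * IQR outside the quartiles
q1 = df["amount"].quantile(0.25)
q3 = df["amount"].quantile(0.75)
iqr = q3 - q1
outliers = df[(df["amount"] < q1 - 1.5 * iqr) | (df["amount"] > q3 + 1.5 * iqr)]
print(outliers)

# A boxplot shows the same thing visually
df.boxplot(column="amount")
plt.show()
```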
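
Removing bad characters from text columns (step 9) usually means stripping leading and trailing spaces and dropping anything non-alphanumeric:

```python
import pandas as pd

df = pd.DataFrame({"customer_name": ["  Ann ", "Bob!!", "C@rol  "]})

# Strip leading/trailing spaces, then remove non-alphanumeric characters
df["customer_name"] = df["customer_name"].str.strip()
df["customer_name"] = df["customer_name"].str.replace(r"[^A-Za-z0-9 ]", "",
                                                      regex=True)
print(df)
```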
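
Exploring the dependent variable and checking that categorical columns are consistent (steps 10 and 11); churned and region are placeholder column names:

```python
import pandas as pd

df = pd.DataFrame({"churned": ["yes", "no", "no", "yes"],
                   "region": ["North", "north", "Nrth", "South"]})

# 10. Dependent variable: class balance matters for any later modelling
print(df["churned"].value_counts(normalize=True))

# 11. Categorical consistency: look for wrong categories and misspellings
print(df["region"].unique())
df["region"] = df["region"].str.title().replace({"Nrth": "North"})
print(df["region"].unique())
```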
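
Checking text length limits and enforcing consistent formatting (steps 12 and 13), here with made-up postal codes:

```python
import pandas as pd

df = pd.DataFrame({"postal_code": ["K1A0B1", "K1A 0B1", "12345678901"]})

# 12. Text length within limits: flag anything longer than expected
print(df[df["postal_code"].str.len() > 7])

# 13. Consistent formatting: remove spaces and upper-case the codes
df["postal_code"] = df["postal_code"].str.replace(" ", "").str.upper()
print(df)
```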
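
Numeric unit uniformity and datetime uniformity (steps 14 and 15), assuming a temperature column recorded in a mix of Fahrenheit and Celsius and dates stored as mm-dd-yyyy:

```python
import pandas as pd

df = pd.DataFrame({"temp": [72.0, 21.0, 68.0],
                   "temp_unit": ["F", "C", "F"],
                   "order_date": ["01-05-2024", "02-10-2024", "03-15-2024"]})

# 14. Unit uniformity: convert the Fahrenheit rows to Celsius
is_f = df["temp_unit"] == "F"
df.loc[is_f, "temp"] = (df.loc[is_f, "temp"] - 32) * 5 / 9
df["temp_unit"] = "C"

# 15. Datetime uniformity: parse everything with one explicit format
df["order_date"] = pd.to_datetime(df["order_date"], format="%m-%d-%Y")
print(df)
```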
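
Finally, cross-field validation and missing data (steps 16 and 17); in this made-up example the calculated column is total = quantity * unit_price:

```python
import pandas as pd

df = pd.DataFrame({"quantity": [2, 3, 1],
                   "unit_price": [5.0, 4.0, None],
                   "total": [10.0, 11.0, 7.0]})

# 16. Cross-field validation: does quantity * unit_price match total?
print(df[df["quantity"] * df["unit_price"] != df["total"]])

# 17. Missing data: count nulls per column, then decide to fill or drop
print(df.isna().sum())
df["unit_price"] = df["unit_price"].fillna(df["total"] / df["quantity"])
print(df)
```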