EDA Cleaning with Pandas


This entry is part 1 of 8 in the series Pandas EDA Cleaning

Exploratory Data Analysis (EDA) has six main practices: discovering, structuring, cleaning, joining, validating and presenting. This post discusses the third practice, cleaning. EDA is not a step-by-step process you follow like a recipe; it is iterative and non-sequential.

Cleaning deals with bad data (such as misspellings), mixed data types, missing data, duplicate data and outliers. Misspellings matter most in categorical data, such as groups, industry labels and types of things, where a typo or a second word for the same group splits one category into several. These variants need to be consolidated into a single label.
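As a minimal sketch of consolidating category labels, the snippet below assumes a made-up DataFrame with an "industry" column. The column name, the values and the corrections mapping are illustrative assumptions, not data from this series.

```python
import pandas as pd

# Hypothetical data: "industry" mixes casing, stray whitespace,
# a misspelling and two different words for the same group.
df = pd.DataFrame({
    "company": ["A", "B", "C", "D", "E"],
    "industry": ["Software", "software ", "Sofware", "Tech", "Retail"],
})

# Step 1: normalize whitespace and casing so trivial variants collapse.
normalized = df["industry"].str.strip().str.lower()

# Step 2: map the remaining known variants to one canonical label.
# This mapping is an assumption you would build after inspecting the data.
corrections = {"sofware": "software", "tech": "software"}
df["industry"] = normalized.replace(corrections)

# Inspect the result: each group should now appear under a single label.
print(df["industry"].value_counts())
```

Running `value_counts()` before and after the fix is a quick way to spot labels that still need attention.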

All of these involve decisions that might or might not result in the removal of rows of data, which will affect the outcomes of your analysis.
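To make that effect concrete, here is a small sketch that tracks how the mean changes as rows are removed. The "revenue" column and its values are made up, and the 1.5 × IQR rule is only one common convention for flagging outliers, chosen here as an assumption.

```python
import pandas as pd

# Hypothetical column with mixed types: numbers stored as text
# plus one entry that is not a number at all.
df = pd.DataFrame({"revenue": ["100", "250", "n/a", "400", "10000"]})

# Convert to numeric; values that cannot be parsed become NaN.
df["revenue"] = pd.to_numeric(df["revenue"], errors="coerce")

# Decision 1: drop rows with a missing revenue.
no_missing = df.dropna(subset=["revenue"])

# Decision 2: drop outliers using 1.5 * IQR fences (an illustrative rule).
q1 = no_missing["revenue"].quantile(0.25)
q3 = no_missing["revenue"].quantile(0.75)
iqr = q3 - q1
in_range = no_missing["revenue"].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
no_outliers = no_missing[in_range]

# Compare row counts and means at each stage.
print(len(df), len(no_missing), len(no_outliers))
print(no_missing["revenue"].mean(), no_outliers["revenue"].mean())
```

Keeping the intermediate DataFrames around, rather than overwriting `df`, makes it easy to report how many rows each decision removed.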

Removing duplicates is considered to be part of the data cleaning process. Check out the post called Duplicate Rows in Pandas.
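A minimal sketch of finding and dropping fully duplicated rows, using a made-up DataFrame (the column names and values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical data: the third row repeats the first one exactly.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cy"],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# duplicated() marks every repeat after the first occurrence as True.
dup_mask = df.duplicated()
print(df[dup_mask])

# drop_duplicates() keeps the first occurrence of each row by default.
deduped = df.drop_duplicates().reset_index(drop=True)
print(deduped)
```

Both methods accept a `subset` argument if only some columns should define what counts as a duplicate.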

Cleaning also considers categorical data, label encoding (each category is assigned a unique number instead of a qualitative value), and one-hot encoding (a data transformation technique that turns one categorical variable into several binary variables). Cleaning also looks at dummy variables, data ethics and data governance.
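The two encodings can be sketched in pandas as follows. The "size" column is a made-up example, and using category codes for label encoding is one possible approach, not the only one.

```python
import pandas as pd

# Hypothetical categorical column.
df = pd.DataFrame({"size": ["small", "large", "medium", "small"]})

# Label encoding: each category gets a unique integer.
# With plain strings, pandas orders the categories alphabetically,
# so the codes do not reflect small < medium < large.
df["size_label"] = df["size"].astype("category").cat.codes

# One-hot encoding: one binary (dummy) column per category.
one_hot = pd.get_dummies(df["size"], prefix="size")

print(df)
print(one_hot)
```

If the order of the categories matters, an ordered `pd.Categorical` with an explicit category list is a safer choice for label encoding than the default alphabetical order.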

Data professionals use input validation to ensure data is complete, error-free and high-quality. This data validation is primarily performed in the Analyze phase of Google’s PACE workflow.
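As a rough sketch of what such checks can look like after cleaning, the helper below counts a few common problems. The function name `validation_report`, the column names and the set of allowed labels are all hypothetical and would depend on your own data.

```python
import pandas as pd

def validation_report(df, allowed_industries):
    """Count remaining problems in a cleaned DataFrame (illustrative helper)."""
    return {
        # Total number of missing cells across all columns.
        "missing_values": int(df.isna().sum().sum()),
        # Number of rows that repeat an earlier row exactly.
        "duplicate_rows": int(df.duplicated().sum()),
        # Number of industry labels outside the expected set.
        "unexpected_labels": int((~df["industry"].isin(allowed_industries)).sum()),
    }

# Hypothetical cleaned data.
df = pd.DataFrame({
    "company": ["A", "B", "C"],
    "industry": ["software", "retail", "retail"],
    "revenue": [100.0, 250.0, 400.0],
})

report = validation_report(df, allowed_industries={"software", "retail"})
print(report)
```

A report of all zeros suggests the checks passed; any non-zero count points back to a cleaning step that needs another pass.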
