Exploratory Data Analysis (EDA) has six main practices: discovering, structuring, cleaning, joining, validating and presenting. This post discusses the third practice, cleaning. EDA is not a step-by-step process you follow like a recipe; it is iterative and non-sequential.
Do you notice that some of the numbers in your dataset (data table) seem to be incorrect? Are they mistakes? Can you do some investigation to fix them? How can you handle this situation?
Data imputation is one way to deal with this problem. Data imputation is the substitution of an estimated value that is as realistic as possible for a missing or problematic data item. The substituted value is intended to enable subsequent data analysis to proceed.
Example
Suppose a column in your table has a few negative numbers. If the column represents the sale of a product or service, a negative value could be a legitimate return of an item. But what if the column represents the duration of a service, or the distance between two cities? In those cases negative numbers do not make sense. If there are not too many negative numbers in your dataset, you might choose to impute a value of zero for all negative numbers in that column.
Python and Pandas
Suppose you have a pandas DataFrame called df with a column called duration. How would you replace all negative numbers with zero?
    # Impute duration values less than 0 with 0 in the dataset df.
    df.loc[df['duration'] < 0, 'duration'] = 0
    df['duration'].min()
For a comparison with SQL, have a look at the post called SQL Update Statement.
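To try the idea end to end, here is a minimal, self-contained sketch. The sample values are made up for illustration, and the names negative_mask, negative_count, negative_share and original are hypothetical helpers, not part of the original example. It first counts the negative values, since the decision to impute zero depends on there not being too many of them, and then applies the same .loc assignment.

```python
import pandas as pd

# Hypothetical sample data; the values are invented for illustration only.
df = pd.DataFrame({"duration": [12.5, -3.0, 7.0, -0.5, 20.0]})

# Keep a copy of the raw column so the two approaches can be compared.
original = df["duration"].copy()

# Count the negative values first to judge whether imputing zero is reasonable.
negative_mask = df["duration"] < 0
negative_count = int(negative_mask.sum())
negative_share = float(negative_mask.mean())
print(f"{negative_count} negative values ({negative_share:.0%} of rows)")

# Impute zero for every negative value in the column.
df.loc[negative_mask, "duration"] = 0

# An alternative that should give the same result: clip at a lower bound of zero.
clipped = original.clip(lower=0)

print(df["duration"].min())
```

The clip(lower=0) call is one possible shorthand for the same imputation. The .loc form may be easier to adapt if you later want to impute something other than zero, such as the column median.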
High Outliers
Suppose you see that the maximum value of one of your columns is very extreme. You decide to impute some of the values by “capping” them at an upper limit. A common convention sets that limit at Q3 + (1.5 * IQR), where Q3 is the third quartile and IQR is the interquartile range (Q3 minus Q1, the first quartile). However, you can use whatever multiplier you want in place of the 1.5 in the formula.
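As a rough sketch of capping in pandas, the example below computes the upper limit from the quartiles and caps anything above it. The sample values, the duration_capped column and the variable names (q1, q3, iqr, multiplier, upper_limit) are assumptions made for illustration. It also relies on the default quantile interpolation in pandas, which may give slightly different quartiles than other tools.

```python
import pandas as pd

# Hypothetical sample data with one extreme value; invented for illustration.
df = pd.DataFrame(
    {"duration": [10.0, 12.0, 11.0, 13.0, 12.0, 14.0, 11.0, 95.0]}
)

# First and third quartiles, and the interquartile range between them.
q1 = df["duration"].quantile(0.25)
q3 = df["duration"].quantile(0.75)
iqr = q3 - q1

# 1.5 is the conventional multiplier; change it to make the cap looser or stricter.
multiplier = 1.5
upper_limit = q3 + multiplier * iqr

# Cap values above the limit, leaving all other values unchanged.
# Writing to a new column keeps the raw values available for comparison.
df["duration_capped"] = df["duration"].clip(upper=upper_limit)

print(upper_limit)
print(df)
```

Keeping the capped values in a separate column is a design choice in this sketch, not a requirement. You could overwrite the original column instead, in the same way as the negative-number example.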