Outliers in Pandas


This entry is part 5 of 8 in the series Pandas EDA Cleaning

Exploratory Data Analysis (EDA) has six main practices: discovering, structuring, cleaning, joining, validating and presenting. This post discusses the third practice, cleaning. EDA is not a step-by-step process you follow like a recipe; it’s iterative and non-sequential.

There are three kinds of outliers.

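  1. Global outliers: values that sit far from the rest of the dataset
  2. Contextual outliers: values that are only unusual in a particular context, such as a specific time of year
  3. Collective outliers: a group of values that is unusual together, even if each value looks normal on its own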

In your initial exploratory data analysis, you might want to use describe() to get some descriptive statistics on your data. A minimum or maximum that sits far from the quartiles is a hint that outliers may be present.
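As a minimal sketch of that first check, the snippet below calls describe() on a single numeric column. The DataFrame is an assumption for illustration; it mirrors the small dataset used in the example later in this post.

```python
import pandas as pd

# Illustrative data (assumed here): the same five rows used later in this post.
df = pd.DataFrame({
    'firstname': ['Bob', 'Sally', 'Suzie', 'Rowan', 'Bart'],
    'amount': [32, 37, 33, 78, 29],
})

# describe() returns count, mean, std, min, quartiles and max for a numeric column.
summary = df['amount'].describe()
print(summary)

# Compare the maximum against the 75th percentile to spot a suspicious gap.
print('max:', summary['max'], '75%:', summary['75%'])
```

In this data the maximum is roughly double the 75th percentile, which is the kind of gap worth a closer look.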

Another way to check for outliers is to create a boxplot. As a simple example, suppose you have a DataFrame called df with a column called amount. You can use seaborn to create a boxplot of that column.

Some models are more sensitive to outliers than others. Deciding whether to remove outliers may depend on the model you are using.

Very Simple Example

Here is a very small dataset that I created manually in a Python project called Outliers Boxplot.

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
data = {'firstname': ['Bob', 'Sally', 'Suzie', 'Rowan', 'Bart'],
       'amount': [32, 37, 33, 78, 29]}
df = pd.DataFrame(data)

# Create a boxplot to visualize distribution of `amount` and detect any outliers.
# We can see that Rowan is an outlier just by looking at the original small dataset.
plt.figure(figsize=(4,1.5))
plt.title('Boxplot to detect outliers for amount', fontsize=12)
plt.xticks(fontsize=10)
plt.yticks(fontsize=10)
sns.boxplot(x=df['amount'])
plt.show()

Below is the screenshot from Jupyter Notebook.
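To confirm numerically what the boxplot shows, you can compute the whisker limits yourself. The sketch below assumes seaborn's default whisker setting of 1.5 times the interquartile range (IQR); the variable names are my own and the DataFrame is rebuilt so the snippet runs on its own.

```python
import pandas as pd

# Rebuild the same small dataset so this snippet is self-contained.
df = pd.DataFrame({
    'firstname': ['Bob', 'Sally', 'Suzie', 'Rowan', 'Bart'],
    'amount': [32, 37, 33, 78, 29],
})

# First and third quartiles of the amount column.
q1 = df['amount'].quantile(0.25)
q3 = df['amount'].quantile(0.75)
iqr = q3 - q1

# Values beyond 1.5 * IQR from the quartiles are drawn as outlier points.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Keep only the rows that fall outside the limits.
outliers = df[(df['amount'] < lower_limit) | (df['amount'] > upper_limit)]
print(outliers)
```

For this dataset the limits work out to 24.5 and 44.5, so Rowan's amount of 78 is the only value flagged.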
