Exploratory Data Analysis (EDA) has six main practices: discovering, structuring, cleaning, joining, validating, and presenting. This post discusses the third practice, cleaning. EDA is not a step-by-step process you follow like a recipe; it is iterative and non-sequential.
A dirty dataset can be challenging. Data may at first seem structured, but it can be hard to determine whether any data points are measurably different from the others. These extreme observations, the ones that stand out from the rest, are known as outliers: observations that lie an abnormal distance from other values or from the overall pattern of a data population. Data professionals should be aware of the high, low, and mid values of each numerical column in a data frame.
No statistical theory cleanly separates outliers from non-outliers. There are, however, rules of thumb for how far from the bulk of the data an observation needs to be before it is identified as an outlier.
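One common rule of thumb is Tukey's fences: flag anything more than 1.5 times the interquartile range (IQR) beyond the first or third quartile. A minimal sketch, using hypothetical data:

```python
import numpy as np

def iqr_outliers(values, k=1.5):
    """Flag values beyond k * IQR from the quartiles (Tukey's fences)."""
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 sits far from the bulk of the data
print(iqr_outliers(data))  # [95]
```

The multiplier `k=1.5` is convention, not theory; some practitioners use 3.0 for "extreme" outliers.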
There are three different types of outliers we will discuss: global, contextual, and collective outliers.
Global outliers (also called “point anomalies”):
Global outliers are values that are completely different from the overall data group and have no association with any other outliers. They may be inaccuracies, typographical errors, or simply extreme values you rarely see in a dataset. Sometimes the outlier is fairly obvious. Typically, global outliers should be removed before building a predictive model.
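One simple way to flag global outliers numerically is a z-score: how many standard deviations a point sits from the mean. A sketch with hypothetical sensor readings (the threshold of 3 is a common convention, lowered here because the sample is small):

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag points more than `threshold` standard deviations from the mean."""
    arr = np.asarray(values, dtype=float)
    z = (arr - arr.mean()) / arr.std()
    return arr[np.abs(z) > threshold].tolist()

# Hypothetical readings with one typo-like extreme value.
readings = [10] * 9 + [1000]
print(zscore_outliers(readings, threshold=2.5))  # [1000.0]
```

Note that the mean and standard deviation are themselves pulled toward the outlier, which is why quartile-based rules are often preferred for skewed data.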
Contextual (conditional) outliers:
Contextual outliers can be harder to identify. They are normal data points under certain conditions but become anomalies under most other conditions. As an example, movie sales are expected to be much larger when a film is first released. A huge spike in sales a decade later would typically be considered abnormal, a contextual outlier. These outliers are more common in time-series data. Another example is a value that is an outlier only within a single category of the data.
Collective outliers:
Collective outliers are a group of abnormal points that follow similar patterns and are isolated from the rest of the population. Consider a mall parking lot: a full lot after the mall has closed would be a collective outlier. A local event at the mall could explain the cars parked there after hours. One useful way to find all three types of outliers in your data is visualization.
Handling Outliers
Once you’ve detected outliers in your dataset—whether global, contextual, or collective—how do you handle them? In EDA, there are essentially three main ways to handle outliers: delete, reassign, or leave them in.
Delete. If you are sure the outliers are mistakes, typos, or errors and the dataset will be used for modeling or machine learning, then you are more likely to decide to delete outliers. Of the three choices, you’ll use this one the least.
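When deletion is warranted, it is usually a one-line filter. A sketch with a hypothetical column containing an obvious data-entry error:

```python
import pandas as pd

df = pd.DataFrame({"height_cm": [170, 165, 180, 999, 175]})  # 999 is a typo

# Drop rows outside a plausible range before modeling.
cleaned = df[df["height_cm"].between(100, 250)].reset_index(drop=True)
print(len(cleaned))  # 4
```

The plausible range here (100 to 250 cm) is domain knowledge, not statistics; document the rule you applied so the deletion is reproducible.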
Reassign. If the dataset is small and/or the data will be used for modeling or machine learning, you are more likely to choose to derive new values to replace the outliers. Data imputation is the substitution of an estimated value, one that is as realistic as possible, for a missing or problematic (outlier) data item. The substituted value is intended to enable subsequent data analysis to proceed.
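One common reassignment strategy is to flag outliers with the IQR rule and replace them with the column median. A minimal sketch on hypothetical price data:

```python
import pandas as pd

df = pd.DataFrame({"price": [10.0, 12.0, 11.0, 500.0, 13.0]})

# Flag values beyond 1.5 * IQR, then impute the column median.
median = df["price"].median()
q1, q3 = df["price"].quantile([0.25, 0.75])
iqr = q3 - q1
mask = (df["price"] < q1 - 1.5 * iqr) | (df["price"] > q3 + 1.5 * iqr)
df.loc[mask, "price"] = median

print(df["price"].tolist())  # [10.0, 12.0, 11.0, 12.0, 13.0]
```

Capping at a percentile (winsorizing) is a related option when you want to limit outliers' influence without pretending they were typical values.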
Leave. For a dataset that you plan to do EDA/analysis on and nothing else, or for a dataset you are preparing for a model that is resistant to outliers, it is most likely that you are going to leave them in.
Visually, some of the best ways to identify the presence of outliers in data are box plots and histograms.
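Both plots are a few lines with matplotlib. A sketch on synthetic data with two injected global outliers (the filename and figure layout are arbitrary choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; writes to a file
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
data = np.append(rng.normal(50, 5, 200), [120, 130])  # two global outliers

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.boxplot(data)        # outliers appear as points past the whiskers
ax1.set_title("Box plot")
ax2.hist(data, bins=30)  # outliers appear as isolated bars on the far right
ax2.set_title("Histogram")
fig.savefig("outliers.png")
```

In the box plot the two injected values fall well past the upper whisker; in the histogram they show up as a detached cluster far from the main mass of the distribution.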