Exploratory Data Analysis (EDA) has six main practices: discovering, structuring, cleaning, joining, validating, and presenting. This post discusses the third practice, cleaning. Note that EDA is not a step-by-step process you follow like a recipe; it is iterative and non-sequential.
What Do We Mean By Missing?
A missing value is a value that is not stored for a variable in a dataset, often encoded as N/A, NaN, or a blank. This is not the same as a data point of zero, which may be a missing value or a legitimate data point.
You might find that a dataset has its own way of signaling a missing value, such as “\N” in the IMDB dataset. You might also know that SQL Server uses NULL to signal a missing value. Perhaps the field is simply blank, or it appears to be blank but actually contains a single space. Detecting these cases is a little more involved than it first appears.
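When reading a file, you can tell pandas up front which markers to treat as missing. A minimal sketch, using a small made-up CSV where “\N” and a single space stand in for missing values (the na_values parameter of read_csv handles both):

```python
import io
import pandas as pd

# Made-up CSV: "\N" and a lone space both signal a missing year
csv_data = 'title,year\nMovie A,1999\nMovie B,\\N\nMovie C, \n'

# na_values tells pandas to convert those markers to NaN on import
df = pd.read_csv(io.StringIO(csv_data), na_values=['\\N', ' '])

print(df['year'].isna().sum())  # 2
```

Without the na_values argument, both markers would survive as ordinary strings and silently turn the column into an object dtype.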
Below is some Python code that counts the number of missing values in each column of a DataFrame, called df in the example below. This is a great piece of code to run in your exploratory data analysis (EDA).
df.isnull().sum()
Instead of isnull() you could use isna(); the two methods are aliases of each other.
On a Series (a pandas column), there is a convenient attribute called hasnans. It returns a boolean telling you whether or not the series contains any remaining NaN values.
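As a quick illustration with a small made-up Series:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# hasnans returns True when any NaN is present
print(s.hasnans)           # True

# After dropping the missing values, it returns False
print(s.dropna().hasnans)  # False
```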
Note that if a numeric column contains a missing value, pandas will import that column as float64 even if the data exists as integers in the original source. (The nullable Int64 dtype avoids this upcast.)
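You can see this upcast with a tiny made-up Series:

```python
import pandas as pd

# Integers plus a missing value: pandas upcasts the column to float64
s = pd.Series([1, 2, None])
print(s.dtype)      # float64

# The nullable Int64 dtype keeps integer values alongside missing ones
s_int = pd.Series([1, 2, None], dtype='Int64')
print(s_int.dtype)  # Int64
```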
Whenever you’re dealing with numbers, keep in mind that pandas can’t represent a missing value (NaN) as an integer. Converting a column that contains NaN to an integer dtype will raise an exception. So, before you begin any kind of conversion operation, it’s usually a good idea to make sure the column is full of values, for example by filling or dropping the missing ones.
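A minimal sketch of both the failure and the fix, using a made-up Series (filling with 0 here is just for illustration; choose a fill value that makes sense for your data):

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, np.nan, 3.0])

# Converting a column containing NaN to int raises a ValueError
try:
    s.astype(int)
except ValueError as exc:
    print('Conversion failed:', exc)

# Fill (or drop) the missing values first, then convert
converted = s.fillna(0).astype(int)
print(converted.tolist())  # [1, 0, 3]
```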
# Check for missing values in a dataframe called df
print('Total count of missing values:', df.isna().sum().sum())

# Display missing values per column in the dataframe
print('Missing values per column:')
df.isna().sum()
Start with .isna() to get Booleans indicating whether each value in the data is missing. Then, use .any(axis=1) to get Booleans indicating whether there are any missing values along the columns in each row. Finally, use .sum() to get the number of rows that contain missing values.
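The steps above can be sketched with a small made-up DataFrame:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, np.nan, 3],
                   'b': [np.nan, np.nan, 6]})

# .isna() marks each missing value, .any(axis=1) flags rows with at
# least one missing value, and .sum() counts those rows
rows_with_missing = df.isna().any(axis=1).sum()
print(rows_with_missing)  # 2
```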
Drop the Rows with Missing Values
How would you drop the rows with missing values and save the resulting pandas DataFrame to a variable named df_subset?
df_subset = df.dropna(axis=0).reset_index(drop=True)
Note that in the above code, the resetting of the index is optional.
Removing Rows
I have another post called Filter out missing data in Pandas. If you decide to remove rows, it’s recommended to store the results in a new DataFrame.
Imputation
You could substitute values in for the missing data, using another column in the dataset to decide what those substitute values should be. As an example, have a look at the blog Data Imputation of Age, which shows this with the Titanic dataset.
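A minimal sketch of that idea, using made-up Titanic-style columns (Pclass and Age are assumptions here, and median-per-group is just one possible imputation choice):

```python
import numpy as np
import pandas as pd

# Made-up data: fill each missing Age with the median Age of its Pclass
df = pd.DataFrame({
    'Pclass': [1, 1, 2, 2, 3, 3],
    'Age':    [38.0, np.nan, 30.0, 26.0, np.nan, 22.0],
})

# transform('median') returns a per-row median aligned with df's index,
# so fillna substitutes the right group median into each gap
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))

print(df['Age'].tolist())  # [38.0, 38.0, 30.0, 26.0, 22.0, 22.0]
```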
Visualize Missing Data with Seaborn
Udemy’s course called Python for Data Science and Machine Learning Bootcamp has a video called 90. Logistic Regression with Python Part 1 where the instructor shows how to do this. His code uses train instead of df, and he’s using the Titanic dataset from Kaggle.
import seaborn as sns

sns.heatmap(df.isnull(), yticklabels=False, cbar=False, cmap='viridis')