Filter Out Missing Data in Pandas


There are a few ways to filter out missing data in Python. Suppose we are working with pandas. You could do this manually, however you are more likely to consider using dropna() instead. The dropna() function by default removes any rows with any missing values in any of the columns.

With DataFrame objects you will be asking yourself if you want to drop rows or columns that are all NA, or only those rows or columns that have at least one NA. By default, dropna() drops any row containing a missing value. If you pass how=”all” to dropna(), like this dropna(how=”all”) then dropna() will drop only rows that are all NAs.

Suppose you have a dataset stored in Python as a DataFrame. It’s called df. Your goal is to filter the DataFrame. You want to remove all rows of the DataFrame where a particular column is missing data. Suppose the column is called “Sales”. How you you remove all rows in the dataset from which Sales data is missing?

df = df.dropna(subset = ['Sales'], axis = 0)

We must use subset in the code to point to the Sales column of data, not any or all columns of data that have missing values. Axis = 0 says that we are dropping rows. If the axis were 1, we would be asking it to drop columns.

We can use inplace = True instead of re-assigning the results to df. We can look at the results as a test of the code if we want. We would not use inplace = True and we would not reassign it back to df.

What if it’s not a particular column you are concerned with? Suppose you want to drop all rows that have a null in any column. In the code below I have added the optional reset_index().

df_subset = df.dropna(axis=0).reset_index(drop=True)

View All Rows with Nulls

Let’s create a very small dataset with a few missing values. Then we’ll look at all of the rows that have nulls. We’ll see that the code does not include cases where the text is simply an empty string. An empty string is not considered to be null.

# manually create a very simple DataFrame
data = {'firstname': ['Bob', None, 'Suzie', 'Rowan', ''],
       'amount': [12, 67, None, 41, 66]}
df2 = pd.DataFrame(data)
df2

rows_with_nulls = df2[df2.isnull().any(axis=1)]
rows_with_nulls

Visualize Missing Data

You can visualize missing data with seaborn. Check out the post called EDA Discovering with Visuals.

Leave a Reply