Count the Number of Rows with a Condition


Suppose you are working with a pandas DataFrame in Python. You are doing some EDA (exploratory data analysis) and you want to know how many rows of data meet a certain condition for a single column. For example, suppose you have a DataFrame with a column called amount. You want to know how many rows in the amount column have a value of zero.

numRowsZeros = len(df[df['amount']==0])

Here is another example. Suppose we are working with the IMDB movie database data. The titles file has a column called startYear. However, some of those rows don’t have a start year. The data is missing, In those cases, IMDB puts a slash N (\N) into the row instead of the year. Since the backslash needs to be escaped, we can filter the DataFrame into another DataFrame with the following code, or something similar.

df_bad_year = df_movies[df_movies['startYear'] == '\\N'] 

The DataFrame called df_bad_year contains all of the rows that have \N in the startYear. How many rows is that? We can use shape. The shape property returns a tuple containing the shape of the DataFrame. The shape is the number of rows and columns of the DataFrame. It looks like this: (rows, columns). We can use indexing, starting with a zero to get the number of rows. We can use print() to form a coherent sentence for the reader.

print('There are {:,.0f} rows that have \\N in the startYear column.'.format(df_bad_year.shape[0]))

Leave a Reply