EDA Discovering with Pandas


This entry is part 1 of 3 in the series Pandas EDA Discovery

Exploratory Data Analysis has six phases: discovering, structuring, cleaning, joining, validating and presenting. The first phase is discovery. These phases do not have to happen in the prescribed order. EDA is not a step-by-step process you follow like a recipe. It’s a creative art form that takes practice. It’s iterative and non-sequential. This post will illustrate a few ideas you can use int he discovery phase in Python with the library pandas.

Initial Discovery

When you are using pandas, you will import the data into a DataFrame. Let’s assume you have named the DataFrame df. Let’s also assume you’ve imported pandas with the alias pd.

When you first encounter a dataset you’ll want to have a look at it. After importing it you could use the head() function. There is also tail(), shape, size and info(). To get some basic statistics you can use describe().

Have pandas display all of the columns. You can set the option to ensure that pandas displays all columns in the dataset. This will be important when you have several columns in the dataset.

pd.set_option('display.max_columns', None)
df.head()

Data Types

You can use dtypes and info(). The info() function will give you the data types of your current DataFrame. It’s best to change the data types if necessary. For example, if one of your columns contains date information but is listed as an object by info(), you should change the object to datetime. Objects are strings. How do you convert it? As another example, that’s discussed below, you may have a column that is a string that should really be a number. Python interpreted it as a string because it had a dollar sign before the number and an “M” at the end.

You can get a two-column list of the columns and their data types by specifying the dtypes property of the DataFrame. If your DataFrame is named df, you would use the following code.

df.dtypes

You can get the data type of a column in a pandas DataFrame with the following syntax.

df['my_column'].dtype
print('Data type of my_column:', df['my_column'].dtype)

Here’s how you might convert to datetime.

# convert the date column to datetime
df['date'] = pd.to_datetime(df['date'])

It would be nice to add a new column in the dataset that has the month.

# create a new 'month' column
df['month] = df['date'].dt.month

Let’s create a new column that contains the month as text. Note that we may also want to create a Year column.

# create a new 'month_txt' column
df['month_txt'] = df['date'].dt.month_name().string.slice(stop=3)
df.head()

Create a Helper Dataframe

To further explore the data, let’s look at the totals for each month and then plot them in a bar graph. A helper DataFrame will allow us to easily see the data and work with the data when we plot it.

# create a new helper DataFrame for easier plotting
df_by_month = df.groupby(['month','month_txt']).sum().sort_values('month', ascending=True).head(12).reset_index()

Use the sort_values() function, specifying the by parameter as the name of the column that you want to sort by and the ascending parameter as needed. Note that by default, ascending is set to True. If you want to sort in descending order, specify the ascending parameter as False. Here’s how you can use it.

df.sort_values('My Column', ascending=True).head(10)

Plot a Bar Chart

Assume that in our dataset we have a column called number. We have many dates that we’ve grouped into months. We have summed the numbers. We want to see a plot of each month, starting in January, on the x-axis and the total numbers in the y-axis. To do this we would have needed to import a few libraries. We would normally, by convention, do this at the very top of the file.

import pandas as pd
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
plt.bar(x=df_by_month['month_txt'],height=df_by_month['number'], label="Total Number for Each Month")
plt.plot()
plt.xlabel("")  # just leave these empty to show the general caee and fill in later
plt.ylabel("")
plt.title("")
plt.legend()
plt.show()

If you find that the x-axis labels don’t fit in the small space provided, there are ways to deal with this.

plt.xticks(rotation=45, horizontalalignment='right')

New Calculated Columns

These are easy to do. Above we created a new column called ‘month’. You can use assign() in pandas. We have a post on that called Pandas Assign to Add a Column.

Sampling a Large Dataset

If you have a fairly large dataset with many rows, it might be helpful to speed the EDA process by taking a subset, or random sample of the data until you are ready to work with the whole dataset.

# get 50 random rows and set the random state to a number so that when you run this
# again you will get the same data 
df_sample = df.sample(n=50, random_state=12) 

String Manipulation

You might have a column that is a string that should really be a number, as reported by info(). We need it to be a number because we will be doing math operations on that column. Python interpreted it as a string because it had a dollar sign before the number and an “M” at the end. To solve this we could use str.replace().

df['amount_millions'] = df['amount'].str.replace('$', '')
df['amount_millions'] = df['amount_millions'].str.replace('M', '')
df['amount_millions'] = df['amount_millions].astype('int')
df.head()

Duplicates

Here is our post on duplicates with pandas datasets called Duplicate Rows in Pandas.

Selecting Columns from a DataFrame

Have a look at the pandas documentation webpage How do I select a subset of a DataFrame?

Series NavigationPandas Assign to Add a Column >>

Leave a Reply