EDA Structuring with Pandas


This entry is part 1 of 3 in the series Pandas EDA Structuring

Exploratory Data Analysis (EDA) has six main practices. The six main practices of EDA are discovering, structuring, cleaning, joining, validating and presenting. This post discusses the second practice, structuring. EDA is not a step-by-step process you follow like a recipe. It’s iterative and non-sequential.

Structuring helps you to organize, gather, separate, group, and filter your data in different ways to learn more about it. It is very important that you do not change the meaning of the data while you are structuring it. In the cleaning phase we also look at categorical data, label encoding (each category is assigned a unique number), dummy variables, input validation, ethics and data governennce.

Value Counts

You can use value_counts() to see which values appear the most in the dataset. Suppose you have a column called my_number. You would type this df.my_number.value_counts() or df[‘My Column’].value_counts(). Notice also that the value_counts() function automatically sorts the results in descending order. To change that add this: .sort_values(ascending=False). Let’s take this a step further and see if these values are evenly distributed or we have spikes or outliers in our data.

df.my_number.value_counts()[:20].rename_axis('unique_values').reset_index(name='counts').style.background_gradient()

The colon 20 gives us the top 20. The rest is for formatting.

The value_counts() function is used to get a Series containing counts of unique values. The resulting object will be in descending order so that the first element is the most frequently-occurring element. Excludes NA values by default. value_counts only works on pandas series, not pandas dataframes. Here is an article on value_counts called 8 Python Pandas Value_counts() tricks that make your work more efficient.

Class Balance and Value_Counts

Suppose you had a column that grouped the rows into just two categories: big spender and not a big spender. The column has the number 1 for big spenders and 0 for others. You want to know the class balance of the column. In other words you want to know the percentages of each. You would use normalize=True so that the object returned will contain the relative frequencies of the unique values. The relative frequencies will be between 0 and 1 for each unique value, which you can easily convert to percentages.

df['spender'].value_counts(normalize=True)

Number of Unique Values in a Column

You can get the number of unique values in a column with this syntax: df[‘column’].nunique(). The df is our DataFrame name, columns is the name of a column in that DataFrame. This just returns a single number.

Unique Values in a Column

Have a look at the post Find Unique Values in a Column.

Sorting

How you you sort a DataFrame? Sorting data within a dataframe can be accomplished using the sort_values() function; the default value of this function is to sort in ascending order. However, you can pass an argument to the function that disables the default behavior, which will return the values in descending order.

Filtering

Suppose we have a DataFrame with a list of companies. There is a column called Industry. The code for filtering a DataFrame for a certain industry would be similar to df_fintech = df[df[‘Industry’] == ‘Fintech’]. We have a post called Filtering a pandas DataFrame for more details.

Concat

Do you need to combine two or more datasets? By combine I mean to add rows to the bottom without changing the columns. If bot datasets have the same columns, concat will work. If you have added extra columns to one of the datasets, concat can still work for you if you drop those added columns so that both datasets have exactly the same number and type of columns.

union_df = pd.concat([df.drop(['weekday','week'],axis=1), df_2], ignore_index=True)

Group

Working with dates is always a bit more challenging. What if we wanted to group by weeks. We have a date column called date in a dataset called df. First we would need to create (if we don’t yet have) a week column.

Drop Columns

How do you delete a column in a pandas DataFrame? Suppose your DataFrame is called df. Suppose your column is called Extra_Col. Here is the code: df = df.drop([‘Extra_Col’], axis=1). The axis set to one means to drop columns (not rows). Alternatively you could save the DataFrame to a new variable, say df2, thereby retaining the old DataFrame df.

Series NavigationData Type Conversion in pandas >>

Leave a Reply