Pandas DataFrames EDA


This entry is part 2 of 2 in the series pandas DataFrame

Perhaps you have a small dataset in an external file and you want to begin exploring it in Python with pandas. What steps do you take to explore the data? This process is called exploratory data analysis (EDA). Data scientists use EDA to analyze and investigate datasets and summarize their main characteristics, often using visualization methods. This post will not explore visualization methods.

Once you have your DataFrame, there are some things you can do right away to see what you are working with. Suppose you have named your DataFrame df. Here are a few ideas: head, tail, info, shape, dtypes, len, value_counts, describe, and sort_values. You can also use unique() to list the distinct values in a column.
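Here is a minimal sketch of those first-look calls. The DataFrame and its column names (city, sales) are made up for illustration; your own data will look different.

```python
import pandas as pd

# Hypothetical toy data, invented for this illustration only.
df = pd.DataFrame(
    {
        "city": ["Austin", "Boston", "Austin", "Denver", "Boston", "Austin"],
        "sales": [120, 95, 130, 80, 110, 125],
    }
)

print(df.head(3))    # first three rows
print(df.tail(3))    # last three rows
df.info()            # column names, non-null counts, dtypes (prints directly)
print(df.shape)      # (rows, columns); shape is a property, so no parentheses
print(df.dtypes)     # data type of each column
print(len(df))       # number of rows

# Frequency of each value in a categorical column
city_counts = df["city"].value_counts()
print(city_counts)

print(df.describe())                              # summary statistics for numeric columns
print(df.sort_values("sales", ascending=False))   # rows ordered by sales, largest first

# Distinct values in a column
unique_cities = df["city"].unique()
print(unique_cities)
```

Note that info() prints its report itself, while the other calls return objects that you print or inspect in a notebook cell.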

Discovery

  1. Open a Jupyter Notebook, give it a file name, and add a markdown description
  2. Add import statements for numpy and pandas
  3. Read in the file, perhaps with read_csv() or some other method
  4. If desired, create a DataFrame (df) copy with df = df0.copy()
  5. Use the head(10) and tail(10) methods to look at some data
  6. Use DataFrame.dtypes to get each column's name and data type
  7. Use the info() method to get metadata and the describe() method to get some summary statistics
  8. For each categorical column, use the value_counts() method
  9. Create a new DataFrame by using the sort_values() method of the DataFrame; use head() to look at it
  10. Filter the dataset with Boolean masking, perhaps on a category of interest; create a new DataFrame
  11. Get the head() of the new DataFrame and validate it with the shape property (shape is an attribute, not a method)
  12. If desired, use iloc[] to select specific rows
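The steps above, from reading the file onward, can be sketched roughly as follows. This is only an illustration: the CSV text, the column names (name, category, price) and the variable names are assumptions, and an in-memory string stands in for the external file so the example runs on its own. numpy is left out because this short sketch does not need it.

```python
import io

import pandas as pd

# Hypothetical CSV contents standing in for an external file.
csv_text = """name,category,price
apple,fruit,1.20
carrot,vegetable,0.50
banana,fruit,0.25
leek,vegetable,1.75
cherry,fruit,3.00
"""

df0 = pd.read_csv(io.StringIO(csv_text))   # step 3: read in the data
df = df0.copy()                             # step 4: work on a copy

print(df.head(10))                          # step 5: first rows
print(df.tail(10))                          #         last rows
print(df.dtypes)                            # step 6: column names and data types
df.info()                                   # step 7: metadata
print(df.describe())                        #         summary statistics
print(df["category"].value_counts())        # step 8: counts per category

# step 9: a new DataFrame sorted by price, highest first
df_sorted = df.sort_values("price", ascending=False)
print(df_sorted.head())

# step 10: Boolean mask on a category of interest
mask = df["category"] == "fruit"
df_fruit = df[mask].copy()

# step 11: look at the result and check its size
print(df_fruit.head())
print(df_fruit.shape)

# step 12: select the first two rows by position
first_two = df_fruit.iloc[0:2]
print(first_two)
```

With a real file you would pass the file path to read_csv() instead of the StringIO object.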

The first methods and properties to use on your dataset will be head(), tail(), info(), describe(), and shape. There are also sample(), size, and dtypes. The data type is very important. You may have datetimes that first come in as an object. It’s best to convert from an object to a datetime with pd.to_datetime().
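As a small sketch of that datetime conversion, assume a made-up DataFrame with an order_date column that arrives as text. The column names and values here are hypothetical.

```python
import pandas as pd

# Hypothetical data: the dates arrive as plain strings.
df = pd.DataFrame(
    {
        "order_date": ["2024-01-15", "2024-02-03", "2024-03-22"],
        "amount": [250.0, 99.5, 410.25],
    }
)
print(df.dtypes)   # order_date is not a datetime column yet

# Convert the text column to a proper datetime column
df["order_date"] = pd.to_datetime(df["order_date"])
print(df.dtypes)   # order_date now has a datetime64 dtype

# Datetime columns expose date parts through the .dt accessor
months = df["order_date"].dt.month
print(months)

print(df.sample(2, random_state=0))   # two random rows; random_state makes it repeatable
print(df.size)                        # total number of cells (rows * columns)
```

Once the column is a real datetime, you can sort by date, filter by date ranges, or pull out parts such as the month.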

More Information here at begincoding.now

For initial exploration, please look at the post EDA Discovery with Pandas.

Structuring with Pandas.

Learn with YouTube

There is a video by Alex the Analyst called Exploratory Data Analysis in Pandas | Python Pandas Tutorials.

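There are some things to look for in your DataFrame/dataset. Data types are one thing. Use dtypes to check them. What if you have a string column that really should be an integer? One common approach is to convert the column with astype(int), or with pd.to_numeric() when some values may not parse cleanly.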
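Here is a minimal sketch of converting a string column to integers. The quantity column and the messy Series are invented for illustration, and this is only one of several ways to do it.

```python
import pandas as pd

# Hypothetical column where the numbers were read in as text.
df = pd.DataFrame({"quantity": ["3", "12", "7"]})
print(df.dtypes)   # quantity is a text column at this point

# When every value is a clean integer string, astype(int) is enough
df["quantity"] = df["quantity"].astype(int)
print(df.dtypes)

# When some values may not parse, pd.to_numeric with errors="coerce"
# turns the unparseable entries into NaN instead of raising an error
messy = pd.Series(["4", "five", "6"])
cleaned = pd.to_numeric(messy, errors="coerce")
print(cleaned)
```

Because NaN is a floating-point value, the coerced Series ends up as floats rather than integers; you can decide what to do with the missing values before converting further.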

The website Data to Fish has a few examples of working with pandas. Here is one on how to sort a DataFrame.

