Exploratory Data Analysis Overview


Exploratory Data Analysis (EDA) is a very important process for the data analytics professional because you cannot do anything without first understanding the data, and EDA gives you that understanding. I’ve been guilty of rushing through this phase, only to have to go back and look more closely. One way to start is to fully document the data dictionary. Doing so means looking at the data in various forms, such as raw values, tables and charts.

You need to understand what types of data exist, and you will want to understand each of the variables. What format is the data in? Is it narrow or wide? Is the data categorical or numerical? If it is numerical, is it continuous or discrete? What do the distributions look like? Are there any correlations and, if so, what do they look like? Can we plot them in a scatter plot? Are there any extreme outliers, and why? Do we have any missing data? Do we have bad data, such as negative lengths of time or distance?
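Several of these questions can be answered with a few lines of pandas. The sketch below is only an illustration: the `trips` table, its column names and its values are made up, and the 1.5 × IQR rule is just one common rule of thumb for flagging possible outliers, not the only option.

```python
import pandas as pd

# Hypothetical trip data; column names and values are invented for illustration.
trips = pd.DataFrame({
    "trip_id": [1, 2, 3, 4, 5, 6],
    "vehicle": ["car", "bike", "car", "bus", "bike", "car"],
    "distance_km": [5.0, 2.5, -1.0, 12.0, None, 300.0],
    "duration_min": [15, 12, 8, 40, 10, 35],
})

# Categorical or numerical? Inspect the type of each column.
print(trips.dtypes)

# Missing data: number of null values per column.
missing_counts = trips.isna().sum()

# Bad data: a distance cannot be negative.
bad_rows = trips[trips["distance_km"] < 0]

# Distributions: summary statistics for the numeric columns.
print(trips.describe())

# Possible outliers, using the 1.5 * IQR rule of thumb.
q1 = trips["distance_km"].quantile(0.25)
q3 = trips["distance_km"].quantile(0.75)
iqr = q3 - q1
is_outlier = (trips["distance_km"] < q1 - 1.5 * iqr) | (trips["distance_km"] > q3 + 1.5 * iqr)
outliers = trips[is_outlier]

# Correlation between two numeric variables (rows with missing values are skipped).
corr = trips["distance_km"].corr(trips["duration_min"])
```

Flagging a row is not the same as deleting it. The "why" behind an outlier or a negative value still has to be investigated before deciding what to do with it.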

That’s a lot of data science and data analytics to think about, but it can be learned if we break it down into its pieces and look at each piece before putting it all together.

Exploratory data analysis, or EDA, is the process of investigating, organizing and analyzing data sets and summarizing their main characteristics, often employing data wrangling and visualization methods.

The six main practices of EDA are discovering, structuring, cleaning, joining, validating and presenting. Are you working in pandas? We have a series of posts that cover these using pandas. The first post is EDA Discovery with Pandas.

Exploratory data analysis (EDA) is not a step-by-step process you follow. Instead, the six practices of EDA are iterative (repetitive) and non-sequential (“forward and back again”). Data scientists expect to perform the practices of EDA multiple times on a dataset before they feel comfortable declaring it “clean” and ready for modeling or machine learning algorithms. In that sense the approach resembles agile rather than waterfall. The underlying idea is that people often do not know in advance what they will find. Every dataset is different. People can miss something and they can make mistakes, so previous steps need to be reviewed and updated. Even yes-no survey questions need to have a “don’t know” and a “refused” response.

The discovering step involves looking at the data, learning what each column represents (perhaps from a data dictionary) and observing how many rows of data there are. You will also need to understand the data types and the data’s source. The other practices are defined by what you do to the data: deleting duplicate rows is considered cleaning, while adding a new calculated column is structuring. Data visualizations are used throughout EDA.
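As a rough pandas sketch of these three actions, assume a small made-up `orders` table (the table, its columns and its values are hypothetical):

```python
import pandas as pd

# Hypothetical order data containing one exact duplicate row.
orders = pd.DataFrame({
    "order_id": [101, 102, 102, 103],
    "quantity": [2, 1, 1, 5],
    "unit_price": [9.99, 24.50, 24.50, 3.00],
})

# Discovering: how many rows and columns, which types, and a first look.
n_rows, n_cols = orders.shape
print(orders.dtypes)
print(orders.head())

# Cleaning: drop rows that are exact duplicates of an earlier row.
deduped = orders.drop_duplicates().reset_index(drop=True)

# Structuring: add a new calculated column.
deduped["total"] = deduped["quantity"] * deduped["unit_price"]
```

Working on `deduped` rather than overwriting `orders` keeps the original data available, which helps when an earlier step has to be revisited.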

Structuring

Structuring helps you to organize, gather, separate, group, and filter your data in different ways to learn more about it. First on the list of structuring methods is sorting, which is the process of arranging data into a meaningful order. Another structuring method is filtering, which is the process of selecting a smaller part of your dataset (certain rows) based on specified parameters and using it for viewing or analysis. Grouping, sometimes called bucketizing, is aggregating individual observations of a variable into groups. It’s essential that you do not change the meaning of the data while performing your filtering, sorting, slicing, joining, and merging operations.
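Here is a minimal pandas sketch of sorting, filtering and grouping. The `customers` table, the age cut-offs and the bucket labels are all made up for illustration.

```python
import pandas as pd

# Hypothetical customer data.
customers = pd.DataFrame({
    "name": ["Ana", "Ben", "Cy", "Dee", "Eli"],
    "age": [23, 35, 47, 62, 31],
    "spend": [120.0, 80.0, 200.0, 50.0, 150.0],
})

# Sorting: arrange rows in a meaningful order (highest spend first).
by_spend = customers.sort_values("spend", ascending=False)

# Filtering: select only the rows that meet a condition.
over_30 = customers[customers["age"] > 30]

# Grouping (bucketizing): place each age into a labeled range,
# then aggregate spend within each range.
bucketed = customers.assign(
    age_group=pd.cut(
        customers["age"],
        bins=[0, 30, 50, 120],
        labels=["under 30", "30-49", "50+"],
        right=False,  # intervals are [0, 30), [30, 50), [50, 120)
    )
)
avg_spend = bucketed.groupby("age_group", observed=True)["spend"].mean()
print(avg_spend)
```

Each operation returns a new object and leaves `customers` untouched, which is one simple way to avoid changing the meaning of the original data by accident.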

Python

Discovery using Python.

Structuring with Python.

R Language

In R, in the discovery phase, you can use glimpse() (from dplyr) or skim_without_charts() (from skimr). You can also use head() and summary(). To get the number of rows and columns, use dim(), which is very similar to the shape attribute in pandas (Python).

