Exploring Data in R


In the second phase of the data analytics life cycle, you prepare your data. This involves gathering the data and exploring it. Exploring your data, known as exploratory data analysis (EDA), is how you get to know its rows and columns. Do you have the data you need to answer the original business problem or opportunity? After you explore the data, you will want to clean it. Cleaning belongs to the Process phase of the data analytics life cycle.

Which functions can you use to explore data in R? Before you can explore a dataset, you need to load it. You can practice on datasets such as Palmer Penguins (penguins), diamonds, msleep, or mtcars.
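As a rough sketch of the loading step: mtcars comes with base R, while the other practice datasets live in add-on packages. The package names here are assumptions not stated above (penguins ships in palmerpenguins; diamonds and msleep ship in ggplot2), and the sketch assumes both packages are already installed.

```r
# mtcars is part of base R's datasets package, so it is always available
data("mtcars")

# Assumption: ggplot2 and palmerpenguins are already installed
# (run install.packages(c("ggplot2", "palmerpenguins")) if they are not)
library(ggplot2)         # provides diamonds and msleep
library(palmerpenguins)  # provides penguins

# Check the size of each dataset: number of rows, then number of columns
dim(mtcars)
dim(diamonds)
dim(msleep)
dim(penguins)
```

Once a package is loaded with library(), its datasets can be used by name like any other data frame.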

You can view the data in the RStudio viewer with the command View(dataset), where you replace dataset with the name of your dataset.

We’ll need to install some packages: here, skimr, and janitor. After a package has finished installing, we still need to load it with the library() function. We’ll do this in the RStudio console. The here package makes referencing files easier.

install.packages("here")
library("here")

Do the same for skimr and janitor. R is case-sensitive, so write the package names in lowercase when you run the code.
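For skimr and janitor, the same two steps might look like the sketch below. It mirrors the here example and assumes you are working in the RStudio console with an internet connection.

```r
# Install once per machine (downloads the packages from CRAN)
install.packages("skimr")
install.packages("janitor")

# Load in every new R session before using the packages
library(skimr)
library(janitor)
```

Installing only has to happen once, but library() has to be run again each time you start a new R session.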

The skimr package makes summarizing data really easy and lets you skim through it more quickly. The janitor package has functions for cleaning data. Also make sure the dplyr package is installed and loaded, since we are going to use some of its functions.

Several functions give us a quick look at a data frame: skim_without_charts(), head(), glimpse(), and select(). To run skim_without_charts(), you first need to install and load skimr. The head() function returns the first 6 rows. Note the lowercase h: R is case-sensitive, so Head() will not work.

The clean_names() function in the janitor package automatically makes the column names unique and consistent, so that they contain only letters, numbers, and underscores.
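Here is a minimal sketch of skim_without_charts() and clean_names() in use. It assumes skimr, janitor, and palmerpenguins are installed; the messy data frame and its column names are made up purely for illustration.

```r
library(skimr)
library(janitor)
library(palmerpenguins)  # assumption: installed; provides penguins

# Summary statistics for every column, without the small inline histograms
skim_without_charts(penguins)

# Hypothetical data frame with inconsistent column names
messy <- data.frame(
  "Bill Length (mm)" = c(39.1, 39.5),
  "Island.Name" = c("Torgersen", "Biscoe"),
  check.names = FALSE  # keep the messy names exactly as typed
)

# clean_names() returns a copy of the data frame with tidied column names
tidy <- clean_names(messy)
names(messy)
names(tidy)
```

Comparing names(messy) with names(tidy) shows what clean_names() changed. The data itself is left untouched.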

select Function

If we only want to see certain columns from our dataset, we can use the select() function from dplyr. Suppose we just want to see the species column in the penguins data frame. In SQL this would be SELECT species FROM penguins. In RStudio, we can put the following code in an R script.

penguins %>%
  select(species)

If we want all columns except species, we put a minus sign in front of species.

penguins %>%
  select(-species)

To see more than one column, list the column names separated by commas.

penguins %>%
  select(species, island)

head Function

With large datasets, such as diamonds with its 53,940 rows, we can use the head function to take a quick look at the data itself. By default it returns the first six rows.
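A short sketch of head() on diamonds, assuming the ggplot2 package (which provides the diamonds dataset) is installed. The variable names first_six and first_ten are just for illustration.

```r
library(ggplot2)  # assumption: installed; provides the diamonds dataset

# With no extra arguments, head() returns the first six rows
first_six <- head(diamonds)

# The n argument sets how many rows to return
first_ten <- head(diamonds, n = 10)

first_six
nrow(first_ten)
```

The full dataset is not changed; head() only returns a smaller copy to look at.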

str Function

The str (structure) function displays the column names, the data type of each column, and the first few values in each column.
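As a quick illustration, here is str() on mtcars, which comes with base R, so nothing has to be installed. The choice of mtcars and of the first three columns is only for the example.

```r
# One line per column: name, data type, and the first few values
str(mtcars)

# The same view limited to the first three columns
str(mtcars[, 1:3])
```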

colnames Function

The colnames function returns the column names of a data frame as a character vector.
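A small sketch using mtcars from base R. The variable name cols is only for illustration.

```r
# Store the column names as a character vector
cols <- colnames(mtcars)
cols

# How many columns are there?
length(cols)

# Is there a column called mpg?
"mpg" %in% cols
```

Because the result is an ordinary character vector, it can be counted, searched, or sorted like any other vector.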

glimpse Function

The glimpse function, which is available once dplyr is loaded, gives you the number of rows and columns, the names of the columns, their data types, and the first few values in each column.
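For example, assuming dplyr is installed, glimpse() can be tried on mtcars, which comes with base R:

```r
library(dplyr)  # assumption: installed; makes glimpse() available

# Row and column counts first, then one line per column
glimpse(mtcars)
```

The output is similar to str(), but each column is trimmed to fit the width of the console.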
