Cleaning Data in R


Now we need to clean, standardize, and manipulate our data. We usually explore the data first and then devise a plan to clean it. In the data analytics life cycle, both exploring and cleaning fall in the Process phase. You might also need to transform your data. The objective of cleaning is to remove anything that could cause an error during analysis.

Loading Data

To practice, you could load the palmerpenguins package with library("palmerpenguins") in the console of RStudio. There are a few functions we can use to get summaries of our data frame: skim_without_charts(), glimpse(), head(), and select().

Below are four packages that help with cleaning data in R. You can load them at the console as long as you have already installed them with install.packages(). I’m using RStudio.

> library(here)
> library(skimr)
> library(janitor)
> library(dplyr)
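As a rough sketch of the summary functions mentioned above, the script below tries them on a small stand-in data frame. The name penguins_demo and its values are made up for illustration, so that the example runs even without palmerpenguins installed. The skim_without_charts() call only runs if skimr is available.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data (values are illustrative only)
penguins_demo <- data.frame(
  species = c("Adelie", "Gentoo", "Chinstrap"),
  island = c("Torgersen", "Biscoe", "Dream"),
  bill_length_mm = c(39.1, 46.1, NA)
)

# head() returns the first rows; here we ask for two
first_rows <- head(penguins_demo, 2)
print(first_rows)

# glimpse() prints one line per column with its type and first values
glimpse(penguins_demo)

# select() keeps only the columns we name
two_cols <- select(penguins_demo, species, bill_length_mm)
print(two_cols)

# skim_without_charts() gives a fuller summary, if skimr is installed
if (requireNamespace("skimr", quietly = TRUE)) {
  print(skimr::skim_without_charts(penguins_demo))
}
```

With the real dataset you would pass penguins instead of penguins_demo.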

Rename

You rename columns with the rename() function, putting the new name on the left of the equals sign and the existing name on the right. To change the name of our island column to island_new we could use the following script.

penguins %>%
  rename(island_new=island)

To convert all of your column names to upper case, try rename_with(penguins, toupper). You can also use rename_with(penguins, tolower), since we usually want our names in lower case. To ensure that the names contain only letters, numbers, and underscores, use clean_names(penguins) from the janitor package.
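Here is a minimal sketch of these renaming functions on a made-up data frame with untidy column names. The name messy and its columns are assumptions for illustration, not part of the penguins data. The clean_names() step only runs if janitor is installed.

```r
library(dplyr)

# Hypothetical data frame with spaces and punctuation in its column names
messy <- data.frame(
  "Island Name" = c("Biscoe", "Dream"),
  "Bill Length (mm)" = c(46.1, 39.5),
  check.names = FALSE  # keep the untidy names exactly as typed
)

# rename_with() applies a function to every column name
upper <- rename_with(messy, toupper)
lower <- rename_with(messy, tolower)
print(names(upper))
print(names(lower))

# rename() changes one column: new name on the left, old name on the right
renamed <- rename(messy, island = `Island Name`)
print(names(renamed))

# clean_names() standardizes all names at once, if janitor is installed
if (requireNamespace("janitor", quietly = TRUE)) {
  cleaned <- janitor::clean_names(messy)
  print(names(cleaned))
}
```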

Arrange (Order By/Sort)

Using the penguins data in RStudio, we can write the script below to sort the data by bill_length_mm in ascending order. You need to have dplyr loaded to do this, either on its own or as part of the tidyverse.

penguins %>% arrange(bill_length_mm)

Instead of using the pipe, you can pass the data frame as the first argument: arrange(penguins, bill_length_mm).

We have more information on arrange in our post called Sorting Data in R.
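As a small sketch, the script below sorts a made-up data frame in ascending and then descending order. The name bills and its values are assumptions for illustration. The desc() helper from dplyr reverses the order, and arrange() places missing values at the end in both cases.

```r
library(dplyr)

# Hypothetical data with one missing bill length
bills <- data.frame(
  species = c("Gentoo", "Adelie", "Adelie", "Chinstrap"),
  bill_length_mm = c(46.1, NA, 39.1, 49.9)
)

# Ascending order is the default
ascending <- arrange(bills, bill_length_mm)
print(ascending)

# Wrap the column in desc() for descending order
descending <- arrange(bills, desc(bill_length_mm))
print(descending)
```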

Group By

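The script below groups the penguins by island and calculates the mean bill length for each group. It uses drop_na() to remove rows with missing values, so you’ll need to load tidyr first with library("tidyr"). Have a look at the post group_by Function in R.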

penguins %>%
  group_by(island) %>%
  drop_na() %>%
  summarize(mean_bill_length_mm = mean(bill_length_mm))
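One detail worth knowing: drop_na() with no arguments removes a row if any column is missing, not just the column being summarized. The sketch below uses a made-up data frame (the name demo and its values are assumptions) to compare that with passing na.rm = TRUE to mean(), which ignores only the missing bill lengths.

```r
library(dplyr)
library(tidyr)

# Hypothetical data: one row is missing sex, another is missing bill length
demo <- data.frame(
  island = c("Biscoe", "Biscoe", "Dream", "Dream"),
  bill_length_mm = c(40, 50, 36, NA),
  sex = c("female", NA, "male", "male")
)

# drop_na() removes both incomplete rows before the mean is taken
with_drop_na <- demo %>%
  group_by(island) %>%
  drop_na() %>%
  summarize(mean_bill_length_mm = mean(bill_length_mm))
print(with_drop_na)

# na.rm = TRUE skips only the missing bill length, keeping the row with missing sex
with_na_rm <- demo %>%
  group_by(island) %>%
  summarize(mean_bill_length_mm = mean(bill_length_mm, na.rm = TRUE))
print(with_na_rm)
```

In this toy data the Biscoe mean is 40 with drop_na() and 45 with na.rm = TRUE, because the row with a missing sex value still has a valid bill length. Which one you want depends on the analysis.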

Maximum

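Suppose we want the maximum bill length for each of the three islands. The first script below does that. The second groups by both species and island and returns the maximum and the mean together.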

penguins %>% 
  group_by(island) %>%
  drop_na() %>%
  summarize(max_bill_length_mm = max(bill_length_mm))

penguins %>%
  group_by(species, island) %>%
  drop_na() %>%
  summarize(max_bill_length = max(bill_length_mm), mean_bill_length = mean(bill_length_mm))

Filter

Here are two examples of filtering data. The first keeps only the Adelie penguins, and the second keeps every species except Adelie.

penguins %>% filter(species == "Adelie")
penguins %>% filter(species != "Adelie")

The filter function in R is like the WHERE clause in the SQL SELECT statement. We can also combine filtering with the grouping and summarizing covered earlier.

We have another post on filtering in R.

penguins %>%
  filter(species == "Adelie") %>%
  group_by(species, island) %>%
  drop_na() %>%
  summarize(max_bill_length = max(bill_length_mm), mean_bill_length = mean(bill_length_mm))

Splitting Data

Sometimes you need to split data in a column. Suppose you have first and last name in one column called name and you want to split those into fname and lname. How do you do that? You can use the separate() function in R.
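A minimal sketch of that split, using separate() from tidyr on a made-up data frame. The name people and its values are assumptions for illustration. The sep argument says where to split, and into names the new columns.

```r
library(tidyr)

# Hypothetical data with first and last name in a single column
people <- data.frame(name = c("Ada Lovelace", "Alan Turing"))

# Split name at the space into fname and lname; the original column is dropped by default
people_split <- separate(people, name, into = c("fname", "lname"), sep = " ")
print(people_split)
```

Names with more than two parts, such as middle names, would need extra handling, which this sketch does not cover.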

Combine, Concatenate, Unite

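In R you can use the unite() function from tidyr to combine the values of several columns into one. In this example, which assumes the bookings_df data frame used later in this post, the new column name is arrival_month_year. The c() function combines the two column names into a single vector.

bookings_df %>%
  unite(arrival_month_year, c("arrival_date_month", "arrival_date_year"), sep = " ")

The unite() function does not require the c() function inside it. You can also list the columns directly, as in unite(arrival_month_year, arrival_date_month, arrival_date_year, sep = " ").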
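To try unite() without the bookings data, here is a small sketch on a made-up data frame. The name arrivals and its values are assumptions for illustration.

```r
library(tidyr)

# Hypothetical data with month and year in separate columns
arrivals <- data.frame(
  arrival_date_month = c("July", "August"),
  arrival_date_year = c(2015L, 2016L)
)

# Combine the two columns into one, separated by a space.
# The input columns are removed by default (remove = TRUE).
united <- unite(arrivals, arrival_month_year,
                arrival_date_month, arrival_date_year, sep = " ")
print(united)
```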

Add a New Calculated Column

As an example, suppose in the penguins dataset we have body mass in grams and we want to create a new column that shows those numbers in kilograms. We could use mutate.
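A minimal sketch of that conversion is shown here on a made-up data frame, so it runs without palmerpenguins installed. The name penguins_small and its values are assumptions. The column name body_mass_g matches the real penguins data, while body_mass_kg is a name chosen for this example.

```r
library(dplyr)

# Hypothetical stand-in for the penguins data
penguins_small <- data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

# mutate() adds a new column computed from an existing one
penguins_kg <- penguins_small %>%
  mutate(body_mass_kg = body_mass_g / 1000)
print(penguins_kg)
```

With the real dataset, the same mutate() call would work on penguins directly.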

Mutate

We have another post on mutate called mutate Function in R. You can also use the mutate() function to make changes to your columns. Let’s say you wanted to create a new column that summed up all the adults, children, and babies on a reservation to get the total number of people. Below is a code chunk from an R Markdown file. The first three characters are backticks, which can be found in the upper left part of the keyboard, to the left of the number one key. Together with the closing three backticks, they mark the start and end of the chunk.

```{r}
example_df <- bookings_df %>%
  mutate(guests = adults + children + babies)

head(example_df)
```
