Factors in R


In the R programming language, what is a factor? Factors are used to work with categorical variables: variables that have a fixed and known set of possible values. For example, a variable called Weekday can take only seven possible values: Sunday, Monday, Tuesday, Wednesday, Thursday, Friday and Saturday.

For more information, have a look at Chapter 15, Factors, of the online book R for Data Science. It is a very good, free online resource.

Suppose we have a variable that records the month as a character string, with values like “Jan”, “Feb” and “Mar”. We know that there are only 12 possible months. We could use strings, but that leaves us with two problems. First, nothing stops a typo from being entered into the variable. Second, the months don’t sort properly: they sort alphabetically, not chronologically. Factors help us solve both problems. Here is an RStudio console session that shows the sorting problem.

> x1 <- c("Dec", "Apr", "Jan", "Mar")
> x1
[1] "Dec" "Apr" "Jan" "Mar"
> sort(x1)
[1] "Apr" "Dec" "Jan" "Mar"

Levels

Let’s try this the better way. You can fix both of these problems with a factor. To create a factor, start by creating a character vector of the valid levels, in the order you want them to sort.

> month_levels <- c(
+     "Jan", "Feb", "Mar", "Apr", "May", "Jun", 
+     "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
+ )
>

Next you can create a factor.

> y1 <- factor(x1, levels = month_levels)
> 
> y1
[1] Dec Apr Jan Mar
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
> sort(y1)
[1] Jan Mar Apr Dec
Levels: Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
> 
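One way to see why sort() now returns chronological order: a factor stores each value as an integer position within its levels, and sorting uses those positions rather than the text. The following is a small sketch in base R that repeats the definitions above so it runs on its own.

```r
# Valid levels, in chronological order
month_levels <- c(
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

x1 <- c("Dec", "Apr", "Jan", "Mar")
y1 <- factor(x1, levels = month_levels)

levels(y1)       # the 12 month abbreviations, starting with "Jan"
as.integer(y1)   # position of each value within the levels: 12 4 1 3
```

Because “Jan” is level 1 and “Dec” is level 12, sorting by these codes puts the months in calendar order.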

Any values not in the set will be silently converted to NA. Let’s create a new variable called x2. It has a typo for the month of January. It has “Jim” instead of “Jan”.
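The session for x2 is not shown here, so what follows is only a minimal sketch of what it might look like. The values other than “Jim” are assumed to match x1, and y2 is a name chosen for illustration.

```r
# Valid levels, in chronological order (same as month_levels above)
month_levels <- c(
    "Jan", "Feb", "Mar", "Apr", "May", "Jun",
    "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
)

# Assumed values: the same as x1, but with "Jim" typed instead of "Jan"
x2 <- c("Dec", "Apr", "Jim", "Mar")

y2 <- factor(x2, levels = month_levels)
y2           # the "Jim" entry prints as <NA>
is.na(y2)    # TRUE only at the position of the typo
```

Since factor() raises no warning here, it can be worth running anyNA(y2) after the conversion to catch typos like this one.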

Data Frame

Let’s manually create a data frame. Below is an R script.

library(dplyr)  # provides %>% and count()

id <- 1:3
name <- c("Susan Smith", "Rachel Hickman", "Bob Johnson")
job_title <- c("Clerical", "President", "Management")
month_start <- c("Dec", "Jan", "Apr")
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
# Convert the month strings to a factor so they sort chronologically
month_begin <- factor(month_start, levels = month_levels)
employee <- data.frame(id, name, job_title, month_begin)
employee
# Count the employees by starting month
employee %>% count(month_begin)
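When month_begin is stored as a factor, ordering the rows of the data frame by that column follows the level order rather than the alphabet. The sketch below uses base R only and rebuilds the data frame so that it runs on its own. The name employee_sorted is introduced here for illustration.

```r
month_levels <- c("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                  "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

employee <- data.frame(
    id = 1:3,
    name = c("Susan Smith", "Rachel Hickman", "Bob Johnson"),
    job_title = c("Clerical", "President", "Management"),
    month_begin = factor(c("Dec", "Jan", "Apr"), levels = month_levels)
)

# order() on a factor follows the level order, so the rows come out Jan, Apr, Dec
employee_sorted <- employee[order(employee$month_begin), ]
employee_sorted
```

If month_begin were a plain character column, the same call would order the rows alphabetically: Apr, Dec, Jan.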

If we run View(employee) we can then see the data in the RStudio viewer, as shown in the screenshot below.
