Introduction to Data in R


This entry is part 1 of 2 in the series R Data

Getting your data into R in a form that is useful for visualization and modeling is a critical skill for a data analyst. This work belongs to the Prepare and Process phases of the data analytics life cycle.

A data frame is a collection of columns, a lot like a spreadsheet or a SQL table. Data frames put data into a format we can easily work with. First, columns should be named; empty column names can create problems with your results later on. Second, columns can hold many different types, such as numeric, factor, or character, and data frames often contain dates, time stamps, and logical vectors as well. Finally, every column must contain the same number of data items, even if some of those items are missing (recorded as NA).
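As a minimal sketch of these points, the snippet below builds a small data frame in base R. The name orders, its columns, and its values are made up for illustration; they do not come from a real dataset.

```r
# Build a small data frame; all names and values here are illustrative.
orders <- data.frame(
  id       = c(1L, 2L, 3L),                 # integer column
  customer = c("Ana", "Ben", "Cho"),        # character column
  amount   = c(19.99, NA, 42.50),           # numeric column; NA marks a missing value
  paid     = c(TRUE, FALSE, TRUE),          # logical column
  ordered  = as.Date(c("2023-01-05", "2023-01-06", "2023-01-09")),  # date column
  stringsAsFactors = FALSE                  # keep strings as character
)

str(orders)            # compact summary: one line per column with its type
dim(orders)            # number of rows, then number of columns
sapply(orders, class)  # class of each column
```

Every column has three items, including amount, where the missing value still occupies a slot as NA.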

In the tidyverse, tibbles are like streamlined data frames. They make working with data easier, but they’re a little different from standard data frames. First, tibbles never change the data types of the inputs: they won’t change your strings to factors (something data.frame() did by default before R 4.0) or anything else. Second, tibbles make printing in R easier. They won’t accidentally overload your console, because by default they show only the first 10 rows and as many columns as fit on screen.
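The two differences can be seen in a short sketch. It assumes the tibble package is installed; the object names df, tb, and big are hypothetical. The data frame is created with stringsAsFactors = TRUE only to reproduce the older default behavior for comparison.

```r
library(tibble)  # assumption: the tibble package is installed

# A data frame asked to convert strings to factors (the pre-R 4.0 default)
df <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)

# A tibble built from the same input keeps the strings as character
tb <- tibble(x = c("a", "b"))

class(df$x)  # factor
class(tb$x)  # character

# A longer tibble: printing shows only the first 10 rows by default
big <- tibble(n = 1:100, sq = n^2)
big
```

If you already have a data frame, as_tibble() from the same package converts it to a tibble.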

Tidy data refers to the principles that make data structures meaningful and easy to understand. It’s a way of standardizing the organization of data within R. The standards are pretty straightforward: each variable has its own column, each observation has its own row, and each value has its own cell. Getting your data into this tidy format requires some upfront work, but that work pays off in the long run. For more detail, see Chapter 12 of the book R for Data Science.
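To make the rules concrete, here is a small sketch that reshapes an untidy table into a tidy one. It assumes the tidyr package is installed; the table wide and its numbers are invented for illustration. In wide, the year variable is hidden in the column names, so each row holds two observations.

```r
library(tidyr)  # assumption: the tidyr package is installed

# Untidy: one column per year, so "year" is stored in the column names
wide <- data.frame(
  country = c("A", "B"),
  y2019   = c(10, 30),
  y2020   = c(20, 40)
)

# Tidy: one row per country-year observation, one column per variable
tidy <- pivot_longer(
  wide,
  cols         = c(y2019, y2020),  # columns to gather
  names_to     = "year",           # old column names become values of year
  names_prefix = "y",              # drop the leading "y" from those names
  values_to    = "cases"           # cell values go into cases
)
tidy$year <- as.integer(tidy$year)  # store year as a number, not text

tidy
```

After the reshape, country, year, and cases are each a column, and each of the four rows is a single observation.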

