Objectives

Learn how to use dplyr to clean and wrangle your data.
Ponder the shocking inequality among countries using the gapminder dataset.
Make your first plot with ggplot2.

Pre-lab assignments

Read R for Data Science, Chapters 3.1 to 3.4 and 3.6, and all of Chapter 5.

My assumptions

I am now assuming that you’ve taken the time to familiarize yourself with the R language. I will no longer explain basic concepts like data class, data mode, or variables versus observations. If this tutorial leaves you feeling overwhelmed (it shouldn’t!), then take some time to go back and review the Introductory R tutorial from last week.
I won’t clean data for you in this class. In any project, data wrangling/cleaning is most of the work, so you need to be familiar with techniques to clean data! In this tutorial, the dataset is quite clean… this won’t be the case in the future!

Set-up

First, if you haven’t already installed the gapminder package, type install.packages("gapminder") in the console to pull the package off of the CRAN repository onto your computer. The package documentation can be found here and more information about the Gapminder project can be found at www.gapminder.org. Take a second to learn more about the dataset… it’s pretty cool!

library(gapminder)
data(gapminder)

Remember that library() loads the package into your current R session and data() pulls the pre-made gapminder dataset into your Global Environment. We’ll learn how to import other types of data in future tutorials.

You’ll also want to be sure to install (using install.packages("tidyverse")) and load the tidyverse:

library(tidyverse) # this loads a suite of packages including dplyr and ggplot2

Our data

Let’s inspect the new gapminder dataset:

head(gapminder)

## # A tibble: 6 x 6
##   country     continent  year lifeExp      pop gdpPercap
##   <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333       779
## 2 Afghanistan Asia       1957    30.3  9240934       821
## 3 Afghanistan Asia       1962    32.0 10267083       853
## 4 Afghanistan Asia       1967    34.0 11537966       836
## 5 Afghanistan Asia       1972    36.1 13079460       740
## 6 Afghanistan Asia       1977    38.4 14880372       786

Our gapminder dataset includes the following variables:

country
continent
year
lifeExp or life expectancy
pop or population
gdpPercap or GDP per capita

Now let’s figure out what class each variable is. We could do this individually for each variable by typing class(gapminder$variable_name), or we could use the sapply() function to apply the class() function across all variables in the gapminder dataset. Remember, if you’re unfamiliar with any function, e.g. sapply(), you can ask R how it works by typing ?sapply in the console.

sapply(gapminder, class)

##   country continent      year   lifeExp       pop gdpPercap 
##  "factor"  "factor" "integer" "numeric" "integer" "numeric"

Our categorical variables, or variables that take on a limited number of possible values, are coded as "factor" variables. This will become very useful as we start to group data using the dplyr package. Our other variables are coded as numeric or integer.
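If you want to poke at a factor yourself, here is a minimal sketch. The continent vector below is a toy I made up to stand in for gapminder$continent, so the values are illustrative only:

```r
# Toy factor standing in for gapminder$continent (hypothetical values)
continent <- factor(c("Asia", "Africa", "Asia", "Europe", "Africa"))

class(continent)    # the class of the vector
levels(continent)   # the distinct categories the factor can take
nlevels(continent)  # how many categories there are
table(continent)    # number of observations in each category
```

The same functions should work on the real column, e.g. levels(gapminder$continent).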

As shown above, we can extract information for each variable in the dataset using $. For example, if we wanted to determine the range of years in the dataset we can simply type:

range(gapminder$year)

## [1] 1952 2007

So the data runs from 1952 to 2007. Is there data for every year over this period?

unique(gapminder$year)

##  [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007

Good to know. Looks like this dataset only has observations every five years.

Indexing using base `R`

I’ll refer to base R a lot in this class. When I say this, I’m referring to the functions that come in the basic R language and that are not associated with a particular package. The functions we’ve used so far like head() and unique() are base R functions. Later in this tutorial, we’ll start working with functions from the dplyr package that do not come from base R, including select(), filter() and arrange(). In general in this class, when I refer to packages, I will just list the package name. For example, I may refer to ggplot2, dplyr, or gapminder. I will refer to functions by including closed parentheses (), e.g. select(), filter(), etc. This is a reminder that functions almost always take arguments that you have to supply, e.g. mean(x) tells R to compute the mean of the variable x. Ok, now that we’re up to speed on notation and terms, let’s start playing with our data.

Let’s say you want to extract observations for the country of Sri Lanka. One option is to use base R to index the full dataset to create smaller datasets:

sri_lanka <- gapminder[gapminder$country == "Sri Lanka",]
head(sri_lanka)

## # A tibble: 6 x 6
##   country   continent  year lifeExp      pop gdpPercap
##   <fctr>    <fctr>    <int>   <dbl>    <int>     <dbl>
## 1 Sri Lanka Asia       1952    57.6  7982342      1084
## 2 Sri Lanka Asia       1957    61.5  9128546      1073
## 3 Sri Lanka Asia       1962    62.2 10421936      1074
## 4 Sri Lanka Asia       1967    64.3 11737396      1136
## 5 Sri Lanka Asia       1972    65.0 13016733      1213
## 6 Sri Lanka Asia       1977    65.9 14116836      1349

Don’t forget the , after "Sri Lanka"! This tells R that you want to find ALL rows in which country == "Sri Lanka" and to include ALL columns in the dataset. If you only wanted the first column, then you could type gapminder[gapminder$country == "Sri Lanka", 1]. Remember that in R, <- (or, less commonly, =) is used for assignment, i.e. to create a variable such as x <- 5. == is used for equality testing, i.e. if we want to confirm that the variable x we just created is, in fact, equal to 5, we could type x == 5. The console should return TRUE. Try it!
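To see what is going on under the hood, here is a small sketch using a toy data.frame (the object names toy and is_sri_lanka are mine, and the rows are made up). The equality test produces one TRUE or FALSE per row, and the square brackets keep only the rows marked TRUE:

```r
x <- 5     # assignment (x = 5 would also work)
x == 5     # equality test

# Toy data.frame with made-up rows, just to illustrate the indexing
toy <- data.frame(
  country = c("Sri Lanka", "Sri Lanka", "Norway"),
  year = c(1952L, 1957L, 1952L),
  stringsAsFactors = FALSE
)

# One TRUE/FALSE per row
is_sri_lanka <- toy$country == "Sri Lanka"
is_sri_lanka

# Rows go before the comma, columns after; an empty slot means "all"
toy[is_sri_lanka, ]
```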

Back to the indexing. If we didn’t want ALL of the columns and only wanted the variable gdpPercap for Sri Lanka, we could do the following:

sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", "gdpPercap"]
head(sri_lanka_gdp)

## # A tibble: 6 x 1
##   gdpPercap
##       <dbl>
## 1      1084
## 2      1073
## 3      1074
## 4      1136
## 5      1213
## 6      1349

Since gapminder is a tibble, this returns a one-column tibble (a plain data.frame would return a vector instead) listing all observations of Sri Lankan GDP per capita over the years included in the dataset. This isn’t very useful because we don’t know which year is associated with each observation. Let’s pull out the year too.
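One wrinkle worth knowing: what you get back from single-column indexing depends on the type of object. The sketch below uses a toy base R data.frame (made-up object names, two rows copied from the Sri Lanka output above). A plain data.frame drops down to a vector unless you ask it not to, while a tibble like gapminder keeps the one-column table shape, as the 6 x 1 output above shows:

```r
# Toy plain data.frame (not a tibble)
toy <- data.frame(year = c(1952L, 1957L), gdpPercap = c(1084, 1073))

as_vector <- toy[, "gdpPercap"]                # plain data.frame: a numeric vector
as_frame  <- toy[, "gdpPercap", drop = FALSE]  # stays a one-column data.frame

# [[ ]] and $ return the column as a vector for data.frames and tibbles alike
also_vector <- toy[["gdpPercap"]]
```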

sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", c("year", "gdpPercap")]
head(sri_lanka_gdp)

## # A tibble: 6 x 2
##    year gdpPercap
##   <int>     <dbl>
## 1  1952      1084
## 2  1957      1073
## 3  1962      1074
## 4  1967      1136
## 5  1972      1213
## 6  1977      1349

Here we create a vector of the column names we want to pull out of the dataset using c(), which combines values into a vector. Confused? Don’t worry, things will get easier as we introduce dplyr.

Indexing using `dplyr`

The dplyr package is one of a number of packages in the tidyverse set of packages that make data wrangling, indexing, and plotting much easier (and, dare I say, fun?). In this class, we’ll frequently use this set of packages. Yes, you can do the same things using base R, but while they may seem a bit trickier initially, the tools in the tidyverse are extremely powerful and worth learning. Not convinced? Check this out:

Instead of indexing with [,], $ and other symbols, dplyr contains several functions that make data organization much simpler:

select(): select columns
filter(): select rows
arrange(): order or arrange rows
mutate(): create new columns
summarize(): summarize values (for the Brits in the room, you can also use summarise())
group_by(): group observations

Ok, let’s try to create the same Sri Lanka dataset with year and GDP per capita using the dplyr package. Don’t forget to load either the tidyverse package which contains dplyr (library(tidyverse)) or just dplyr using library(dplyr) if you haven’t already:

sri_lanka_gdp <- gapminder %>%
  filter(country == "Sri Lanka") %>%
  select(year, gdpPercap)
head(sri_lanka_gdp)

## # A tibble: 6 x 2
##    year gdpPercap
##   <int>     <dbl>
## 1  1952      1084
## 2  1957      1073
## 3  1962      1074
## 4  1967      1136
## 5  1972      1213
## 6  1977      1349

A few things are happening here. First, you’re probably wondering what that crazy %>% thing is. I’ll get there. First, let’s look at the two functions I’m using to create the dataset. Since we only care about Sri Lanka, we start by filtering for the rows in which country == "Sri Lanka". We then select the columns we’re interested in, year and gdpPercap. So what is the %>% thing about? This is called a pipe. It allows you to “pipe” the output from one function to the input of another. In this example, we start with the full gapminder dataset. We feed the full dataset into the filter() function, which selects only rows for Sri Lanka. We then feed this new Sri Lankan dataset into the select() function to select only columns of interest to us. This keeps us from having to create two separate data.frames or from complicated indexing (e.g. c("year", "gdpPercap")). It also is very easy for other R programmers to read because it reads like plain old English.
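If the pipe still feels like magic, it may help to see that it is only a different way of writing nested function calls. Here is a minimal sketch, assuming dplyr is installed and using a toy data.frame with made-up values (the names toy, nested, and piped are mine):

```r
library(dplyr)

# Toy data with made-up values, standing in for gapminder
toy <- data.frame(
  country = c("Sri Lanka", "Sri Lanka", "Norway"),
  year = c(1952L, 1957L, 1952L),
  gdpPercap = c(1084, 1073, 10000)
)

# Without the pipe: read from the inside out
nested <- select(filter(toy, country == "Sri Lanka"), year, gdpPercap)

# With the pipe: read from top to bottom
piped <- toy %>%
  filter(country == "Sri Lanka") %>%
  select(year, gdpPercap)

# Both approaches should produce the same result
identical(nested, piped)
```

The piped version does the same work, but the steps appear in the order in which they happen.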

Not convinced yet that this trumps base R? Ok, say you want to know the average, maximum, and minimum GDP per capita for Sri Lanka between 1952 and 2007. No problem:

gapminder %>%
  filter(country == "Sri Lanka") %>%
  select(year, gdpPercap) %>%
  summarize(avg_gdp = mean(gdpPercap), max_gdp = max(gdpPercap), min_gdp = min(gdpPercap))

## # A tibble: 1 x 3
##   avg_gdp max_gdp min_gdp
##     <dbl>   <dbl>   <dbl>
## 1    1855    3970    1073

The summarize() function collapses all rows of a column into a single value by applying a function to them. mean(gdpPercap) takes the mean of all observations of gdpPercap for Sri Lanka and returns it as the new variable avg_gdp.

Let’s go a bit further. Let’s say you want to know which countries have the highest average life expectancy:

gapminder %>%
  group_by(country) %>%
  summarize(mean_le = mean(lifeExp)) %>%
  arrange(desc(mean_le))

## # A tibble: 142 x 2
##    country     mean_le
##    <fctr>        <dbl>
##  1 Iceland        76.5
##  2 Sweden         76.2
##  3 Norway         75.8
##  4 Netherlands    75.6
##  5 Switzerland    75.6
##  6 Canada         74.9
##  7 Japan          74.8
##  8 Australia      74.7
##  9 Denmark        74.4
## 10 France         74.3
## # ... with 132 more rows

So you can expect Bjork to live forever, which is honestly great news. Notice the use of the group_by() function. This is an important step. It groups the data by country (you could also group by year or any other categorical variable) and then applies the function specified in the summarize() function to each group of data, in our case, to each country. We use the function mean(), but you can apply any function you can find (or build!) to this grouped data. I find this simpler and much easier to read than answering the same question using base R, and this is why we as a class will invest time and energy in learning how to become tidyverse masters. I should add, however, that I’m a big proponent of the you do you philosophy, so if you feel strongly attached to base R and choose to use base R to work with your data, you do you. I’ll also say that on big projects, you tend to use a bit of both, so be sure you’ve reviewed the base R resources provided last week.
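For comparison, here is one way (of several) to answer the same kind of question in base R, using tapply() and sort(). This is only a sketch on a toy data.frame with made-up countries and values, so the object names and numbers are mine:

```r
# Toy data with made-up values
toy <- data.frame(
  country = c("A", "A", "B", "B"),
  lifeExp = c(60, 70, 40, 50)
)

# tapply() splits lifeExp by country and applies mean() to each group
mean_le <- tapply(toy$lifeExp, toy$country, mean)

# sort() with decreasing = TRUE plays the role of arrange(desc())
ranked <- sort(mean_le, decreasing = TRUE)
ranked
```

On the real data, the equivalent call would be tapply(gapminder$lifeExp, gapminder$country, mean). It works, but the result is a named array rather than a tidy table.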

What if we want to add a new variable to our dataset, say an indicator of whether or not a country is located in the continent of Africa? Enter the mutate() function, which allows us to easily add new variables to our data.frame.

africa <- gapminder %>%
  mutate(africa = ifelse(continent == "Africa", 1, 0))

head(africa)

## # A tibble: 6 x 7
##   country     continent  year lifeExp      pop gdpPercap africa
##   <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>  <dbl>
## 1 Afghanistan Asia       1952    28.8  8425333       779      0
## 2 Afghanistan Asia       1957    30.3  9240934       821      0
## 3 Afghanistan Asia       1962    32.0 10267083       853      0
## 4 Afghanistan Asia       1967    34.0 11537966       836      0
## 5 Afghanistan Asia       1972    36.1 13079460       740      0
## 6 Afghanistan Asia       1977    38.4 14880372       786      0

This creates a new data.frame to which I’ve added a new variable called africa that is equal to 1 if the observation is located in the continent of Africa and 0 if it is not. The ifelse() function is quite useful; check it out using ?ifelse. This has been your quick intro to dplyr. For more examples of data wrangling and manipulation with dplyr, I recommend this post as well as the pre-assignment readings written by the dplyr creator Hadley Wickham.
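mutate() can also build a new column out of existing ones, and it can create several columns in one call. Here is a minimal sketch, assuming dplyr is installed. The toy data.frame and the gdp_total column are my own illustration, not part of the gapminder dataset:

```r
library(dplyr)

# Toy data with made-up values
toy <- data.frame(
  continent = c("Africa", "Asia", "Africa"),
  pop = c(100, 200, 300),
  gdpPercap = c(10, 20, 30)
)

toy2 <- toy %>%
  mutate(
    africa = ifelse(continent == "Africa", 1, 0),  # indicator, as above
    gdp_total = pop * gdpPercap                    # hypothetical new column
  )

toy2
```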

Plotting using `ggplot2`

Manipulating data.frames is all fine and good, but the fun (yes, I’m telling you, this stuff can be fun!) really starts when you start visualizing data. Yet again, the tidyverse dominates with a powerful package called ggplot2. ggplot2 makes it easy for you to create beautiful data visualizations. Check out the ggplot gallery if you don’t believe me! This lab will give you a very short introduction to ggplot2. We’ll spend more time on this package in the following weeks and eventually learn how to plot spatial data with ggplot2 (and other packages).

Let’s start by plotting data from a single country. Let’s say we’re interested in how life expectancy has changed from the 1950s to present in the United States. Well, with dplyr it’s now easy for us to pull out data for the U.S. from our larger data.frame and assign it to a new, smaller data.frame called us.

us <- gapminder %>%
  filter(country == "United States")

Easy. Now to plot this. When plotting with ggplot2 you start by calling the ggplot() function. This creates a blank plot with a coordinate system that you can add data to. The first argument of the ggplot() function is the dataset you want to plot. In our case, this is the us data.frame we just built:

ggplot(data=us)

For now the plot is blank because we haven’t told ggplot2 how to deal with the data. We can add additional layers to the plot by using the + symbol. Each layer provides more information about how we’d like to plot the data. Say we want to plot points indicating life expectancy through time. You can get a full list of the types of plots at this website under the Layer:geoms section. For now, we’ll use the geom_point() function to plot points:

ggplot(data=us) +
  geom_point(mapping = aes(x = year, y = lifeExp))

geom_point() takes a mapping argument in which we specify the aesthetics with aes() and indicate which variable we’d like to plot on the x axis (year) and which we’d like to plot on the y axis (lifeExp). I know, this isn’t the most elegant way to do this, but once you get past the mapping/aesthetic specifications, adding additional detail is very easy.
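A variation you will often see: the mapping can go in the ggplot() call itself, in which case every layer you add inherits it. Here is a minimal sketch, assuming ggplot2 is installed. The us_toy data.frame is a made-up three-row stand-in for the us data.frame, with illustrative values:

```r
library(ggplot2)

# Toy stand-in for the us data.frame (illustrative values)
us_toy <- data.frame(
  year = c(1952, 1957, 1962),
  lifeExp = c(68.4, 69.5, 70.2)
)

# The mapping is given once in ggplot() and shared by both layers
p <- ggplot(data = us_toy, mapping = aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_line()

p
```

Putting the mapping in ggplot() saves typing when several layers use the same variables. Putting it inside a geom, as in the tutorial code, keeps it specific to that layer.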

Let’s add a few details to our plot. It could use better axis labels and a clear title. It could also be nice to change the color of the points to make them stand out a bit more:

ggplot(data=us) +
  geom_point(mapping = aes(x = year, y = lifeExp), color = "blue") +
  ggtitle("Life expectancy in the U.S.") +
  xlab("Year") +
  ylab("Life expectancy")

Much better. Depending on what we want to plot, we can change the geometry object we use. If we want a smoothed line plot, we could use geom_smooth:

ggplot(data=us) +
  geom_smooth(mapping = aes(x = year, y = lifeExp), color = "blue") +
  ggtitle("Life expectancy in the U.S.") +
  xlab("Year") +
  ylab("Life expectancy")

## `geom_smooth()` using method = 'loess'

Another cool trick is that we can pipe %>% our dplyr manipulations straight into a ggplot(). Let’s make a life expectancy plot for Sierra Leone using this approach:

gapminder %>%
  filter(country == "Sierra Leone") %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = lifeExp), color = "blue") +
  ggtitle("Life expectancy in Sierra Leone") +
  xlab("Year") +
  ylab("Life expectancy")

## `geom_smooth()` using method = 'loess'

What if we wanted to compare life expectancy in the United States with life expectancy in Sierra Leone in a single plot?

gapminder %>%
  filter(country %in% c("Sierra Leone", "United States")) %>%
  ggplot() +
  geom_smooth(mapping = aes(x = year, y = lifeExp, color = country)) +
  ggtitle("Comparing life expectancy") +
  xlab("Year") +
  ylab("Life expectancy")

## `geom_smooth()` using method = 'loess'

If you want to keep rows that match any of several values, use %in% rather than ==. This selects all rows in which country matches one of the countries in the vector we created using c(). Since our data.frame contains information from two countries, we can add an argument to the aes() mapping in geom_smooth() that tells ggplot2 to group observations by country and to draw them in two different colors. We do this by adding color = country inside aes().
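Why not just use == with a vector of countries? Because == compares element by element and silently recycles the shorter vector, which is rarely what you want. Here is a small base R sketch with a made-up country vector (the names country and wanted are mine):

```r
country <- c("United States", "Sierra Leone", "Norway", "United States")
wanted  <- c("Sierra Leone", "United States")

# %in% asks, for each element of country, "is it anywhere in wanted?"
with_in <- country %in% wanted
with_in   # TRUE TRUE FALSE TRUE

# == lines the two vectors up pairwise, recycling wanted as
# "Sierra Leone", "United States", "Sierra Leone", "United States"
with_eq <- country == wanted
with_eq   # FALSE FALSE FALSE TRUE
```

With ==, three of the four rows are misjudged here, and R does not warn you because the longer length is a multiple of the shorter one.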

Additional resources

In this lab, I’ve reviewed how to subset and wrangle your data using dplyr. You can also do this in base R and it’s often quite useful to know how to do this. I recommend the learnR tutorial or these tutorials on data subsetting and data manipulation. Make sure you’re familiar with how to index and wrangle data in base R before we proceed!
For more complex data merges and joins, check out this overview.

Managing your data

Dr. Emily Burchfield
