dplyr
to clean and wrangle your data.gapminder
dataset.ggplot2
.R
language. I will no longer explain basic concepts like data class, data mode, or variables versus observations. If this tutorial leaves you feeling overwhelmed (it shouldn’t!), then take some time to go back and review the Introductory R tutorial from last week.
First, if you haven’t already installed the gapminder package, type install.packages("gapminder") in the console to download the package from the CRAN repository onto your computer. The package documentation can be found here and more information about the Gapminder project can be found at www.gapminder.org. Take a second to learn more about the dataset… it’s pretty cool!
library(gapminder)
data(gapminder)
Remember that library() loads the package into your current R session and data() pulls the pre-made gapminder dataset into your Global Environment. We’ll learn how to import other types of data in future tutorials.
You’ll also want to be sure to install and load the following packages:
library(tidyverse) # this loads a suite of packages including dplyr and ggplot2
Let’s inspect the new gapminder dataset:
head(gapminder)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779
## 2 Afghanistan Asia 1957 30.3 9240934 821
## 3 Afghanistan Asia 1962 32.0 10267083 853
## 4 Afghanistan Asia 1967 34.0 11537966 836
## 5 Afghanistan Asia 1972 36.1 13079460 740
## 6 Afghanistan Asia 1977 38.4 14880372 786
Our gapminder dataset includes the following variables:
country
continent
year
lifeExp, or life expectancy
pop, or population
gdpPercap, or GDP per capita
Now let’s figure out what class each variable is. We could do this individually for each variable by typing class(gapminder$variable_name), or we could use the sapply() function to apply the class() function across all variables in the gapminder dataset. Remember, if you’re unfamiliar with any function, e.g. sapply(), you can ask R how it works using ?sapply.
sapply(gapminder, class)
## country continent year lifeExp pop gdpPercap
## "factor" "factor" "integer" "numeric" "integer" "numeric"
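To make the two approaches concrete, here is a small sketch that checks one variable with class() and then all of them with sapply(). The object name col_classes is my own and is only used for illustration.

```r
library(gapminder)

# Class of one variable at a time
class(gapminder$country)

# Class of every variable at once; sapply() returns a named character vector
col_classes <- sapply(gapminder, class)

# Look up a single entry by name (col_classes is a hypothetical helper name)
col_classes[["lifeExp"]]
```

Storing the result in an object is handy when you want to check classes programmatically rather than reading them off the console.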
Our categorical variables, or variables that take on a limited number of possible values, are coded as "factor" variables. This will become very useful as we start to group data using the dplyr package. Our other variables are coded as numeric or integer.
As shown above, we can extract any variable in the dataset using $. For example, if we wanted to determine the range of years in the dataset we can simply type:
range(gapminder$year)
## [1] 1952 2007
So the data runs from 1952 to 2007. Is there data for every year over this period?
unique(gapminder$year)
## [1] 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 2002 2007
Good to know. Looks like this dataset only has observations every five years.
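If you would rather not eyeball the spacing, a quick sketch like the one below checks it for you. The names years and gaps are my own; diff() is a base R function that returns the difference between consecutive values.

```r
library(gapminder)

# Sorted vector of the distinct years in the dataset
years <- sort(unique(gapminder$year))

# Difference between each year and the one before it
gaps <- diff(years)

# TRUE if every gap is exactly five years
all(gaps == 5)
```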
R
I’ll refer to base R a lot in this class. When I say this, I’m referring to the functions that come with the basic R language and that are not associated with a particular package. The functions we’ve used so far, like head() and unique(), are base R functions. Later in this tutorial, we’ll start working with functions from the dplyr package that do not come from base R, including select(), filter() and arrange(). In general in this class, when I refer to packages, I will just list the package name in code font.
For example, I may refer to ggplot2, dplyr, or gapminder. I will refer to functions by including closed parentheses (), e.g. select(), filter(), etc. This is a reminder that functions almost always take arguments that you have to supply, e.g. mean(x) tells R to compute the mean of the variable x. Ok, now that we’re up to speed on notation and terms, let’s start playing with our data.
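Here is a tiny sketch of what "supplying arguments" looks like in practice, using the base R function mean(). The vector x is made up for illustration, and na.rm is an optional argument of mean() that this tutorial has not introduced yet.

```r
# A made-up vector with one missing value
x <- c(2, 4, 6, NA)

# The required argument is the data; a missing value makes the result NA
mean(x)

# Optional arguments change the behavior: na.rm = TRUE drops NAs first
mean(x, na.rm = TRUE)   # 4
```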
Let’s say you want to extract observations for the country of Sri Lanka. One option is to use base R to index the full dataset to create smaller datasets:
sri_lanka <- gapminder[gapminder$country == "Sri Lanka",]
head(sri_lanka)
## # A tibble: 6 x 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Sri Lanka Asia 1952 57.6 7982342 1084
## 2 Sri Lanka Asia 1957 61.5 9128546 1073
## 3 Sri Lanka Asia 1962 62.2 10421936 1074
## 4 Sri Lanka Asia 1967 64.3 11737396 1136
## 5 Sri Lanka Asia 1972 65.0 13016733 1213
## 6 Sri Lanka Asia 1977 65.9 14116836 1349
Don’t forget the , after the row condition! This tells R that you want to find ALL rows in which country == "Sri Lanka" and to include ALL columns in the dataset. If you only wanted the first column, then you could type gapminder[gapminder$country == "Sri Lanka", 1]. Remember that in R, a single = (like the <- arrow used above) performs assignment, e.g. x = 5 creates a variable x. A double == is used for equality testing, e.g. if we want to confirm that the variable x we just created is, in fact, equal to 5, we could type x == 5. The console should return TRUE. Try it!
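If you want to try it, the following sketch walks through assignment and equality testing. The variable names x and y are just placeholders.

```r
# Assignment: both forms create a variable; <- is the usual R convention
x <- 5
y = 5

# Equality testing returns a logical value
x == 5   # TRUE
x == 6   # FALSE

# Two variables can be compared in the same way
x == y   # TRUE
```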
Back to the indexing. If we didn’t want ALL of the columns and only wanted the variable gdpPercap for Sri Lanka, we could do the following:
sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", "gdpPercap"]
head(sri_lanka_gdp)
## # A tibble: 6 x 1
## gdpPercap
## <dbl>
## 1 1084
## 2 1073
## 3 1074
## 4 1136
## 5 1213
## 6 1349
Since gapminder is a tibble, asking for a single variable still returns a tibble, here with one column listing all observations of Sri Lankan GDP per capita over the years included in the dataset (a plain data.frame would return a vector instead). This isn’t very useful because we don’t know what years are associated with each observation. Let’s pull out yearly data too.
sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", c("year", "gdpPercap")]
head(sri_lanka_gdp)
## # A tibble: 6 x 2
## year gdpPercap
## <int> <dbl>
## 1 1952 1084
## 2 1957 1073
## 3 1962 1074
## 4 1967 1136
## 5 1972 1213
## 6 1977 1349
Here we create a vector of the column names we want to pull out of the dataset using c(), which combines values into a vector or list. Confused? Don’t worry, things will get easier as we introduce dplyr.
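A short sketch may help demystify c(). The names keep_cols and gdp_vec are my own; the first part stores the column names before indexing, and the second shows one way to get a plain vector (rather than a one-column tibble) when you only want a single variable.

```r
library(gapminder)

# c() builds a character vector of column names (keep_cols is a hypothetical name)
keep_cols <- c("year", "gdpPercap")
class(keep_cols)

# Use that vector in the column slot of [rows, columns]
sri_lanka_gdp <- gapminder[gapminder$country == "Sri Lanka", keep_cols]
dim(sri_lanka_gdp)

# For a single variable as a plain numeric vector, index the column itself
gdp_vec <- gapminder$gdpPercap[gapminder$country == "Sri Lanka"]
is.numeric(gdp_vec)
```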
dplyr
The dplyr package is one of a number of packages in the tidyverse set of packages that make data wrangling, indexing, and plotting much easier (and, dare I say, fun?). In this class, we’ll frequently use this set of packages. Yes, you can do the same things using base R, but while the tools in the tidyverse may seem a bit trickier initially, they are extremely powerful and worth learning. Not convinced? Check this out.
Instead of indexing with [,], $ and other symbols, dplyr contains several functions that make data organization much simpler:
select(): select columns
filter(): select rows
arrange(): order or arrange rows
mutate(): create new columns
summarize(): summarize values (for the Brits in the room, you can also use summarise())
group_by(): group observations
Ok, let’s try to create the same Sri Lanka dataset with year and GDP per capita using the dplyr package. Don’t forget to load either the tidyverse package, which contains dplyr (library(tidyverse)), or just dplyr using library(dplyr) if you haven’t already:
sri_lanka_gdp <- gapminder %>%
filter(country == "Sri Lanka") %>%
select(year, gdpPercap)
head(sri_lanka_gdp)
## # A tibble: 6 x 2
## year gdpPercap
## <int> <dbl>
## 1 1952 1084
## 2 1957 1073
## 3 1962 1074
## 4 1967 1136
## 5 1972 1213
## 6 1977 1349
A few things are happening here. First, you’re probably wondering what that crazy %>% thing is. I’ll get there. Before that, let’s look at the two functions I’m using to create the dataset. Since we only care about Sri Lanka, we start by filtering for the rows in which country == "Sri Lanka". We then select the columns we’re interested in, year and gdpPercap. So what is the %>% thing about? This is called a pipe. It allows you to “pipe” the output from one function into the input of another. In this example, we start with the full gapminder dataset. We feed the full dataset into the filter() function, which keeps only the rows for Sri Lanka. We then feed this new Sri Lankan dataset into the select() function to keep only the columns of interest to us. This saves us from having to create two separate data.frames or from complicated indexing (e.g. c("gdpPercap", etc)). It is also very easy for other R programmers to read because it reads like plain old English.
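To see what the pipe is doing for us, here is a sketch that writes the same operation two ways. The object names piped and nested are mine. Without the pipe you have to nest the function calls and read them from the inside out.

```r
library(dplyr)
library(gapminder)

# With the pipe: read top to bottom
piped <- gapminder %>%
  filter(country == "Sri Lanka") %>%
  select(year, gdpPercap)

# Without the pipe: the same calls, nested inside one another
nested <- select(filter(gapminder, country == "Sri Lanka"), year, gdpPercap)

# Both approaches should give the same result
all.equal(piped, nested)
```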
Not convinced yet that this trumps base R? Ok, say you want to know the average, maximum, and minimum GDP per capita for Sri Lanka over the years covered by the dataset. No problem:
gapminder %>%
filter(country == "Sri Lanka") %>%
select(year, gdpPercap) %>%
summarize(avg_gdp = mean(gdpPercap), max_gdp = max(gdpPercap), min_gdp = min(gdpPercap))
## # A tibble: 1 x 3
## avg_gdp max_gdp min_gdp
## <dbl> <dbl> <dbl>
## 1 1855 3970 1073
The summarize() function collapses all of the rows it receives into a single row by applying a function to one or more columns. Here, mean(gdpPercap) takes the mean of all observations of gdpPercap for Sri Lanka and returns the average as the new variable avg_gdp.
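As a sanity check, the sketch below confirms that summarize() gives the same answer as calling base R’s mean() on the column directly. The names sl and sl_summary are my own, and n() is a dplyr helper that counts the rows being summarized.

```r
library(dplyr)
library(gapminder)

# Keep only the Sri Lanka rows (sl is a hypothetical name)
sl <- gapminder %>%
  filter(country == "Sri Lanka")

# Collapse those rows into a single summary row
sl_summary <- sl %>%
  summarize(avg_gdp = mean(gdpPercap), n_obs = n())

# The same average computed the base R way
mean(sl$gdpPercap)
```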
Let’s go a bit further. Let’s say you want to know which countries have the highest average life expectancy:
gapminder %>%
group_by(country) %>%
summarize(mean_le = mean(lifeExp)) %>%
arrange(desc(mean_le))
## # A tibble: 142 x 2
## country mean_le
## <fctr> <dbl>
## 1 Iceland 76.5
## 2 Sweden 76.2
## 3 Norway 75.8
## 4 Netherlands 75.6
## 5 Switzerland 75.6
## 6 Canada 74.9
## 7 Japan 74.8
## 8 Australia 74.7
## 9 Denmark 74.4
## 10 France 74.3
## # ... with 132 more rows
So you can expect Bjork to live forever, which is honestly great news. Notice the use of the group_by() function. This is an important step. It groups the data by country (you could also group by year or any other categorical variable) and then applies the function specified in summarize() to each group of data, in our case, to each country. We use the function mean(), but you can apply any function you can find (or build!) to this grouped data. I find this simpler and much easier to read than answering the same question using base R, and this is why we as a class will invest time and energy in learning how to become tidyverse masters. I should add, however, that I’m a big proponent of the you do you philosophy, so if you feel strongly attached to base R and choose to use base R to work with your data, you do you. I’ll also say that on big projects, you tend to use a bit of both, so be sure you’ve reviewed the base R resources provided last week.
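Here is a sketch of the "or build!" part: a small home-made function applied to each group. The function le_spread() and the object by_continent are hypothetical names of my own, and I group by continent instead of country to keep the output short. n_distinct() is a dplyr helper that counts unique values.

```r
library(dplyr)
library(gapminder)

# A home-made function: the gap between the highest and lowest value
le_spread <- function(x) {
  max(x) - min(x)
}

# Apply it to each continent, along with a count of countries
by_continent <- gapminder %>%
  group_by(continent) %>%
  summarize(spread_le = le_spread(lifeExp),
            n_countries = n_distinct(country)) %>%
  arrange(desc(spread_le))

by_continent
```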
What if we want to add a new variable to our dataset, say an indicator of whether or not a country is located in the continent of Africa? Enter the mutate() function, which allows us to easily add new variables to our data.frame.
africa <- gapminder %>%
mutate(africa = ifelse(continent == "Africa", 1, 0))
head(africa)
## # A tibble: 6 x 7
## country continent year lifeExp pop gdpPercap africa
## <fctr> <fctr> <int> <dbl> <int> <dbl> <dbl>
## 1 Afghanistan Asia 1952 28.8 8425333 779 0
## 2 Afghanistan Asia 1957 30.3 9240934 821 0
## 3 Afghanistan Asia 1962 32.0 10267083 853 0
## 4 Afghanistan Asia 1967 34.0 11537966 836 0
## 5 Afghanistan Asia 1972 36.1 13079460 740 0
## 6 Afghanistan Asia 1977 38.4 14880372 786 0
This creates a new data.frame to which I’ve added a new variable called africa that is equal to 1 if the observation is located in the continent of Africa and 0 if it is not. The ifelse() function is quite useful; check it out using ?ifelse. This has been your quick intro to dplyr. For more examples of data wrangling and manipulation with dplyr, I recommend this post as well as the pre-assignment readings written by the dplyr creator Hadley Wickham.
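mutate() is not limited to indicator variables. As a sketch, the example below adds two columns in one call: total GDP computed from existing columns, and a TRUE/FALSE version of the Africa indicator. The names gap_gdp, gdp_total and is_africa are my own.

```r
library(dplyr)
library(gapminder)

gap_gdp <- gapminder %>%
  mutate(gdp_total = pop * gdpPercap,        # new column from existing ones
         is_africa = continent == "Africa")  # logical indicator, no ifelse() needed

head(gap_gdp)
```

A logical column works anywhere a 1/0 indicator would, and sum(gap_gdp$is_africa) counts the African observations directly.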
ggplot2
Manipulating data.frames is all fine and good, but the fun (yes, I’m telling you, this stuff can be fun!) really starts when you start visualizing data. Yet again, the tidyverse dominates with a powerful package called ggplot2. ggplot2 makes it easy for you to create beautiful data visualizations. Check out the ggplot gallery if you don’t believe me! This lab will give you a very short introduction to ggplot2. We’ll spend more time on this package in the following weeks and eventually learn how to plot spatial data with ggplot2 (and other packages).
Let’s start by plotting data from a single country. Let’s say we’re interested in how life expectancy has changed in the United States from the 1950s through 2007, the last year in the dataset. Well, with dplyr it’s now easy for us to pull out data for the U.S. from our larger data.frame and assign it to a new, smaller data.frame called us.
us <- gapminder %>%
filter(country == "United States")
Easy. Now to plot this. When plotting with ggplot2 you start by calling the ggplot() function. This creates a blank plot with a coordinate system that you can add data to. The first argument of the ggplot() function is the dataset you want to plot. In our case, this is the us data.frame we just built:
ggplot(data=us)
For now the plot is blank because we haven’t told ggplot2 how to deal with the data. We can add additional layers to the plot by using the + symbol. Each layer provides more information about how we’d like to plot the data. Say we want to plot points indicating life expectancy through time. You can get a full list of the types of plots at this website under the Layer: geoms section. For now, we’ll use the geom_point() function to plot points:
ggplot(data=us) +
geom_point(mapping = aes(x = year, y = lifeExp))
geom_point() takes a mapping argument in which we specify the aesthetics with aes() and indicate which variable we’d like to plot on the x axis (year) and which we’d like to plot on the y axis (lifeExp). I know, this isn’t the most elegant way to do this, but once you get past the mapping/aesthetic specifications, adding additional detail is very easy.
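One variation worth knowing, shown here as a sketch: the mapping can also be supplied once in the ggplot() call, in which case every layer you add inherits it. The object name p is my own, and geom_line() is a second ggplot2 geometry that connects the points.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

us <- gapminder %>%
  filter(country == "United States")

# The mapping set in ggplot() is shared by both layers below
p <- ggplot(data = us, mapping = aes(x = year, y = lifeExp)) +
  geom_point() +
  geom_line()

p  # printing the object draws the plot
```

Setting the mapping once saves typing when several layers use the same x and y variables.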
Let’s add a few new details to our plot. It could use better axis labels and a clear title. It would also be nice to change the color of the points to make them stand out a bit more:
ggplot(data=us) +
geom_point(mapping = aes(x = year, y = lifeExp), color = "blue") +
ggtitle("Life expectancy in the U.S.") +
xlab("Year") +
ylab("Life expectancy")
Much better. Depending on what we want to plot, we can change the geometry object we use. If we want a smoothed line plot, we could use geom_smooth():
ggplot(data=us) +
geom_smooth(mapping = aes(x = year, y = lifeExp), color = "blue") +
ggtitle("Life expectancy in the U.S.") +
xlab("Year") +
ylab("Life expectancy")
## `geom_smooth()` using method = 'loess'
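Layers can also be stacked. The sketch below draws the raw points and the smoothed line together, names the smoothing method explicitly so ggplot2 has no reason to print the message above, and sets the title and axis labels in a single labs() call. The object name p_us is my own; se = FALSE just hides the shaded confidence band.

```r
library(dplyr)
library(ggplot2)
library(gapminder)

us <- gapminder %>%
  filter(country == "United States")

p_us <- ggplot(data = us, mapping = aes(x = year, y = lifeExp)) +
  geom_point(color = "blue") +                                  # raw observations
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +  # smoothed trend
  labs(title = "Life expectancy in the U.S.",
       x = "Year",
       y = "Life expectancy")

p_us
```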
Another cool trick is that we can pipe (%>%) our dplyr manipulations straight into ggplot(). Let’s make a life expectancy plot for Sierra Leone using this approach:
gapminder %>%
filter(country == "Sierra Leone") %>%
ggplot() +
geom_smooth(mapping = aes(x = year, y = lifeExp), color = "blue") +
ggtitle("Life expectancy in Sierra Leone") +
xlab("Year") +
ylab("Life expectancy")
## `geom_smooth()` using method = 'loess'
What if we wanted to compare life expectancy in the United States with life expectancy in Sierra Leone in a single plot?
gapminder %>%
filter(country %in% c("Sierra Leone", "United States")) %>%
ggplot() +
geom_smooth(mapping = aes(x = year, y = lifeExp, color = country)) +
ggtitle("Comparing life expectancy") +
xlab("Year") +
ylab("Life expectancy")
## `geom_smooth()` using method = 'loess'
If you want to match a variable against several values at once, use %in% rather than ==. This selects all rows with country equal to any of the countries in the vector we created using c(). Since our data.frame contains information from two countries, we can add an argument to the aes() call inside geom_smooth() that tells ggplot2 to group observations by country and to symbolize them using two different colors. We do this by adding the argument color = country.
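Why not ==? As a sketch, the comparison below shows what goes wrong. When you compare a column with a longer vector using ==, R recycles the vector down the rows instead of checking membership, so rows get silently dropped. The names countries, with_in and n_eq are my own.

```r
library(dplyr)
library(gapminder)

countries <- c("Sierra Leone", "United States")

# %in% checks each row against every value in the vector
with_in <- gapminder %>%
  filter(country %in% countries)
nrow(with_in)   # 24: twelve years for each of the two countries

# == compares row 1 to the first value, row 2 to the second, and so on,
# so it finds fewer matches than it should
n_eq <- sum(gapminder$country == countries)
n_eq < nrow(with_in)
```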
dplyr
. You can also do this in base R, and it’s often quite useful to know how. I recommend the learnR tutorial or these tutorials on data subsetting and data manipulation. Make sure you’re familiar with how to index and wrangle data in base R before we proceed!