Use dplyr to clean and wrangle your data in R, working with the gapminder dataset.
This tutorial assumes you are comfortable with the basics of the R language. I will no longer explain basic concepts like data class, data mode, or variables versus observations.

If you haven’t already installed the dslabs package, type install.packages("dslabs") in the Console to download the package from the CRAN repository onto your computer. Once this package is downloaded, you can pull it into an R session with the library() command. This package contains a number of cool datasets. Today, we’ll play with the gapminder dataset, which describes the demographic and socioeconomic attributes of countries around the world. You can learn more about the Gapminder project at www.gapminder.org.
library(dslabs) # load the dslabs package
data(gapminder) # pull the gapminder data into your RStudio session
The data() function pulls the gapminder dataset that was downloaded with the dslabs package into your Global Environment (window to the right in RStudio). We’ll learn how to import other types of data in future tutorials.
You’ll also want to install and load the tidyverse package. This is actually a suite of packages that includes dplyr for cleaning data (this chapter) and ggplot2 for visualizing data (next chapter). You can read more about the cool tools of the tidyverse in the tidyverse documentation.
library(tidyverse) # this loads a suite of packages including dplyr and ggplot2
dplyr
Let’s inspect the new gapminder dataset.
One of the first things I tend to do when I load a new dataset into my RStudio session is to use the head() function to look at the first few rows of the data. This lets me confirm that the data was, in fact, loaded correctly. Remember that functions are the verbs of the R programming language. Functions tend to take arguments, or inputs that give the verb a bit more information about how it should act. In our case, in order for the head() function to “return the first or last parts of a vector, matrix, table, data frame or function”, it needs to know which data it should inspect:
head(gapminder)
## country year infant_mortality life_expectancy fertility
## 1 Albania 1960 115.40 62.87 6.19
## 2 Algeria 1960 148.20 47.50 7.65
## 3 Angola 1960 208.00 35.98 7.32
## 4 Antigua and Barbuda 1960 NA 62.97 4.43
## 5 Argentina 1960 59.87 65.39 3.11
## 6 Armenia 1960 NA 66.86 4.55
## population gdp continent region
## 1 1636054 NA Europe Southern Europe
## 2 11124892 13828152297 Africa Northern Africa
## 3 5270844 NA Africa Middle Africa
## 4 54681 NA Americas Caribbean
## 5 20619075 108322326649 Americas South America
## 6 1867396 NA Asia Western Asia
The output of this line of code is simply the first six rows of the gapminder dataset. Note that you can also type View(gapminder) in the Console to see the full dataset in a new window (kinda’ like what you’d expect in Excel). You can also look at the dataset by clicking on the blue arrow to the left of the dataset in the Global Environment window.
Visually inspecting the data is useful because it lets us quickly confirm that our data was loaded correctly. For example, we can now see that our gapminder dataset includes the following columns:
country
year
infant_mortality, or infant deaths per 1,000 births
life_expectancy, or life expectancy in years
fertility, or the average number of children per woman
population, or country population
gdp, or GDP according to the World Bank
continent
region, or geographical region

Each of these columns represents a different variable, in this case information describing the year and attributes of different countries. We refer to each row in the table as an observation, or a discrete country-year instance in the data. We can ask R programmatically how many rows and columns are in the dataset using the following functions:
dim(gapminder) # dimensions, returned in row, column format
## [1] 10545 9
nrow(gapminder) # number of rows
## [1] 10545
ncol(gapminder) # number of columns
## [1] 9
Now let’s determine the class of each variable. We could do this individually for each variable by typing class(gapminder$variable_name), or we could use the sapply() function to apply the class() function across all variables in the gapminder dataset. Remember, if you’re unfamiliar with any function, e.g. sapply(), you can ask R how it works using a question mark, e.g. ?sapply.
sapply(gapminder, class)
## country year infant_mortality life_expectancy
## "factor" "integer" "numeric" "numeric"
## fertility population gdp continent
## "numeric" "numeric" "numeric" "factor"
## region
## "factor"
Our categorical variables (country, continent, and region), or variables that take on a limited number of possible values, are coded as "factor" variables. Our other variables are coded as numeric (continuous numerical data) or integer (discrete valued numerical data, like year).
We can extract each variable in the dataset using $. For example, if we want to determine the range of years in the dataset, we can simply type:
range(gapminder$year)
## [1] 1960 2016
What if we wanted to generate a list of the unique regions in the dataset? Simply printing out the vector gapminder$region would be insufficient since regions are repeated over multiple years and countries. Type gapminder$region in the Console to see what I mean! One of the functions I use the most when checking data is the unique() function. This function returns a vector of the unique values in a larger vector. Check out this example:
x <- c(1, 1, 1, 2, 3, 4, 5)
unique(x)
## [1] 1 2 3 4 5
In the case of our gapminder data, this function could be used to generate a list of the unique regions:
unique(gapminder$region)
## [1] Southern Europe Northern Africa
## [3] Middle Africa Caribbean
## [5] South America Western Asia
## [7] Australia and New Zealand Western Europe
## [9] Southern Asia Eastern Europe
## [11] Central America Western Africa
## [13] Southern Africa South-Eastern Asia
## [15] Eastern Africa Northern America
## [17] Eastern Asia Northern Europe
## [19] Melanesia Polynesia
## [21] Central Asia Micronesia
## 22 Levels: Australia and New Zealand Caribbean Central America ... Western Europe
Nice! Another thing we often want to do with our data is compute descriptive statistics like the mean, median, mode, or standard deviation. This isn’t hard in R. For example, what if we want to know the average life expectancy across time and across all countries?
mean(gapminder$life_expectancy)
## [1] 64.81162
This means that from 1960 to 2016, and across all countries, the average life expectancy is about 64.8 years.
How about the mean rate of infant mortality?
mean(gapminder$infant_mortality)
## [1] NA
Wutttt!?
NA means “Not Available” and is typically used to encode missing data, or data that, for whatever reason, is not available in your dataset. It is very important to understand why data is missing and to think through the implications for any visualizations, descriptive statistics, or analyses you conduct with the data. For example, imagine that in several countries, a severe drought causes widespread famine. During this crisis, the countries are unable to report national health statistics. The national data you are analyzing may just list NA for these countries, essentially dropping them from any analyses you conduct or visualizations you create. The reality, however, is that humans continued to live in these countries and experienced very real health outcomes, likely worse than global averages due to the famine. As a data scientist, it is imperative that you locate and understand the implications of missing data in your dataset, so that as you transform your data into information to inform decision-making, you can do it in a way that is honest.
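Before ignoring missing data, it helps to know where it lives. Here is a minimal sketch of one way to locate it, using a small made-up data frame (the name health and its values are my own invention for illustration, not part of gapminder):

```r
# Toy data frame with made-up values and a few missing entries
health <- data.frame(
  country = c("A", "B", "C", "D"),
  infant_mortality = c(12.5, NA, 48.0, NA),
  life_expectancy = c(78.1, 64.3, NA, 70.2)
)

# is.na() flags every missing cell; colSums() counts the TRUEs in each column
missing_per_column <- colSums(is.na(health))
print(missing_per_column)

# Which countries are missing infant_mortality?
missing_rows <- as.character(health$country[is.na(health$infant_mortality)])
print(missing_rows)
```

The same pattern should work on the real data, e.g. colSums(is.na(gapminder)), if you want a quick per-column count before you start summarizing.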
With that important caveat stated, I’m going to show you how to programmatically ignore missing data so that you can still compute descriptive statistics on the data you do have. Let’s start with a simple example. Say you have a simple vector x with the following values:
x <- c(1, NA, 3, 4, 5)
For whatever reason, the second element of this vector is missing, or NA. If I try to compute the mean of this vector using the mean() function, here’s what happens:
mean(x)
## [1] NA
Anytime a vector contains missing data, most R functions will return an NA. This can be annoying, yes, but it’s actually R’s helpful reminder that “hey man, don’t forget you’ve got missing data there!” Let’s assume you totally understand why the second element is missing and the implications of this missing element for your analysis. Nice. In that case, you can go ahead and force R to ignore the missing data by adding the na.rm = T argument:
mean(x, na.rm = T)
## [1] 3.25
This second argument to the mean() function overrides the default value of na.rm = F (here T stands for TRUE and F stands for FALSE). By turning this argument on, you’re essentially saying “R, please remove the NA values and then compute the mean.”
Ok, so how does all of this apply to our gapminder example? We can use the same argument to tell the mean() function to ignore the NA values in our infant_mortality vector:
mean(gapminder$infant_mortality, na.rm = T)
## [1] 55.30862
This means that the global average (after dropping missing data) for infant deaths per 1,000 births over the last ~60 years is about 55.3. Two things might stand out here. One is that we are talking about infant mortality… this number has extremely real implications in the real world. This is why I always want you to take a step back and think about what the numbers and visualizations you generate actually mean for people and planet. Your second reaction might be, huh, this is interesting, but it would be more interesting if I could zoom in and look at differences across countries and through time. Well my friend, get ready for the tidyverse:
dplyr
The dplyr package is one of a number of packages in the tidyverse set of packages that make data wrangling, indexing, and plotting much, much easier than with base R tools (and, dare I say, fun?).
The dplyr package contains several functions (think: verbs) that make querying your data much simpler:
select(): select columns
filter(): select rows
arrange(): order or arrange rows
mutate(): create new columns
summarize(): summarize values (for the Brits in the room, you can also use summarise())
group_by(): group observations

What I love most about dplyr is that the functions are so intuitive, it often feels like I’m having a conversation with my data:
“Hey dplyr, tell me what you know about infant mortality in Sri Lanka in 2000!”
# you got it
gapminder %>%
filter(country == "Sri Lanka", year == 2000) %>%
select(infant_mortality)
## infant_mortality
## 1 14
“You know what, I changed my mind, I’d rather know about the United States in the same year…”
# easy game
gapminder %>%
filter(country == "United States", year == 2000) %>%
select(infant_mortality)
## infant_mortality
## 1 7.1
“Scratch that, I actually want to know about Belgium, France, Morocco, and Nigeria… all at once please!”
# you're starting to get on my nerves, but sure, fine
gapminder %>%
filter(country %in% c("Belgium", "France", "Morocco", "Nigeria"), year == 2000) %>%
select(country, infant_mortality)
## country infant_mortality
## 1 Belgium 4.8
## 2 France 4.4
## 3 Morocco 42.2
## 4 Nigeria 112.0
A few things are happening here. First, you’re probably wondering what that crazy %>% thing is. This is called a pipe. It allows you to “pipe” the output from one function into another function. In the first example, we start with the full gapminder dataset. We feed the full dataset into the filter() function. This function keeps only the rows in which some condition is true, in our case where country == "Sri Lanka" (the country is Sri Lanka) and where year == 2000 (the year is 2000). Why the double equals sign? This is important. In R, one equals sign (=) assigns a value, as in:
x = 1
print(x)
## [1] 1
A double equals sign tests whether something is true, so:
x == 1
## [1] TRUE
x == 2
## [1] FALSE
In our filter() function, we want to keep only the rows where country == "Sri Lanka" and year == 2000 (see the note at the end of this tutorial on why only the country is in quotes). Once we filter down our dataset, we can use select() to pull out the columns of interest to us (infant_mortality).
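If the pipe still feels like magic, it may help to see that a piped chain is just another way of writing nested function calls. Here is a small sketch on a made-up data frame (toy is a stand-in I invented so the example runs without gapminder):

```r
library(dplyr)

# Toy data with made-up rows, standing in for gapminder
toy <- data.frame(
  country = c("Sri Lanka", "Sri Lanka", "France"),
  year = c(1999, 2000, 2000),
  infant_mortality = c(15, 14, 4.4)
)

# Piped version: read left to right, top to bottom
piped <- toy %>%
  filter(country == "Sri Lanka", year == 2000) %>%
  select(infant_mortality)

# Nested version: the same steps, but read from the inside out
nested <- select(
  filter(toy, country == "Sri Lanka", year == 2000),
  infant_mortality
)

# Both approaches return the same one-row result
print(identical(piped, nested))
```

The piped version tends to be easier to read once you chain three or more steps, which is why the rest of this tutorial sticks with it.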
How about that last example with four countries? Here, all I did was say, hey filter(), find all rows where country is %in% this list of countries. I could also do the opposite, and keep all rows where a condition is not true:
gapminder %>%
filter(country != "Sri Lanka")
Or all countries that are not in a list:
gapminder %>%
filter(!country %in% c("Belgium", "France", "Morocco", "Nigeria"))
The exclamation point here reads like the word “not”, so it’s like saying “hey dplyr, keep the countries NOT in this list or NOT equal to this.” We’ll come back to this in a sec…
Not convinced yet that dplyr trumps base R? OK, say you want to know the average, maximum, and minimum GDP for Sri Lanka across all the years in the dataset. No problem:
gapminder %>%
filter(country == "Sri Lanka") %>%
select(year, gdp) %>%
summarize(avg_gdp = mean(gdp),
max_gdp = max(gdp),
min_gdp = min(gdp))
## avg_gdp max_gdp min_gdp
## 1 NA NA NA
Whoops! This means there’s missing data. We can fix this using the same na.rm = T argument:
gapminder %>%
filter(country == "Sri Lanka") %>%
select(year, gdp) %>%
summarize(avg_gdp = mean(gdp, na.rm = T),
max_gdp = max(gdp, na.rm = T),
min_gdp = min(gdp, na.rm = T))
## avg_gdp max_gdp min_gdp
## 1 10425011328 29260877188 2708601390
The summarize() function takes all of the rows in a column and collapses them into a single value using the function you give it. Here, mean(gdp, na.rm = T) takes the mean of all non-missing observations of gdp for Sri Lanka and returns the average, stored in the new variable avg_gdp.
Feeling confused? Great! Let’s go through each of the main dplyr functions in a bit more detail to make sure you understand how they work.
filter()
What if we’re interested in some of the countries in Europe in which people speak French (oui oui)? We can no longer use the format we used above for one country, Sri Lanka, where we simply used filter(country == "Sri Lanka"). Now we need to keep rows that belong to a list of countries. We can do this as follows:
francophone <- gapminder %>%
filter(country %in% c("France", "Belgium", "Switzerland")) # ok, I forgot a few small countries
head(francophone)
## country year infant_mortality life_expectancy fertility population
## 1 Belgium 1960 29.5 69.59 2.60 9140563
## 2 France 1960 23.7 70.49 2.77 45865699
## 3 Switzerland 1960 21.6 71.46 2.52 5296120
## 4 Belgium 1961 28.1 70.46 2.63 9200393
## 5 France 1961 22.4 71.07 2.80 46471083
## 6 Switzerland 1961 21.2 71.79 2.55 5393411
## gdp continent region
## 1 68236665814 Europe Western Europe
## 2 349778187326 Europe Western Europe
## 3 NA Europe Western Europe
## 4 71634993490 Europe Western Europe
## 5 369037927246 Europe Western Europe
## 6 NA Europe Western Europe
Nice! You could confirm your filter() worked as planned using good ol’ unique():
unique(francophone$country)
## [1] Belgium France Switzerland
## 185 Levels: Albania Algeria Angola Antigua and Barbuda Argentina ... Zimbabwe
Nailed it. If we want to expand La Francophonie, we simply add more countries to our list:
francophone <- gapminder %>%
filter(country %in% c("Belgium", "France", "Switzerland", "Morocco", "Madagascar"))
unique(francophone$country)
## [1] Belgium France Madagascar Morocco Switzerland
## 185 Levels: Albania Algeria Angola Antigua and Barbuda Argentina ... Zimbabwe
Another nice trick is to know how to filter by removing a country, as we saw above. Say, for example, we want to create a dataset with all countries except for Brazil (desculpa), we could do the following:
not_brazil <- gapminder %>%
filter(country != "Brazil")
The != symbols mean “not equal to.” What if we want to filter out the countries in a longer list?
not_some_countries <- gapminder %>%
filter(!country %in% c("Brazil", "Spain", "Mexico"))
To do this, you put an ! in front of the variable on which you are filtering. This reads something like “not country in Brazil, Spain, and Mexico.” The R for Data Science text has a fantastic summary of how you can use logical operators to subset and filter your data.
So for example, you could keep only the rows where both condition 1 and condition 2 are true using &:
gapminder %>%
filter(year == 2016) %>%
filter(region == "Western Asia" & life_expectancy > 75) # note that you can also just use a comma here instead of &
Here we have the list of countries in Western Asia in 2016 where life expectancy was above 75 years.
We can also keep the rows where condition 1 or condition 2 is true using |:
gapminder %>%
filter(year == 2016) %>%
filter(region == "Western Asia" | life_expectancy > 75)
This is a weird result, as it shows countries that are either in Western Asia or that have a life expectancy above 75.
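To make the difference between & and | concrete, here is a tiny sketch with four made-up countries (the data frame toy and its values are invented for illustration):

```r
library(dplyr)

# Four made-up countries: two per region, one above and one below 75 in each
toy <- data.frame(
  country = c("A", "B", "C", "D"),
  region = c("Western Asia", "Western Asia", "Western Europe", "Western Europe"),
  life_expectancy = c(80, 70, 82, 74)
)

# & keeps rows where BOTH conditions are true
both <- toy %>%
  filter(region == "Western Asia" & life_expectancy > 75)

# | keeps rows where AT LEAST ONE condition is true
either <- toy %>%
  filter(region == "Western Asia" | life_expectancy > 75)

both_countries <- as.character(both$country)
either_countries <- as.character(either$country)
print(both_countries)    # A
print(either_countries)  # A, B, C
```

Country D is the only one dropped by the | filter, since it is neither in Western Asia nor above 75.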
Finally, we can also use dplyr to remove missing data from our data.frame:
gapminder %>%
filter(!is.na(life_expectancy))
filter() kept only the rows where life_expectancy is not (shown by the !) NA, dropping the rows where it is missing. The is.na() function is a good way to test for missing data. It returns a set of logical TRUE, FALSE indicators of whether each observation is flagged as NA:
x <- c(1, 2, 3)
is.na(x)
## [1] FALSE FALSE FALSE
y <- c(NA, 2, 3)
is.na(y)
## [1] TRUE FALSE FALSE
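You might be tempted to test for missing data with == NA instead. The short sketch below shows why that does not work: comparing anything to NA gives back NA, not TRUE or FALSE.

```r
y <- c(NA, 2, 3)

# Comparing with == never returns TRUE for a missing value;
# every comparison against NA is itself NA
compared <- y == NA
print(compared)

# is.na() is the reliable test
flags <- is.na(y)
print(flags)

# Since TRUE counts as 1, summing the flags counts the missing values
n_missing <- sum(is.na(y))
print(n_missing)
```

This is why the filter above uses !is.na(life_expectancy) rather than life_expectancy != NA.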
select()
select() simply selects the columns you’re interested in working with. This can be useful when you’re working with big datasets with lots of variables (columns). If you wanted to create a new dataset with only the variables life_expectancy, year, and country, you would do the following:
le <- gapminder %>%
select(life_expectancy, year, country)
head(le)
## life_expectancy year country
## 1 62.87 1960 Albania
## 2 47.50 1960 Algeria
## 3 35.98 1960 Angola
## 4 62.97 1960 Antigua and Barbuda
## 5 65.39 1960 Argentina
## 6 66.86 1960 Armenia
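select() can also drop or reorder columns. Here is a short sketch on a made-up data frame (toy is invented for illustration); it uses a minus sign to drop a column and the everything() helper to move one column to the front:

```r
library(dplyr)

# Toy data frame with made-up values
toy <- data.frame(
  country = c("A", "B"),
  year = c(2000L, 2001L),
  gdp = c(1.5, 2.5),
  population = c(10, 20)
)

# A minus sign drops a column and keeps everything else
no_gdp <- toy %>% select(-gdp)
print(names(no_gdp))

# everything() means "all remaining columns", handy for reordering
reordered <- toy %>% select(year, everything())
print(names(reordered))
```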
mutate()
What if we want to add a new column that divides gdp by population to generate an estimate of GDP per capita? mutate() can be used to create new variables (columns) that are the result of operations (+, *, -, etc.) on columns already in the dataset:
gapminder %>%
mutate(gdp_pc = gdp/population)
In the mutate() function, after you point to the data.frame you want to mutate (gapminder in our case), you need to name a new variable (gdp_pc), and then describe the operations you want to perform (division here). This returns a copy of the gapminder data.frame with a new column called gdp_pc. Look over in your Environment (top right). Does the gapminder variable include this new column? Nope! To keep the new column, you either need to overwrite your existing gapminder dataset (careful, this replaces the original) or create a new data.frame with the new gdp_pc column:
gapminder <- gapminder %>%
mutate(gdp_pc = gdp/population) # replaces original gapminder data.frame, so be careful!
new_gm <- gapminder %>%
mutate(gdp_pc = gdp/population) # better than overwriting the original!
If you only want to keep the new variables, use transmute():
transmute_example <- gapminder %>% transmute(gdp_pc = gdp/population)
head(transmute_example)
## gdp_pc
## 1 NA
## 2 1242.992
## 3 NA
## 4 NA
## 5 5253.501
## 6 NA
How about a more complicated example? What if we want to add a new variable to our dataset, say an indicator of whether or not a country is located in the continent of Africa? Here’s where mutate() really shines:
africa <- gapminder %>%
mutate(africa = ifelse(continent == "Africa", 1, 0))
glimpse(africa)
## Rows: 10,545
## Columns: 10
## $ country <fct> Albania, Algeria, Angola, Antigua and Barbuda, Arg...
## $ year <int> 1960, 1960, 1960, 1960, 1960, 1960, 1960, 1960, 19...
## $ infant_mortality <dbl> 115.40, 148.20, 208.00, NA, 59.87, NA, NA, 20.30, ...
## $ life_expectancy <dbl> 62.87, 47.50, 35.98, 62.97, 65.39, 66.86, 65.66, 7...
## $ fertility <dbl> 6.19, 7.65, 7.32, 4.43, 3.11, 4.55, 4.82, 3.45, 2....
## $ population <dbl> 1636054, 11124892, 5270844, 54681, 20619075, 18673...
## $ gdp <dbl> NA, 13828152297, NA, NA, 108322326649, NA, NA, 966...
## $ continent <fct> Europe, Africa, Africa, Americas, Americas, Asia, ...
## $ region <fct> Southern Europe, Northern Africa, Middle Africa, C...
## $ africa <dbl> 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,...
This creates a new data.frame called africa to which I’ve added a new variable called africa that is equal to 1 if the observation is located in the continent of Africa and 0 if it is not. The ifelse() function is quite useful; check it out using ?ifelse. Also, note glimpse(). This is the tidyverse response to head()… I actually like the glimpse() function much more and tend to use it when I inspect data. It lets you see all of the column names in one shot as well as other important info like variable class and dimensions.
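Here is a quick look at what ifelse() does on its own, outside of mutate(). The vector continent below is made up for illustration, and I slipped in an NA to show how missing values pass through:

```r
# A made-up vector of continents, including one missing value
continent <- c("Africa", "Europe", NA, "Africa")

# ifelse(test, yes, no) checks each element in turn:
# 1 where the test is TRUE, 0 where it is FALSE, NA where the test is NA
africa <- ifelse(continent == "Africa", 1, 0)
print(africa)
```

Note that the missing continent stays missing in the result rather than quietly becoming 0, which is usually what you want.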
arrange()
What if we wanted to sort the countries in our gapminder dataset in alphabetical order, starting with Albania? arrange() can do that! arrange() defaults to sorting alphabetically and/or in increasing order numerically.
gapminder %>%
arrange(country)
If you want to sort in descending alphabetical order (so start with Zimbabwe), you have to add desc():
gapminder %>%
arrange(desc(country))
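arrange() also accepts more than one column, using later columns to break ties in earlier ones. A small sketch on made-up data (toy is invented for illustration):

```r
library(dplyr)

# Toy data: two made-up countries observed in two years each
toy <- data.frame(
  country = c("B", "A", "B", "A"),
  year = c(2000, 2001, 2001, 2000)
)

# Sort by country (A to Z), then by year from newest to oldest within each country
sorted <- toy %>%
  arrange(country, desc(year))

sorted_countries <- as.character(sorted$country)
print(sorted)
```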
group_by() and summarize()
What if you want to summarize your data, e.g. compute the average life expectancy across the whole dataset? summarize() allows you to apply functions like mean() to a column (or groups of data as we’ll see below) and return a single value:
gapminder %>%
summarize(mean(life_expectancy))
## mean(life_expectancy)
## 1 64.81162
What if we want to return average life expectancy for each country? We can use the group_by() function to group data by country, then the summarize() function to summarize the average life expectancy for each group of country data. To do this, we need to use multiple dplyr functions in one go. We can do this using a pipe, which looks like this: %>%.
mle <- gapminder %>%
group_by(country) %>%
summarize(mean(life_expectancy))
head(mle)
## # A tibble: 6 x 2
## country `mean(life_expectancy)`
## <fct> <dbl>
## 1 Albania 72.3
## 2 Algeria 65.0
## 3 Angola 48.4
## 4 Antigua and Barbuda 71.5
## 5 Argentina 71.4
## 6 Armenia 71.3
This returns a new data.frame that lists each country and the mean(life_expectancy) for each country. If we want a nicer column name for this new average life expectancy data, we can specify this in the summarize() function:
mle2 <- gapminder %>%
group_by(country) %>%
summarize(mean_le = mean(life_expectancy), n = n())
head(mle2)
## # A tibble: 6 x 3
## country mean_le n
## <fct> <dbl> <int>
## 1 Albania 72.3 57
## 2 Algeria 65.0 57
## 3 Angola 48.4 57
## 4 Antigua and Barbuda 71.5 57
## 5 Argentina 71.4 57
## 6 Armenia 71.3 57
I also added the n() function, which just returns the number of observations in each group (in our case, the number of years observed for each country).
Did you notice that we didn’t have to list gapminder inside the summarize() function? This is because the pipe (%>%) feeds the data produced by group_by() into the summarize() function. This way, we can write really long pipes that feed data through a complex process without having to specify the dataset at each step. Let’s work through some examples.
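One thing worth knowing about n() is that it counts rows, missing or not. The sketch below uses a made-up data frame (toy, invented for illustration) to show a grouped summary that reports the mean, the row count, and the number of missing values side by side:

```r
library(dplyr)

# Toy data: made-up country "B" has one missing life expectancy value
toy <- data.frame(
  country = c("A", "A", "B", "B", "B"),
  life_expectancy = c(70, 72, 60, NA, 64)
)

by_country <- toy %>%
  group_by(country) %>%
  summarize(
    mean_le = mean(life_expectancy, na.rm = TRUE),  # mean of the non-missing values
    n = n(),                                        # rows in the group, NA included
    n_missing = sum(is.na(life_expectancy))         # how many of those rows are NA
  )

print(by_country)
```

Reporting n_missing next to the mean is one simple way to stay honest about how much data each average is actually built on.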
As with any language, once you know a few core verbs, you can get pretty far. Now that you are familiar with filter(), select(), mutate(), arrange(), group_by(), and summarize(), let’s try to have some basic conversations with the gapminder dataset that will help us understand global human development.
Which country has had the highest average gdp from 2000 to 2016? Which has had the lowest?
# highest GDP
gapminder %>%
filter(year %in% 2000:2016) %>% # 2000:2016 generates a vector of numbers from 2000 to 2016
group_by(country) %>%
summarize(avg_gdp = mean(gdp, na.rm = T)) %>%
arrange(desc(avg_gdp)) %>%
filter(row_number() == 1) # cool little trick that lets you pull out the first row, in our case, highest GDP
## # A tibble: 1 x 2
## country avg_gdp
## <fct> <dbl>
## 1 United States 1.10e13
# lowest GDP, note that to do this, I just dropped the desc() function in arrange()
gapminder %>%
filter(year %in% 2000:2016) %>%
group_by(country) %>%
summarize(avg_gdp = mean(gdp, na.rm = T)) %>%
arrange(avg_gdp) %>%
filter(row_number() == 1)
## # A tibble: 1 x 2
## country avg_gdp
## <fct> <dbl>
## 1 Kiribati 73446284.
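The arrange() plus filter(row_number() == 1) trick works fine, but recent versions of dplyr (1.0.0 and later, as far as I know) also offer slice_max() and slice_min() as a shortcut. Here is a sketch on made-up numbers (toy and its values are invented, not real GDP figures):

```r
library(dplyr)

# Toy summary table with made-up average GDP values
toy <- data.frame(
  country = c("A", "B", "C"),
  avg_gdp = c(5e9, 2e12, 7e7)
)

# Row with the largest avg_gdp
top <- toy %>% slice_max(avg_gdp, n = 1)

# Row with the smallest avg_gdp
bottom <- toy %>% slice_min(avg_gdp, n = 1)

top_country <- as.character(top$country)
bottom_country <- as.character(bottom$country)
print(top_country)     # B
print(bottom_country)  # C
```

One difference to keep in mind: by default these helpers keep ties, so you may get more than one row back if two countries share the same value.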
(Programming note: What’s cool here is that I can just copy-paste the code from above and add a mutate() function that creates our new gdp_pc variable.)
# highest GDP per capita
gapminder %>%
mutate(gdp_pc = gdp/population) %>%
filter(year %in% 2000:2016) %>%
group_by(country) %>%
summarize(avg_gdp_pc = mean(gdp_pc, na.rm = T)) %>%
arrange(desc(avg_gdp_pc)) %>%
filter(row_number() == 1)
## # A tibble: 1 x 2
## country avg_gdp_pc
## <fct> <dbl>
## 1 Luxembourg 51576.
# lowest GDP per capita
gapminder %>%
mutate(gdp_pc = gdp/population) %>%
filter(year %in% 2000:2016) %>%
group_by(country) %>%
summarize(avg_gdp_pc = mean(gdp_pc, na.rm = T)) %>%
arrange(avg_gdp_pc) %>%
filter(row_number() == 1)
## # A tibble: 1 x 2
## country avg_gdp_pc
## <fct> <dbl>
## 1 Congo, Dem. Rep. 95.7
Take a second to react to those very different numbers.
gapminder %>%
filter(year >= 2000) %>% # greater than or equal to
group_by(region) %>%
summarize(mle = mean(life_expectancy, na.rm = T)) %>%
arrange(desc(mle)) %>%
filter(row_number() == 1)
## # A tibble: 1 x 2
## region mle
## <fct> <dbl>
## 1 Australia and New Zealand 80.8
gapminder %>%
filter(year >= 2000) %>% # greater than or equal to
group_by(region) %>%
summarize(mle = mean(life_expectancy, na.rm = T)) %>%
arrange(mle) %>%
filter(row_number() == 1)
## # A tibble: 1 x 2
## region mle
## <fct> <dbl>
## 1 Southern Africa 52.0
gapminder %>%
filter(year > 2000) %>%
group_by(year) %>%
summarize(mfr = mean(fertility, na.rm = T))
## # A tibble: 16 x 2
## year mfr
## <int> <dbl>
## 1 2001 3.18
## 2 2002 3.14
## 3 2003 3.09
## 4 2004 3.06
## 5 2005 3.02
## 6 2006 2.99
## 7 2007 2.97
## 8 2008 2.94
## 9 2009 2.91
## 10 2010 2.89
## 11 2011 2.85
## 12 2012 2.82
## 13 2013 2.80
## 14 2014 2.77
## 15 2015 2.74
## 16 2016 NaN
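Wondering about that NaN in the last row? It most likely means every fertility value for 2016 is missing: once na.rm = T removes all of the NA values, there is nothing left to average. A minimal sketch of that behavior, using a made-up vector:

```r
# A made-up vector in which every value is missing
all_missing <- c(NA_real_, NA_real_)

# With na.rm = TRUE, R drops both values and averages an empty vector,
# which returns NaN ("Not a Number") rather than NA
result <- mean(all_missing, na.rm = TRUE)
print(result)
print(is.nan(result))
```

So NaN here is another missing-data flag, and a good reason to use filter(year < 2016) or to check the counts of non-missing values before trusting a yearly average.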
I hope you’re starting to get a sense of how powerful these tools are. Well get ready… looking at tables of numbers is fun and all, but nothing like visualizing data with graphs. See, for example, a tidyverse-generated visualization that could help us understand question 4:
This tutorial wrangles data with dplyr. You can also do this in base R, and it’s often quite useful to know how to do this. I recommend the learnR tutorial or these tutorials on data subsetting and data manipulation. Make sure you’re familiar with how to index and wrangle data in base R before we proceed!
To learn more about dplyr, I recommend this post as well as the pre-assignment readings written by the dplyr creator Hadley Wickham.
Note that since country holds text values (it is stored as a factor), "Sri Lanka" is in quotes. year, on the other hand, is an integer vector, so we can just write out the number without quotes.