Use ggplot to visualize your data from the gapminder dataset.
I’m assuming you’re already familiar with the dplyr tools discussed in the previous chapter. I’m also assuming you’ll be really impressed by ggplot by the end of this tutorial.
We’ll be working with the gapminder dataset again in this chapter, so make sure you’ve installed the dslabs package and loaded it. We’ll also need the tidyverse, which contains our old friend dplyr and our soon-to-be friend ggplot2:
library(tidyverse)
library(dslabs)
data(gapminder)
Remember that our gapminder dataset includes the following variables:
country
year
infant_mortality, or infant deaths per 1000
life_expectancy, or life expectancy
fertility, or the average number of children per woman
population, or country population
gdp, or GDP according to the World Bank
continent
region, or geographical region
Let’s imagine that we are a team of influential data scientists who have been asked by the United Nations to create visual summaries of global differences in population, life expectancy, and GDP per capita. The UN wants to use this information to highlight global inequality and to justify important funding decisions that will ultimately impact people’s lives. Let’s work together to create a couple of powerful visualizations that illustrate important differences through time and across space in these important indicators of well-being.
ggplot2
Manipulating data.frames with dplyr is all fine and good, but the fun (yes, I’m telling you, this stuff can be fun!) really starts when you visualize your data. Yet again, the tidyverse dominates with a powerful package called ggplot2. ggplot2 makes it easy for you to create beautiful data visualizations. Check out the ggplot gallery if you don’t believe me! This lab will give you a very short introduction to ggplot2. We’ll use this package quite a bit and eventually learn how to plot spatial data with ggplot2, so pay attention!
Let’s start by plotting data from a single country. Let’s say we’re interested in visualizing how life expectancy has changed from the 1950s to present in the United States. Well, with dplyr it’s now easy for us to pull out data for the U.S. from our larger data.frame and assign it to a new, smaller data.frame called us.
us <- gapminder %>%
filter(country == "United States") # keep only the rows where country is the United States
Easy! Now to plot this. When plotting with ggplot you start by calling the ggplot() function. This creates a blank plot to which you can add data. The first argument of the ggplot() function is the dataset you want to plot. In our case, this is the us data.frame we just built:
ggplot(data=us)
For now the plot is blank because we haven’t told ggplot2 how to deal with the data, i.e. what to put on the x and y axes and what type of graph to create (point, line, bar, etc.). We can add additional layers to the plot by using the + symbol. Each layer provides more information about how we’d like to plot the data. Say we want to plot points indicating life expectancy through time. We can use the geom_point() function to plot points:
ggplot(data=us) +
geom_point(mapping = aes(x = year, y = life_expectancy))
You can get a full list of the types of plots at this website under the Layer:geoms section. We’ll work with quite a few in this class. The geom_point() geometry function takes a mapping argument in which we specify the aesthetics (aes) and indicate which variable we’d like to plot on the x axis (year) and which we’d like to plot on the y axis (life_expectancy). I know, this isn’t the most elegant way to do this, but once you get past the mapping/aesthetic specifications, adding additional detail is very easy. Unlike dplyr, which uses pipes (%>%), ggplot uses layering, symbolized with a plus sign (+). Let’s add a few new details to our plot to see how layering works. It could use better axis labels and a clear title. It could also be nice to change the color of the points to make them stand out a bit more:
ggplot(data=us) +
geom_point(mapping = aes(x = year, y = life_expectancy), color = "blue") +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy in the US")
Much better. Depending on what we want to plot, we can change the geometry object we use. If we want a smoothed line rather than dots, we could replace geom_point() with geom_smooth():
ggplot(data=us) +
geom_smooth(mapping = aes(x = year, y = life_expectancy), color = "blue") +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy in the US")
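A side note on what geom_smooth() draws: it shows a fitted curve rather than the raw data, and by default it also shades a confidence band around that curve. Here is a minimal sketch, using a made-up data frame (toy, with invented values that are not from gapminder), of asking for a straight-line fit with method = "lm" and switching the band off with se = FALSE:

```r
library(ggplot2)

# Toy data, invented purely for illustration (not from gapminder)
toy <- data.frame(
  year  = seq(1960, 2010, by = 10),
  value = c(60, 63, 65, 69, 72, 74)
)

toy_plot <- ggplot(data = toy) +
  geom_point(mapping = aes(x = year, y = value)) +
  geom_smooth(mapping = aes(x = year, y = value),
              method = "lm",     # straight-line fit instead of the default smoother
              formula = y ~ x,   # stated explicitly so ggplot2 does not print a message
              se = FALSE,        # hide the shaded confidence band
              color = "blue")

toy_plot
```

Whether a straight line or the default smoother suits your data better is a judgment call; the point is only that the smoothing method is something you can choose.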
Another cool trick is that we can pipe (%>%) our dplyr manipulations straight into a ggplot(). Let’s try this with the U.S.
gapminder %>%
filter(country == "United States") %>%
ggplot() + # switch to layering with the plus sign once you call the ggplot() function
geom_smooth(mapping = aes(x = year, y = life_expectancy), color = "blue") +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy in the US")
If we wanted to make a similar plot for another country, we’d only need to change that small part of the code. Let’s make a life expectancy plot for Sierra Leone using this approach:
gapminder %>%
filter(country == "Sierra Leone") %>% # <-- just change country name!
ggplot() +
geom_smooth(mapping = aes(x = year, y = life_expectancy), color = "blue") +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy")
What if we wanted to compare life expectancy in the United States with life expectancy in Sierra Leone in a single plot?
gapminder %>%
filter(country %in% c("Sierra Leone", "United States")) %>%
ggplot() +
geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy")
First, look at the data and react to what this means in the real world. Life expectancy in the US has consistently remained 20 to 30 years higher than in Sierra Leone over the last half-century.
Second, remember, if you want to keep rows that match any of several values, use %in% rather than ==. This selects all rows with country equal to one of the countries in the vector we created using c(). Since our data.frame contains information from two countries, we can add an argument to the aes() mapping in geom_smooth() that tells ggplot() to group observations by country and to symbolize them using two different colors (color = country). Alternatively, we could use the argument group = country, which would create two separate lines of the same color. Try it!
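To see why %in% matters here, the sketch below compares the two operators on a tiny invented data frame (toy and wanted are hypothetical names, and the rows are made up). With ==, R compares the column to the vector element by element, recycling the vector, so rows can be dropped silently:

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- data.frame(
  country = c("Sierra Leone", "United States", "United States", "Sierra Leone"),
  year    = c(1960, 1960, 1970, 1970)
)
wanted <- c("Sierra Leone", "United States")

# %in% asks, for each row, "is this country one of the wanted ones?"
with_in <- toy %>% filter(country %in% wanted)

# == compares row 1 to wanted[1], row 2 to wanted[2], row 3 to wanted[1], ...
with_eq <- toy %>% filter(country == wanted)

nrow(with_in)  # 4: every row is kept
nrow(with_eq)  # 2: rows 3 and 4 are lost because of the recycling order
```

No error or warning is raised in the second case, which is what makes the mistake easy to miss.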
What if we wanted to add our original points to this visualization? We could just add back the geom_point() geometry we used earlier:
gapminder %>%
filter(country %in% c("Sierra Leone", "United States")) %>%
ggplot() +
geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
geom_point(mapping = aes(x = year, y = life_expectancy)) +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy")
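Notice that x and y are spelled out in both geometries above. As a sketch of an alternative (on an invented toy data frame, not gapminder), aesthetics placed in the ggplot() call itself are inherited by every layer, and each layer can still add its own on top:

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  year    = rep(c(1960, 1980, 2000), times = 2),
  value   = c(30, 40, 50, 70, 74, 77),
  country = rep(c("A", "B"), each = 3)
)

shared_plot <- ggplot(data = toy, mapping = aes(x = year, y = value)) +
  geom_line(mapping = aes(color = country)) +  # inherits x and y, adds color
  geom_point()                                 # inherits x and y only

shared_plot
```

Both styles produce the same kind of plot; the per-layer style used in this chapter is more explicit, while the shared style saves typing when many layers use the same variables.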
And what if we wanted to do something really cool like, say, change the size of the points to reflect the country’s population in each year?
gapminder %>%
filter(country %in% c("Sierra Leone", "United States")) %>%
ggplot() +
geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
geom_point(mapping = aes(x = year, y = life_expectancy, size = population)) +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy")
Every ggplot you will ever make (including maps) follows a similar logic. Start by calling the ggplot() function. Tell ggplot() which dataset you want to visualize. Then slowly add (+) the geometries and other information you want to visualize.
Go ahead and say it, that last viz was U-G-L-Y. In what follows, I’m going to show you some of my favorite tricks to improve the quality of visualizations made in ggplot. Let’s revisit one of our plots above. I’m going to save this plot in an object (p). This means if/as we want to change the plot, we simply have to call the plot and add additional layers! Much easier than re-copying the code a million times!
p <- gapminder %>%
filter(country %in% c("Sierra Leone", "United States")) %>%
ggplot() +
geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
geom_point(mapping = aes(x = year, y = life_expectancy)) +
labs(x = "Year",
y = "Life expectancy",
title = "Life expectancy")
p
Ok, there are a few things that could be improved here:
The ggplot gray square background.
One of my favorite things about ggplot is the ability to use themes. A theme sets the grid marks, grid color, axis label text and size, and other parameters. The default theme in ggplot is OK, but I’m not a fan of the big gray background. Changing your theme is a very easy way to change your plot. Everything you ever wanted to know about themes can be found here, but let’s play with a few quickly.
theme_bw() removes the gray background:
p +
theme_bw()
p +
theme_minimal()
p +
theme_dark()
If you want to go theme-crazy (YES), install the ggthemes package:
library(ggthemes)
p +
theme_tufte()
I love me some Tufte. To read more about his work, including his pared-down redesign of the box and whisker plot we’ll learn about next week (the plot itself was introduced by John Tukey), check this out.
# The Economist
p +
theme_economist()
You can see all ggthemes here. You can also explore the many many ways you can amp up your ggplotting skills (animations anyone?) here.
Moving forward, I’ll go with theme_bw(). Let’s now shift to altering our ugly legend. Changing the legend title is as easy as:
p +
theme_bw() +
labs(color = "Country")
Note that if your legend refers to something other than color (size, alpha, fill), then you’d just use that aesthetic instead of color (so in the labs() function, add an argument like fill = "NAME OF FILL VARIABLE"). What if we just want to drop the legend title, since it’s fairly clear that we’re talking about countries here:
p +
theme_bw() +
theme(legend.title = element_blank())
Finally, if you’re a little OCD about your visualizations, you can drop the gray background in the legend with the following:
p +
theme_bw() +
theme(legend.title = element_blank()) +
guides(color=guide_legend(override.aes=list(fill=NA)))
Nice. Notice that if you look at ALL of the code used to make p, it can be very overwhelming. IF, however, you think of this as a layered graphic and approach things (with some help from StackExchange) one layer at a time, you can easily make fairly complex graphics.
Now what if we want to remove the legend altogether?
p +
theme_bw() +
theme(legend.position="none")
You can read everything you ever wanted to know about ggplot legends here and also here. This includes more on how to move the legend, change the background of the legend, and alter the elements in the legend.
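As a quick illustration of moving the legend, here is a minimal sketch on an invented toy data frame. It relies on the same legend.position theme setting used above with "none", which also accepts "top", "bottom", "left", and "right":

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  year    = rep(c(1960, 1980, 2000), times = 2),
  value   = c(30, 40, 50, 70, 74, 77),
  country = rep(c("A", "B"), each = 3)
)

legend_plot <- ggplot(data = toy) +
  geom_line(mapping = aes(x = year, y = value, color = country)) +
  theme_bw() +
  theme(legend.position = "bottom")  # place the legend under the plot

legend_plot
```

Putting the legend at the bottom or top often leaves more horizontal room for the plot itself, which helps with long time series.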
Ok, let’s update our plot with the new legend that drops the “Country” title:
# this replaces the original p object with a new object with our legend edits
p <- p +
theme_bw() +
theme(legend.title = element_blank()) +
guides(color=guide_legend(override.aes=list(fill=NA)))
You can change everything about the axis labels, from font type to font size. This website provides a great overview of altering axes.
The basic formula for changing axis ticks is:
# x axis tick mark labels
p + theme(axis.text.x= element_text(family, face, colour, size))
# y axis tick mark labels
p + theme(axis.text.y = element_text(family, face, colour, size))
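Where:
family is the font family
face is the font face ("plain", "italic", "bold", or "bold.italic")
colour is the text color
size is the text size in points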
So imagine we want to make the numbers on the x-axis bold, increase their size, and tilt them at 45 degrees. Let’s also color them aqua (a light blue-green) using its hex code.
p + theme(axis.text.x= element_text(face = "bold", colour = "#00FFFF", size = 14, angle = 45))
Ugly, but if we wanted to do the same for the y-axis, we’d just change axis.text.x to axis.text.y. We can also hide the axis tick labels as follows:
p + theme(axis.text.x= element_blank())
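One detail worth knowing: ggplot2 themes treat the tick labels (the numbers) and the tick marks (the small lines) as separate elements. axis.text.x controls the labels, while axis.ticks.x controls the marks. A minimal sketch on an invented toy data frame, hiding each in turn:

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(year = c(1960, 1980, 2000), value = c(60, 68, 75))

base_plot <- ggplot(data = toy) +
  geom_point(mapping = aes(x = year, y = value))

# Hide the numbers along the x axis, keep the small tick marks
no_labels <- base_plot + theme(axis.text.x = element_blank())

# Hide the small tick marks along the x axis, keep the numbers
no_marks <- base_plot + theme(axis.ticks.x = element_blank())

no_labels
no_marks
```

To remove both, pass the two arguments in the same theme() call.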
If you review the material on this website you can also learn how to change axis lines, tick resolution, tick mark labels, and the order of items in your plot.
Here’s a full overview of how to edit the title and labels in ggplot. I’ll overview some of the “best of” below. The basic formula for altering these elements of your plot is:
# main title
p + theme(plot.title = element_text(family, face, colour, size))
# x axis title
p + theme(axis.title.x = element_text(family, face, colour, size))
# y axis title
p + theme(axis.title.y = element_text(family, face, colour, size))
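Where family, face, colour, and size set the font family, font face, text color, and text size (in points).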
Here’s an example in which we make the title crazy:
p +
labs(title = "This. Plot. Is. On. Fireeee.") +
theme(plot.title = element_text(color = "orange", size = 20, face = "bold.italic"))
You can remove axis labels as follows:
p + theme(axis.title.y = element_blank(),
axis.title.x = element_blank())
Now, to actually edit the text content of titles and axis labels, I use the labs() function:
p +
labs(x = "XAXIS LABEL HERE",
y = "YAXIS LABEL HERE",
title = "TITLE HERE",
subtitle = "Nice, subtitle here",
caption = "Oh wow, caption here!") # this is also where you can add legend titles for size, color, fill etc...
A great overview of color in R can be found here. We’ll focus mostly on playing with color using ggplot2, but if you want to learn more about color manipulation using base R, check out this post.
What if we want to override the default color palette that ggplot assigned to our lines? We can do this manually by selecting the appropriate hex-codes for colors in the plot:
p +
scale_color_manual(values = c("#6C443B", "#A93FD3"))
What’s that crazy #6C...? That’s a HEX code. It’s a code that tells your computer the exact color you want to use. Of course, you can also use basic color words like “orange” or “yellow”, but sometimes HEX codes are more fun. I like this website for finding colors.
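If you’re curious what is inside a HEX code, it is three two-digit hexadecimal numbers, one each for red, green, and blue. A small sketch using base R’s col2rgb() and rgb() functions to take apart the first color from the plot above and put it back together (channels and hex_again are just illustrative variable names):

```r
# Split the hex code into its red, green, and blue parts (each from 0 to 255)
channels <- col2rgb("#6C443B")
channels
# red = 108, green = 68, blue = 59

# Go the other way: build a hex code from the three channel values
hex_again <- rgb(108, 68, 59, maxColorValue = 255)
hex_again
# "#6C443B"
```

So 6C is 108 in hexadecimal, 44 is 68, and 3B is 59: a brownish color with more red than green or blue.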
If you’re like me and not very good at manually selecting colors, you can rely on a ready-made color palette in R. One of the go-to packages for color manipulation in R is the RColorBrewer package. Be sure it’s installed on your machine before you proceed!
library(RColorBrewer)
display.brewer.all()
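The RColorBrewer package contains three general types of palettes:
Sequential palettes, for ordered data that run from low to high
Diverging palettes, for data that spread out in both directions from a meaningful midpoint
Qualitative palettes, for categorical data (i.e. factors)
Once you find a palette you like, you can visualize it as follows: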
display.brewer.pal(n=8, name="Dark2")
Here, the n argument is the number of discrete colors you want in the palette.
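Besides displaying a palette, you can pull its colors out as HEX codes with brewer.pal() and hand-pick the ones you want. A minimal sketch, assuming RColorBrewer is installed and using an invented toy data frame (the choice of the first and third colors is arbitrary):

```r
library(RColorBrewer)
library(ggplot2)

# The eight colors of the Dark2 palette, as a character vector of HEX codes
dark2 <- brewer.pal(n = 8, name = "Dark2")
dark2

# Toy data, invented for illustration only
toy <- data.frame(
  year    = rep(c(1960, 1980, 2000), times = 2),
  value   = c(30, 40, 50, 70, 74, 77),
  country = rep(c("A", "B"), each = 3)
)

# Hand-pick two of the palette colors for the two groups
picked_plot <- ggplot(data = toy) +
  geom_line(mapping = aes(x = year, y = value, color = country)) +
  scale_color_manual(values = dark2[c(1, 3)])

picked_plot
```

This gives you the convenience of a designed palette while keeping control over exactly which colors end up in the plot.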
Let’s add a palette we dig to our plot:
p +
scale_color_brewer(palette = "Dark2")
This takes the first two colors from that palette and applies them to our groups. The sky is the limit when it comes to colors and ggplot. For example, if you’re a big Wes Anderson fan, then try this:
library(wesanderson)
names(wes_palettes)
## [1] "BottleRocket1" "BottleRocket2" "Rushmore1" "Rushmore"
## [5] "Royal1" "Royal2" "Zissou1" "Darjeeling1"
## [9] "Darjeeling2" "Chevalier1" "FantasticFox1" "Moonrise1"
## [13] "Moonrise2" "Moonrise3" "Cavalcanti1" "GrandBudapest1"
## [17] "GrandBudapest2" "IsleofDogs1" "IsleofDogs2"
wes_palette("Darjeeling1")
If you’re a bit more boring and really want a gray-scale plot, that’s easy too:
p +
scale_color_grey()
In recent years, there has been growing awareness about using palettes that are both color blind friendly and that transfer well to gray-scale (i.e. when converted to black and white).
library(viridis)
p +
scale_color_viridis(discrete = TRUE, option = "magma") # spell out TRUE; T can be overwritten
Color options include magma, plasma, viridis, inferno, and cividis.
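If you’d rather not load an extra package, ggplot2 itself ships viridis scales: scale_color_viridis_d() for discrete variables and scale_color_viridis_c() for continuous ones. A minimal sketch on an invented toy data frame, assuming ggplot2 version 3.0 or later:

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  year    = rep(c(1960, 1980, 2000), times = 2),
  value   = c(30, 40, 50, 70, 74, 77),
  country = rep(c("A", "B"), each = 3)
)

viridis_plot <- ggplot(data = toy) +
  geom_line(mapping = aes(x = year, y = value, color = country)) +
  scale_color_viridis_d(option = "magma")  # "_d" because country is discrete

viridis_plot
```

With only two groups, the scale picks colors from the two ends of the palette, so one of them can be very light; the begin and end arguments of the scale let you trim the range if that is a problem.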
The dichromat package also contains color schemes suitable to folks who are color blind:
library(dichromat)
names(dichromat::colorschemes)
## [1] "BrowntoBlue.10" "BrowntoBlue.12" "BluetoDarkOrange.12"
## [4] "BluetoDarkOrange.18" "DarkRedtoBlue.12" "DarkRedtoBlue.18"
## [7] "BluetoGreen.14" "BluetoGray.8" "BluetoOrangeRed.14"
## [10] "BluetoOrange.10" "BluetoOrange.12" "BluetoOrange.8"
## [13] "LightBluetoDarkBlue.10" "LightBluetoDarkBlue.7" "Categorical.12"
## [16] "GreentoMagenta.16" "SteppedSequential.5"
I’ve found this color-blindness simulator helpful when thinking about palette choices.
Let’s wrap this all up in one lovely plot that includes some additional countries:
gapminder %>%
filter(country %in% c("Sierra Leone", "United States", "Italy", "Nigeria", "India")) %>%
ggplot() +
geom_smooth(mapping = aes(x = year, y = life_expectancy, color = country)) +
geom_point(mapping = aes(x = year, y = life_expectancy), alpha = 0.2) + # alpha makes points transparent
labs(x = "",
y = "Life expectancy",
title = "Life expectancy",
caption = "Source: gapminder dataset (dslabs package)") +
theme_minimal() +
theme(legend.title = element_blank()) +
guides(color=guide_legend(override.aes=list(fill=NA))) +
scale_color_manual(values = wes_palette(n=5, name="Darjeeling1"))
When working with ggplot, save figures using the ggsave() function. It’s easiest to first store your ggplot in a variable and pass it to ggsave() (otherwise ggsave() saves the last plot displayed):
my_sweet_plot <- ggplot(data = my_data) +
geom_point(aes(x = YEAR, y = VALUE))
ggsave("./myfigure.png", plot = my_sweet_plot) # the file name comes first, then the plot
You can read more about writing your figures to other types of files here.
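For figures headed into a report, you will usually want to control the size and resolution as well. A minimal sketch using ggsave()’s width, height, units, and dpi arguments; the toy data, the file name, and the 6 by 4 inch size are all arbitrary choices for illustration, and the file is written to a temporary folder so nothing in your project is overwritten:

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(year = c(1960, 1980, 2000), value = c(60, 68, 75))

toy_plot <- ggplot(data = toy) +
  geom_point(mapping = aes(x = year, y = value))

# Write to a temporary folder; swap in your own path when saving for real
out_file <- file.path(tempdir(), "toy_figure.png")

ggsave(filename = out_file,
       plot     = toy_plot,
       width    = 6,      # figure width
       height   = 4,      # figure height
       units    = "in",   # width and height are in inches
       dpi      = 300)    # resolution, in dots per inch

file.exists(out_file)
```

ggsave() picks the file type from the extension, so changing ".png" to ".pdf" in the file name is enough to write a PDF instead.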
Get as close as you can to re-creating this visualization from the gapminder dataset in the dslabs package: