ggplot2
ggplot2
.ggplot2
look awesome.Make sure the tidyverse
suite of packages are installed on your machine and loaded in your R
session. The tidyverse
includes both the dplyr
and ggplot2
packages.
library(tidyverse)
You may also need:
library(RColorBrewer)
library(wesanderson)
library(viridis)
library(dichromat)
library(ggthemes)
Using your new dplyr
skills to tidy your data prior to graphing will make the whole graphing process significantly easier. Make sure all of your variables live together peacefully in a tidy data.frame
before you start plotting! Today, I’ve done the tidying for you, but in the future, you’ll need to be sure to clean things up before you dive into visualization.
This week we will work with the 500 Cities dataset. Here’s a brief introduction to this data:
“The 500 Cities project is a collaboration between CDC, the Robert Wood Johnson Foundation, and the CDC Foundation. The purpose of the 500 Cities Project is to provide city- and census tract-level small area estimates for chronic disease risk factors, health outcomes, and clinical preventive service use for the largest 500 cities in the United States. These small area estimates will allow cities and local health departments to better understand the burden and geographic distribution of health-related variables in their jurisdictions, and assist them in planning public health interventions.”
The project website has fantastic mapping and visualization tools, so go check it out. You can find specific information about each of the health variables in the dataset here. I’ve cleaned and subset the original large dataset, and this clean version is what we will work with today. First, let’s load the data. I’ve already tidyed the data and saved the data as a .RDS
file using saveRDS()
. RDS
files are an R-specific file structure that stores objects just as they were in R
- i.e., if the variable you write out is a data.frame
, then the variable you read back in will also be a data.frame
. Let’s load the .RDS
file using readRDS()
and use the glimpse()
function from the tidyverse
set of packages to inspect this dataset:
health <- readRDS("./data/health.RDS")
glimpse(health)
## Observations: 500
## Variables: 10
## $ CityName <fct> Abilene, Akron, Alameda, Albany, Albany, Albu...
## $ StateAbbr <fct> TX, OH, CA, GA, NY, NM, VA, CA, TX, PA, TX, C...
## $ PopulationCount <int> 117063, 199110, 73812, 77434, 97856, 545852, ...
## $ BingeDrinking <dbl> 16.2, 14.8, 15.0, 10.9, 15.5, 14.5, 15.1, 12....
## $ Smoking <dbl> 19.6, 26.8, 11.9, 23.7, 19.0, 18.8, 13.0, 12....
## $ MentalHealth <dbl> 11.6, 15.3, 9.8, 16.2, 13.2, 11.6, 8.4, 10.1,...
## $ PhysicalActivity <dbl> 27.7, 31.0, 18.7, 33.1, 26.1, 20.4, 17.6, 24....
## $ Obesity <dbl> 33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3, 18....
## $ PoorHealth <dbl> 12.6, 15.5, 9.6, 17.4, 13.1, 12.1, 8.4, 11.4,...
## $ LowSleep <dbl> 35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5, 38....
Here’s some additional information about each of the variables in the dataset:
CityName
is the name of each of the 500 cities. Note that some names repeat (e.g. Albany, NY and Albany, GA)StateAbbr
is the abbreviation for the state in which the city is located, including Washington D.C.PopulationCount
is the total population in 2014 for each city.BingeDrinking
is the rate of binge drinking among adults aged >= 18 years.Smoking
is the rate of current smoking among adults aged >= 18 years.MentalHealth
is the rate of low mental health for more than 14 days among adults aged >= 18 years.PhysicalActivity
is the rate of low physical activity among adults aged >= 18 years.Obesity
is the rate of obesity among adults aged >= 18 years.PoorHealth
is the rate of low physical health for more than 14 days among adults aged >= 18 years.LowSleep
is the rate of sleeping less than 7 hours among adults aged >= 18 years.Before we dive into visualizing the “center” of variables using ggplot2
, let’s review how we compute statistics of center in R
. Say we want to compute the mean, median, and mode of the vector x
I have created below:
x <- c(1, 2, 2, 3, 4, 5, 5, 5, 6)
R
makes finding a vector’s mean, median, and mode easy:
mean(x)
## [1] 3.666667
median(x)
## [1] 4
mode(x)
## [1] "numeric"
Wait, why does the mode()
function return "numeric"
? Type ?mode
in the console. The mode()
function tells us the storage mode of an object (i.e. "interger", "double", "character"
). Let’s make our own function to compute the mode of a vector:
compute_mode <- function(v) {
uniq <- unique(v) # identifies unique values of x
uniq[which.max(tabulate(match(v, uniq)))] # counts the number of instances of each value
}
compute_mode(x)
## [1] 5
That’s better! Note that if a variable contains missing data…
x_NA <- c(1, 2, 2, 3, 4, 5, 5, 5, 6, NA)
mean(x_NA)
## [1] NA
median(x_NA)
## [1] NA
…then mean()
and median()
will return NA
. This tells us that our variable contains missing data. We can tell R
to compute these statistics while ignoring the missing data by adding the argument na.rm=T
to the mean()
and median()` function.
mean(x, na.rm=T)
## [1] 3.666667
median(x, na.rm=T)
## [1] 4
Our compute_mode()
function will only return NA
when NA
is the most frequently occurring value in our variable:
compute_mode(x_NA)
## [1] 5
x_MEGA_NA <- c(1, 2, 2, 3, 4, NA, NA, NA, 6)
compute_mode(x_MEGA_NA)
## [1] NA
Let’s take a second to discuss when and why you should use each of these statistics. Mean is a non-resistant statistic, which means it is easily influenced by extreme values, or outliers. Median is a resistant statistic, so it is a better choice when you have extreme values in your data. First, let’s calculate mean and median without extreme values:
x <- c(1, 2, 2, 3, 4, 5, 5, 5, 6)
print(paste("Here's the mean with no outliers:", mean(x)))
## [1] "Here's the mean with no outliers: 3.66666666666667"
print(paste("Here's the median with no outliers:", median(x)))
## [1] "Here's the median with no outliers: 4"
Mean and median aren’t that different. And now with an extreme value of 60
added to the vector:
x <- c(1, 2, 2, 3, 4, 5, 5, 5, 60)
print(paste("Here's the mean with an outlier:", mean(x)))
## [1] "Here's the mean with an outlier: 9.66666666666667"
print(paste("Here's the median with an outlier:", median(x)))
## [1] "Here's the median with an outlier: 4"
The high value of 60
pulls up the mean of the dataset, however the mode is unchanged.
We can use these same functions to take the mean, median, and mode or larger variables, such as the columns in our health
dataset. Say, for example, you want to know the mean, median, and mode of populations in the 500 cities in the dataset:
print(paste("Mean: ", mean(health$PopulationCount)))
## [1] "Mean: 206041.616"
print(paste("Median: ", median(health$PopulationCount)))
## [1] "Median: 106106"
print(paste("Mode: ", compute_mode(health$PopulationCount)))
## [1] "Mode: 106433"
Note that the mean is much higher than the median and mode. This is because of several high population outliers in our dataset. We can quickly spot these outliers by looking at the top cities in terms of population:
health %>%
arrange(desc(PopulationCount)) %>%
slice(1:10) # this extracts the first 10 rows
## CityName StateAbbr PopulationCount BingeDrinking Smoking
## 1 New York NY 8175133 15.5 16.8
## 2 Los Angeles CA 3792621 15.1 15.6
## 3 Chicago IL 2695598 18.7 19.1
## 4 Houston TX 2099451 14.9 16.4
## 5 Philadelphia PA 1526006 17.5 25.5
## 6 Phoenix AZ 1445632 14.5 19.6
## 7 San Antonio TX 1327407 15.7 15.2
## 8 San Diego CA 1307402 17.3 13.1
## 9 Dallas TX 1197816 15.2 17.9
## 10 Honolulu HI 953207 19.0 15.5
## MentalHealth PhysicalActivity Obesity PoorHealth LowSleep
## 1 12.6 28.8 26.3 13.1 41.0
## 2 13.0 25.1 26.7 14.7 37.5
## 3 12.1 27.6 31.3 13.5 37.4
## 4 11.4 29.7 33.9 12.9 36.0
## 5 15.4 29.0 33.9 15.2 44.3
## 6 13.1 24.1 30.1 13.8 35.6
## 7 10.8 29.5 32.9 12.9 36.8
## 8 11.0 21.0 22.2 11.0 34.6
## 9 11.9 30.5 34.3 13.8 34.9
## 10 8.9 22.2 22.1 8.8 48.0
High population cities like New York pull up the mean of the PopulationCount
variable.
We can create visualizations that, say, plot the mean of key variables across states, but this collapses multiple observations (i.e. several cities within a state) to a single value. It does not visualize the variability (or spread) of the observations within each state. Let’s look at some cool ways to visualize both the center and the spread of variables…
Once we know the central tendency of each variable, it’s good to explore how far the other values of the variable are from the central value. This is known as the spread of the data. The simplest indicator of the spread of a variable is the range, or the difference between the maximum and minimum values of the variable. In the example above with our vector x
, the range would be the highest value, 6
minus the lowest value, 1
, or 5
.
Let’s plot the range of the variable LowSleep
in Utah, New York, California, and Tennessee against the average value of LowSleep
across all 500 cities:
health %>%
filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "WY", "FL")) %>%
ggplot() + # pipe filtered data into ggplot() function
geom_point(aes(x = StateAbbr, y = LowSleep), size = 2, alpha = 0.4) + # add point geom
geom_hline(yintercept=mean(health$LowSleep), color = "red") + # add horizontal line geom
ylab("% adults sleeping less than 7 hours") + # change y-axis label
xlab("") + # change x-axis label
ggtitle("500 Cities Health Outcomes") # add a title
This figure indicates that for all of the cities located in the state of New York, the percent of sampled adults sleeping less than 7 hours was well above the average of the sample for the entire country. It also shows us that we only have one observation for the state of Wyoming (Cheyenne).
A frequently-used indicator of a variable’s spread is the variance. The variance of a sampled variable is computed as follows:
\[ \sigma^2 = \frac{\Sigma (X - \bar{X})^2}{n-1} \]
…where \(X\) is the observed value, \(\bar{X}\) is the mean of the sample, and n
is the number of observations in the sample. This statistic gives us a sense of how much a variable deviates from its mean (\(X - \bar{X}\)). The square root of the variance is the standard deviation, or:
\[ \sigma= \sqrt{\frac{\Sigma (X - \bar{X})^2}{n-1}} \]
Like the variance, the standard deviation tells us how far observations in a dataset are from the mean value. A low standard deviation indicates that most of the observations are close to the average value; a high standard deviation indicates that the numbers are spread out away from the average value of the variable. Let’s visualize the mean and standard deviation of the variable BingeDrinking
in a subset of states:
health %>%
filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "WY", "FL")) %>%
group_by(StateAbbr) %>% # group data by state and compute mn/sd
summarize(meanBD = mean(BingeDrinking), sdBD = sd(BingeDrinking)) %>%
ggplot() + # pipe data into ggplot()
geom_point(aes(x = StateAbbr, y = meanBD)) + # add point at meanBD
geom_errorbar(aes(x = StateAbbr,
ymin = meanBD - sdBD,
ymax = meanBD + sdBD), width = 0.2) + # add error bars representing sd
xlab("") +
ylab("Binge drinking rates") +
ggtitle("500 Cities Health Outcomes")
Here, we manually compute the mean (mean()
) and standard deviation (sd()
) of a grouped subset of our data and then use a combination of ggplot2
tools to create a visualization of the mean (point) and spread (lines) of our data. We are unable to compute a standard deviation for Wyoming since there’s only one observation in the state.
Remember that estimates based on small populations tend to show more variance than estimates based on large populations. Let’s create a funnel plot of our population versus values of BingeDrinking
to illustrate this:
ggplot(health) +
geom_point(aes(x = PopulationCount, y = BingeDrinking), alpha = 0.3) +
xlab("Population") +
ylab("Rate of binge drinking") +
ggtitle("500 Cities Health Outcomes")
Notice how as population increases, rates of binge drinking stabilize between 14 and 17 percent. Cities with populations less than 250,000 have much more variability in binge drinking rates.
Now that you’ve reviewed basic statistics of center and spread, let’s move on to some cool visualization tools that best help us see amounts, distributions, and proportions. In future classes, we’ll also focus on visualizing spatial patterns, change through time, and associations between variables.
The easiest way to visualize the spread of many variables at once is to create a box and whisker plot. These plots visualize the median, the first and third quartiles (the 25th and 75th percentiles), and any outliers that fall beyond the 95th percentile of the data in one elegant plot. They also automatically group data and compute summary statistics, so you can skip that step in your code!
health %>%
filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "WY", "FL")) %>%
ggplot() +
geom_boxplot(aes(x = StateAbbr, y = BingeDrinking, fill = StateAbbr)) +
xlab("") +
ylab("Rate of binge drinking among adults")
The median of each distribution is shown with the horizontal dark line. The box around the median represents the 25% to 75% quartiles in the distribution, and the thin vertical lines represent the area describing 95% of the distribution. See that dot in the Florida column? That’s an outlier, or a city whose rate of binge drinking falls outside of the 95% distribution for the dataset. This city has abnormally low rates of binge drinking based on the distribution of city observations for the state of Florida. We can figure out which city this is using our dplyr
skills:
health %>%
filter(StateAbbr == "FL") %>%
filter(BingeDrinking == min(BingeDrinking))
## CityName StateAbbr PopulationCount BingeDrinking Smoking
## 1 Miami Gardens FL 107167 12.9 21.9
## MentalHealth PhysicalActivity Obesity PoorHealth LowSleep
## 1 15.3 31.4 35.5 15.2 43.7
A very quick way to visualize the shape of a variable is to plot a histogram of that variable. You can do this using either the base R
hist()
function or using ggplot()
.
hist(health$LowSleep, breaks = 100, xlab = "Low sleep rates", ylab = "Frequency", main = "Histogram of low sleep rates")
Changing the breaks()
argument changes the number of bins into which the variable values are placed. Try it!
You can also build a histogram using ggplot()
as follows:
ggplot(data = health) +
geom_histogram(aes(x = LowSleep), bins = 50) +
geom_vline(xintercept=mean(health$LowSleep), color = "red") +
geom_vline(xintercept=median(health$LowSleep), color = "blue") +
xlab("Rate of respondents sleeping less than 7 hours") +
ylab("Frequency") +
ggtitle("500 Cities Health Outcomes")
Using geom_vline()
, I added a vertical red line showing the mean value of LowSleep
and a blue line showing the median value. Does it make sense to you that the median value is slightly lower than the mean?
Histograms are cool and all, but what if you want to visualize MULTIPLE distributions at once? Enter the violin plot:
health %>%
filter(StateAbbr %in% c("FL", "NY", "CA", "AR")) %>%
ggplot(aes(x = StateAbbr, y = BingeDrinking)) +
geom_violin(aes(fill = StateAbbr)) +
geom_boxplot(width = 0.1) +
xlab("") +
ylab("Rate of binge drinking") +
ggtitle("500 Cities Health Outcomes") +
theme(legend.title = element_blank()) # this removes the StateAbbr title on the legend, try building the plot without it!
This slices data into quartiles like a box and whisker plot, but also shows us the shape of the distribution! Cool! Note too that you can put all of the aesthetic information in the ggplot()
function if it will be the same across multiple geometries (geom_violin()
and geom_boxplot()
). This way you don’t have to type the aes(...)
for each geometry.
Let’s go back to our box and whisker plot and start playing with color and aesthetics to make our plot truly awesome:
bwplot <- health %>%
filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "FL")) %>%
ggplot() +
geom_boxplot(aes(x = StateAbbr, y = BingeDrinking, fill = StateAbbr)) +
xlab("") +
ylab("Rate of binge drinking among adults")
bwplot
We know how to add titles (ggtitle()
) and axis labels (xlab()
and ylab()
) to our plots. What if we also want to do the following:
StateAbbr
(the variable names).Let’s work through each of these aesthetic changes one-by-one. Note that the best way to figure out how to change some specific ggplot
aesthetic is to ask our friend the internet. To make changing the aesthetic easier, I’ve dumped our plot into a variable called bwplot
. I can continue adding things to this plot using +
to change the plot contents. See below!
Changing the legend title is as easy as:
bwplot +
guides(fill=guide_legend(title="U.S. States"))
What if we want to remove the legend altogether?
bwplot + theme(legend.position="none")
You can read everything you ever wanted to know about ggplot
legends here. This includes more on how to move the legend, change the background of the legend, and alter the elements in the legend.
You can change everything about the axis labels, from font type to font size. This website provides a great overview of altering axes.
The basic formula for changing axis ticks is:
# x axis tick mark labels
bwplor + theme(axis.text.x= element_text(family, face, colour, size))
# y axis tick mark labels
bwplot + theme(axis.text.y = element_text(family, face, colour, size))
Where:
So imagine we want to increase the size of the numbers on the axis and tilt these numbers at 45 degrees. Let’s also make them blue using the hex-code for aqua.
bwplot + theme(axis.text.x= element_text(face = "bold", colour = "#00FFFF", size = 14, angle = 45))
Ugly, but if we wanted to do the same for the y-axis, we’d just change axis.text.x
to axis.text.y
. We can also hide axis tick marks as follows:
bwplot + theme(axis.text.x= element_blank())
If you review the material on this website you can also learn how to change axis lines, tick resolution, tick mark labels, and the order of items in your plot.
Here’s a full overview of how to edit the title and labels in ggplot
. I’ll overview some of the “best of” below. The basic formula for altering these elements of your plot is:
# main title
bwplot + theme(plot.title = element_text(family, face, colour, size))
# x axis title
bwplot + theme(axis.title.x = element_text(family, face, colour, size))
# y axis title
bwplot + theme(axis.title.y = element_text(family, face, colour, size))
Where:
Here’s an example in which we make the title crazy:
bwplot +
ggtitle("This. Plot. Is. On. Fireeee.") +
theme(plot.title = element_text(color = "orange", size = 20, face = "bold.italic", vjust = 10))
You can remove axis labels as follows:
bwplot + theme(axis.title.y = element_blank())
If we want to reorder the box and whiskers from high to low drinking rates, we need to go back into the dplyr
part of our plot and use the reorder()
function that takes the category to reorder (StateAbbr
) and the other variable by which to reorder (BingeDrinking
):
health %>%
filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "FL")) %>%
ggplot() +
geom_boxplot(aes(x = reorder(StateAbbr, desc(BingeDrinking)), y = BingeDrinking, fill = StateAbbr)) +
xlab("") +
ylab("Rate of binge drinking among adults")
A great overview of color in R can be found here. We’ll focus mostly on playing with color using ggplot2
, but if you want to learn more about color manipulation using base R, check out this post.
What if we want to override the default color palette assigned to our box and whiskers by ggplot
? We can do this manually by selecting the appropriate hex-codes for colors in the plot. I like this website for finding colors:
bwplot +
scale_fill_manual(values = c("blue", "red", "orange", "#6C443B", "#A93FD3"))
# note for points you need to use scale_color_manual
We can also change the hue of the color by altering the lightness (l
) and the chroma (c
, intensity of the color):
bwplot + scale_fill_hue(l = 40, c = 35)
# note for points you need to use scale_color_manual
If you’re like me and not very good at manually selecting colors, you can rely on a color palette already built in R
. One of the go-to packages for color manipulation in R is the RColorBrewer
package. Be sure it’s installed on your machine before you proceed!
library(RColorBrewer)
display.brewer.all()
The RColorBrewer
package contains three general types of palettes:
factors
)Once you find a palette you like, you can visualize it as follows:
display.brewer.pal(n=8, name="Dark2")
Here, the n
attribute is the number of discrete colors you want in the palette.
Let’s add a palette we dig to our plot:
bwplot + scale_fill_brewer(palette = "Dark2")
If you’re a big Wes Anderson fan, then try this:
library(wesanderson)
names(wes_palettes)
## [1] "BottleRocket1" "BottleRocket2" "Rushmore1" "Rushmore"
## [5] "Royal1" "Royal2" "Zissou1" "Darjeeling1"
## [9] "Darjeeling2" "Chevalier1" "FantasticFox1" "Moonrise1"
## [13] "Moonrise2" "Moonrise3" "Cavalcanti1" "GrandBudapest1"
## [17] "GrandBudapest2" "IsleofDogs1" "IsleofDogs2"
wes_palette("Zissou1")
wes_palette("Darjeeling1")
bwplot + scale_fill_manual(values = wes_palette(n=5, name="Darjeeling1"))
If you’re a bit more boring and really want a gray-scale plot, that’s easy too:
bwplot + scale_fill_grey()
In recent years, there has been growing awareness about using palettes that are both color blind friendly and that transfer well to gray-scale (i.e. when converted to black and white).
library(viridis)
bwplot + scale_fill_viridis(discrete=T, option="magma")
Color options include magma
, plasma
, viridis
, inferno
, and cividis
.
The dichromat
package also contains color schemes suitable to folks who are color blind:
library(dichromat)
names(dichromat::colorschemes)
## [1] "BrowntoBlue.10" "BrowntoBlue.12"
## [3] "BluetoDarkOrange.12" "BluetoDarkOrange.18"
## [5] "DarkRedtoBlue.12" "DarkRedtoBlue.18"
## [7] "BluetoGreen.14" "BluetoGray.8"
## [9] "BluetoOrangeRed.14" "BluetoOrange.10"
## [11] "BluetoOrange.12" "BluetoOrange.8"
## [13] "LightBluetoDarkBlue.10" "LightBluetoDarkBlue.7"
## [15] "Categorical.12" "GreentoMagenta.16"
## [17] "SteppedSequential.5"
ggplot
Everything you ever wanted to know about themes found here but let’s play with a few quickly.
theme_bw()
remove the gray background:
bwplot +
theme_bw()
bwplot +
theme_minimal()
bwplot +
theme_dark()
If you want to go theme-crazy, install the ggthemes
package:
library(ggthemes)
bwplot +
theme_tufte()
# The Economist
bwplot +
theme_economist()
# Wall Street Journal
bwplot +
theme_wsj()
You can see all ggthemes here. You can also explore the many many ways you can amp up your ggplotting skills (animations anyone?) here.
health %>%
filter(StateAbbr %in% c("UT", "NY", "CA", "TN", "FL")) %>%
ggplot() +
geom_boxplot(aes(x = reorder(StateAbbr, desc(BingeDrinking)), y = BingeDrinking, fill = StateAbbr)) +
labs(title = "Rates of binge drinking among adults", subtitle = "500 Cities Dataset") +
guides(fill=guide_legend(title="U.S. States")) +
ylab("Drinking rates") +
xlab("") +
scale_fill_manual(values = wes_palette(n=5, name="Darjeeling1")) +
theme_bw()
When working with ggplot2
, save figures using the ggsave()
function. Note that to use this function, you’ll need to dump your ggplot into a variable:
my_sweet_plot <- ggplot(data = my_data) +
geom_point(aes(x = YEAR, y = VALUE))
ggsave(my_sweet_plot, "./myfigure.png")
You can read more about writing your figures to a other types of files here.