Objectives

Get even more familiar with ggplot2 by making a killer violin plot.

Packages

Make sure the tidyverse suite of packages is installed on your machine and loaded in your R session. The tidyverse includes both the dplyr and ggplot2 packages.

library(tidyverse)

Our data

We’ll work with the 500 cities data again this week. You can read more about the data (and play around with the full dataset!) here.

health <- readRDS("./data/health.RDS")
glimpse(health)

## Observations: 500
## Variables: 10
## $ CityName         <fct> Abilene, Akron, Alameda, Albany, Albany, Albu...
## $ StateAbbr        <fct> TX, OH, CA, GA, NY, NM, VA, CA, TX, PA, TX, C...
## $ PopulationCount  <int> 117063, 199110, 73812, 77434, 97856, 545852, ...
## $ BingeDrinking    <dbl> 16.2, 14.8, 15.0, 10.9, 15.5, 14.5, 15.1, 12....
## $ Smoking          <dbl> 19.6, 26.8, 11.9, 23.7, 19.0, 18.8, 13.0, 12....
## $ MentalHealth     <dbl> 11.6, 15.3, 9.8, 16.2, 13.2, 11.6, 8.4, 10.1,...
## $ PhysicalActivity <dbl> 27.7, 31.0, 18.7, 33.1, 26.1, 20.4, 17.6, 24....
## $ Obesity          <dbl> 33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3, 18....
## $ PoorHealth       <dbl> 12.6, 15.5, 9.6, 17.4, 13.1, 12.1, 8.4, 11.4,...
## $ LowSleep         <dbl> 35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5, 38....

Here’s some additional information about each of the variables in the dataset:

CityName is the name of each of the 500 cities. Note that some names repeat (e.g. Albany, NY and Albany, GA)
StateAbbr is the abbreviation for the state in which the city is located, including Washington D.C.
PopulationCount is the total population in 2014 for each city.
BingeDrinking is the rate of binge drinking among adults aged >= 18 years.
Smoking is the rate of current smoking among adults aged >= 18 years.
MentalHealth is the rate of low mental health for more than 14 days among adults aged >= 18 years.
PhysicalActivity is the rate of low physical activity among adults aged >= 18 years.
Obesity is the rate of obesity among adults aged >= 18 years.
PoorHealth is the rate of low physical health for more than 14 days among adults aged >= 18 years.
LowSleep is the rate of sleeping less than 7 hours among adults aged >= 18 years.
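
Because some city names repeat, it can be worth checking which ones do before treating CityName as a unique identifier. Here is a minimal sketch of that check. It uses a small stand-in data frame built from the first five rows visible in the glimpse() output above (health_head is a hypothetical name, and only three of the ten columns are included); on the real data you would use health instead.

```r
library(dplyr)

# Stand-in for `health`: the first five rows shown in the glimpse() output
health_head <- data.frame(
  CityName  = c("Abilene", "Akron", "Alameda", "Albany", "Albany"),
  StateAbbr = c("TX", "OH", "CA", "GA", "NY"),
  Smoking   = c(19.6, 26.8, 11.9, 23.7, 19.0)
)

# City names that appear more than once
dup_names <- health_head %>%
  count(CityName) %>%
  filter(n > 1)
dup_names

# City and state together should identify each row
n_unique_pairs <- health_head %>%
  distinct(CityName, StateAbbr) %>%
  nrow()
n_unique_pairs == nrow(health_head)
```

If the last line returns TRUE, the combination of CityName and StateAbbr is safe to use as a row identifier.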

One classy violin plot

This week, we’ll continue working on becoming professional-ggplotters by creating a really awesome violin plot. Violin plots are like box-and-whisker plots in that they let us quickly visually compare a single variable across groups. They are even cooler than box-and-whisker plots (I know, I know, how is it POSSIBLE to be cooler than our friends B&W!?) because not only do they show the center and spread of a variable, they also show us the shape of the distributions.

Say we’re working for the CDC and interested in allocating lots of money to states in the US to address smoking. Let’s pick four states to compare using filter() and visualize the distributions of smoking rates among cities in each state using a violin plot (geom_violin()):

p <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +  
  geom_violin(aes(x = StateAbbr, y = Smoking))  
p

This plot shows us a bit more information than a box-and-whisker plot, since it also gives us a sense of the shape of the city-level distribution in each state. We’re missing, however, the visual cues about the median and spread (quartiles) of each state’s smoking rates that box-and-whisker plots provide by default. Let’s add those on top of our plot object p:

p + geom_violin(aes(x = StateAbbr, y = Smoking), draw_quantiles = c(0.25, 0.5, 0.75))

Nice! Another way to do this is:

p + geom_boxplot(aes(x = StateAbbr, y = Smoking), width = 0.2)
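
For a numeric check on what either version draws, the quartiles can also be tabulated directly. This is only a sketch: the data frame toy below is simulated (the values are made up, not taken from the 500 cities data) so that the block runs on its own, and the column names n_cities, q25 and q75 are my own choices. With the real data, you would pipe the filtered health data into the same group_by() and summarise() calls.

```r
library(dplyr)

# Simulated stand-in for the filtered `health` data (values are made up)
set.seed(42)
toy <- data.frame(
  StateAbbr = rep(c("FL", "ID", "NY", "UT"), times = c(30, 5, 10, 8)),
  Smoking   = round(runif(53, min = 8, max = 25), 1)
)

# One row per state: number of cities, quartiles and median of Smoking
state_quartiles <- toy %>%
  group_by(StateAbbr) %>%
  summarise(
    n_cities = n(),
    q25      = quantile(Smoking, 0.25, names = FALSE),
    median   = median(Smoking),
    q75      = quantile(Smoking, 0.75, names = FALSE)
  )
state_quartiles
```

The n_cities column is a useful reminder that a violin drawn from only a handful of cities should be read with some caution.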

I’m going to stick with the first version because I think it looks much better. Ok, let’s update the colors of the violin plots. I’ll also update the legend title, reorder the violins from high to low, and update the axis labels and title (see last week’s tutorial for an explanation of this):

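# Start again from the filtered data so the plain violin layer already in p
# is not drawn underneath the new, filled one
p2 <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +
  geom_violin(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking, fill = StateAbbr), draw_quantiles = c(0.25, 0.5, 0.75)) +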
  guides(fill=guide_legend(title="U.S. States")) +
  xlab("") +
  ylab("Smoking rates") +
  labs(title = "Rates of current smoking among adults aged 18 and over", subtitle = "500 Cities Dataset")
p2
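
If reorder() combined with desc() is unfamiliar, here is a minimal sketch of what it does to the factor levels, which is what controls the left-to-right order on the x-axis. The values below are made up for illustration. By default, reorder() sorts levels by the mean of the second argument within each level, from low to high; wrapping the values in desc() flips that to high to low.

```r
library(dplyr)  # for desc()

# Toy values, made up for illustration (two "cities" per state)
state   <- factor(c("UT", "UT", "FL", "FL", "NY", "NY", "ID", "ID"))
smoking <- c(9, 11, 14, 16, 18, 20, 16, 18)

# Default factor levels are alphabetical
levels(state)
# "FL" "ID" "NY" "UT"

# Levels sorted by mean smoking rate, highest first
reordered <- reorder(state, desc(smoking))
levels(reordered)
# "NY" "ID" "FL" "UT"
```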

This looks OK, but I’d like to remove the gray background:

p3 <- p2 + theme_bw()
p3

Finally, I’d like to move the legend to the bottom:

p4 <- p3 + theme(legend.position = "bottom")
p4

Here’s all the code needed to reproduce this figure:

final_violin <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +  
  geom_violin(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking, fill = StateAbbr), draw_quantiles = c(0.25, 0.5, 0.75)) +
  guides(fill=guide_legend(title="U.S. States")) +
  xlab("") +
  ylab("Smoking rates") +
  labs(title = "Rates of current smoking among adults aged 18 and over", subtitle = "500 Cities Dataset") +
  theme_bw() +
  theme(legend.position = "bottom") 

final_violin

Now wouldn’t it ALSO be really really cool to add the actual data points to this plot?? Let’s do it!

final_violin +
  geom_point(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking))

In Florida, there are so many cities that it’s a bit hard to distinguish between points. Let’s “jitter” the points: jittering adds a small amount of random noise to each point’s position to prevent over-plotting. In this case, it moves the data points to the left and right so we can better see distinct cities.

final_violin +
  geom_jitter(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking))

That’s a bit too much jitter. We can reduce the width of the jitter like so:

final_violin +
  geom_jitter(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking), width=0.05)
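
Two optional refinements are sketched below. Because jitter is random, the points land in slightly different places each time the plot is drawn, and by default geom_jitter() also nudges points a little vertically. Passing position_jitter() with a seed and height = 0 to geom_point() makes the plot reproducible and leaves the smoking rates themselves untouched. The data frame toy is simulated (made-up values) so the block runs on its own, and the seed and alpha values are arbitrary choices.

```r
library(ggplot2)

# Simulated stand-in data (values are made up)
set.seed(1)
toy <- data.frame(
  StateAbbr = rep(c("FL", "NY", "UT"), each = 20),
  Smoking   = round(rnorm(60, mean = 17, sd = 3), 1)
)

jittered <- ggplot(toy, aes(x = StateAbbr, y = Smoking)) +
  geom_violin() +
  geom_point(
    # Horizontal jitter only, with a fixed seed so the plot is reproducible
    position = position_jitter(width = 0.05, height = 0, seed = 123),
    alpha = 0.6  # semi-transparent points make overlaps easier to see
  )
jittered
```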

Saving your figures

When working with ggplot2, save figures using the ggsave() function. It’s easiest to store your ggplot in a variable first and pass it to ggsave(); if you don’t supply a plot, ggsave() saves the last plot that was displayed:

my_sweet_plot <- ggplot(data = my_data) +
  geom_point(aes(x = YEAR, y = VALUE))

# The file name comes first; pass the plot by name
ggsave("./myfigure.png", plot = my_sweet_plot)

You can read more about writing your figures to other types of files here.
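
It usually helps to set the figure size and resolution explicitly rather than relying on the size of your plotting window. A minimal sketch, assuming the mpg dataset that ships with ggplot2 as a stand-in for your own data; the object names, the temporary output path and the dimensions are all arbitrary choices for illustration:

```r
library(ggplot2)

# `mpg` ships with ggplot2 and stands in for your own data here
my_plot <- ggplot(data = mpg) +
  geom_point(aes(x = displ, y = hwy))

# Hypothetical output path; a temporary folder keeps the sketch self-contained
out_file <- file.path(tempdir(), "myfigure.png")

ggsave(
  filename = out_file,
  plot     = my_plot,
  width    = 6,    # inches by default
  height   = 4,
  dpi      = 300
)
```

ggsave() picks the file type from the extension, so changing "myfigure.png" to "myfigure.pdf" would write a PDF instead.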

Additional Resources

More on creative ways to visualize data distributions here
Secrets of a happy graphing life
The ggplot2 cheatsheet
Chapters 5 - 12 in this online textbook provide a great overview of when to use different visualization strategies.

Fancy Violin Plots