ggplot2 by making a killer violin plot.
Make sure the tidyverse suite of packages is installed on your machine and loaded in your R session. The tidyverse includes both the dplyr and ggplot2 packages.
library(tidyverse)
We’ll work with the 500 cities data again this week. You can read more about the data (and play around with the full dataset!) here.
health <- readRDS("./data/health.RDS")
glimpse(health)
## Observations: 500
## Variables: 10
## $ CityName <fct> Abilene, Akron, Alameda, Albany, Albany, Albu...
## $ StateAbbr <fct> TX, OH, CA, GA, NY, NM, VA, CA, TX, PA, TX, C...
## $ PopulationCount <int> 117063, 199110, 73812, 77434, 97856, 545852, ...
## $ BingeDrinking <dbl> 16.2, 14.8, 15.0, 10.9, 15.5, 14.5, 15.1, 12....
## $ Smoking <dbl> 19.6, 26.8, 11.9, 23.7, 19.0, 18.8, 13.0, 12....
## $ MentalHealth <dbl> 11.6, 15.3, 9.8, 16.2, 13.2, 11.6, 8.4, 10.1,...
## $ PhysicalActivity <dbl> 27.7, 31.0, 18.7, 33.1, 26.1, 20.4, 17.6, 24....
## $ Obesity <dbl> 33.7, 37.3, 18.7, 40.4, 31.1, 25.5, 23.3, 18....
## $ PoorHealth <dbl> 12.6, 15.5, 9.6, 17.4, 13.1, 12.1, 8.4, 11.4,...
## $ LowSleep <dbl> 35.4, 44.1, 32.3, 46.9, 39.7, 32.8, 34.5, 38....
Here’s some additional information about each of the variables in the dataset:
CityName is the name of each of the 500 cities. Note that some names repeat (e.g. Albany, NY and Albany, GA).
StateAbbr is the abbreviation for the state in which the city is located, including Washington D.C.
PopulationCount is the total population in 2014 for each city.
BingeDrinking is the rate of binge drinking among adults aged >= 18 years.
Smoking is the rate of current smoking among adults aged >= 18 years.
MentalHealth is the rate of low mental health for more than 14 days among adults aged >= 18 years.
PhysicalActivity is the rate of low physical activity among adults aged >= 18 years.
Obesity is the rate of obesity among adults aged >= 18 years.
PoorHealth is the rate of low physical health for more than 14 days among adults aged >= 18 years.
LowSleep is the rate of sleeping less than 7 hours among adults aged >= 18 years.
This week, we’ll continue working on becoming professional-ggplotters by creating a really awesome violin plot. Violin plots are like box-and-whisker plots in that they let us quickly visually compare a single variable across groups. They are even cooler than box-and-whisker plots (I know, I know, how is it POSSIBLE to be cooler than our friends B&W!?) because not only do they show the center and spread of a variable, they also show us the shape of the distributions.
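To see why the shape matters, here is a minimal sketch using simulated data (not the 500 cities data). The data frame `sim` and its two groups are made up for illustration: one group has a single peak, the other mixes two clusters of values.

```r
library(ggplot2)

set.seed(42)  # make the simulated values reproducible

# Made-up data: both groups are centered near 20, but "two_peaks"
# mixes values clustered around 15 with values clustered around 25.
sim <- data.frame(
  group = rep(c("one_peak", "two_peaks"), each = 200),
  value = c(
    rnorm(200, mean = 20, sd = 7),
    rnorm(100, mean = 15, sd = 1.5),
    rnorm(100, mean = 25, sd = 1.5)
  )
)

# Same data, two geoms
box_plot <- ggplot(sim) + geom_boxplot(aes(x = group, y = value))
violin_plot <- ggplot(sim) + geom_violin(aes(x = group, y = value))

box_plot     # boxes summarize center and spread only
violin_plot  # violins also trace the shape of each distribution
```

With data like this, the boxes should look broadly alike while the violins differ, since only the violin can show that one group has two clusters.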
Say we’re working for the CDC and interested in allocating lots of money to states in the US to address smoking. Let’s pick four random states to compare using filter() and visualize the distributions of smoking rates among cities in each state using a violin plot (geom_violin()):
p <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +
  geom_violin(aes(x = StateAbbr, y = Smoking))
p
This plot shows us a bit more information than a box-and-whisker plot, since it also gives us a sense of the shape of city-level distributions in each state in our dataset. We’re missing, however, the visual cues about the median and spread (quartiles) of each state’s smoking rates that box-and-whisker plots provide by default. Let’s add those on top of our plot object p:
p + geom_violin(aes(x = StateAbbr, y = Smoking), draw_quantiles = c(0.25, 0.5, 0.75))
Nice! Another way to do this is:
p + geom_boxplot(aes(x = StateAbbr, y = Smoking), width = 0.2)
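If you want the numbers behind those lines, a quick cross-check is to compute the quartiles yourself. The sketch below uses a hypothetical stand-in, `health_toy`, with invented values and the same column names as `health`; swap in the real data to use it. One caveat, as I read the ggplot2 documentation: draw_quantiles marks quantiles of the density estimate, so its lines can sit slightly off the sample quartiles that geom_boxplot() draws.

```r
library(dplyr)

# Hypothetical stand-in for `health`: invented values, same column names
health_toy <- data.frame(
  StateAbbr = c("UT", "UT", "UT", "UT", "FL", "FL", "FL", "FL", "FL"),
  Smoking   = c(9, 10, 11, 12, 15, 17, 19, 21, 23)
)

# Sample quartiles of city-level smoking rates, one row per state
smoking_quartiles <- health_toy %>%
  group_by(StateAbbr) %>%
  summarise(
    n_cities = n(),
    q25 = quantile(Smoking, 0.25),
    median = median(Smoking),
    q75 = quantile(Smoking, 0.75)
  )
smoking_quartiles
```

The n_cities column is worth a glance too: a violin drawn from only a handful of cities is a rough estimate of shape at best.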
I’m going to stick with the first version because I think it looks much better. Ok, let’s update the colors of the violin plots. I’ll also update the legend title, reorder the violins from high to low smoking rates, and update the axis labels and title (see last week’s tutorial for an explanation of this):
p2 <- p +
  geom_violin(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking, fill = StateAbbr), draw_quantiles = c(0.25, 0.5, 0.75)) +
  guides(fill = guide_legend(title = "U.S. States")) +
  xlab("") +
  ylab("Smoking rates") +
  labs(title = "Rates of smoking among adults aged 18 and over", subtitle = "500 Cities Dataset")
p2
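As an aside on the reorder() call above: by default, reorder() sorts the factor levels by the mean of the second argument within each level, and wrapping that argument in desc() flips the order. Here is a small sketch with a made-up data frame (`toy`, with invented state codes and values) that shows the effect on the levels:

```r
library(dplyr)  # for desc()

# Made-up data: three fictional states, two cities each
toy <- data.frame(
  StateAbbr = factor(c("AA", "AA", "BB", "BB", "CC", "CC")),
  Smoking   = c(10, 12, 20, 22, 15, 17)
)

# Default factor order is alphabetical
levels(toy$StateAbbr)

# Order levels by mean Smoking, highest first
by_mean_desc <- reorder(toy$StateAbbr, desc(toy$Smoking))
levels(by_mean_desc)

# Same idea, but ordering by the median instead of the mean
by_median_desc <- reorder(toy$StateAbbr, -toy$Smoking, FUN = median)
levels(by_median_desc)
```

Ordering by the median may be the more natural choice here, since the median is the middle line the violins display.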
This looks OK, but I’d like to remove the gray background:
p3 <- p2 + theme_bw()
p3
Finally, I’d like to move the legend to the bottom:
p4 <- p3 + theme(legend.position = "bottom")
p4
Here’s all the code needed to reproduce this figure:
final_violin <- health %>%
  filter(StateAbbr %in% c("UT", "FL", "NY", "ID")) %>%
  ggplot() +
  geom_violin(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking, fill = StateAbbr), draw_quantiles = c(0.25, 0.5, 0.75)) +
  guides(fill = guide_legend(title = "U.S. States")) +
  xlab("") +
  ylab("Smoking rates") +
  labs(title = "Rates of smoking among adults aged 18 and over", subtitle = "500 Cities Dataset") +
  theme_bw() +
  theme(legend.position = "bottom")
final_violin
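A side note on the legend: since the x-axis already labels each state, you might prefer to remove the legend entirely instead of moving it. Here is a minimal sketch of that option, again using an invented stand-in data frame (`health_toy`) so that it runs on its own:

```r
library(ggplot2)

# Hypothetical stand-in for the filtered `health` data (invented values)
health_toy <- data.frame(
  StateAbbr = rep(c("UT", "FL", "NY"), each = 5),
  Smoking   = c(8, 9, 10, 11, 12, 15, 17, 18, 20, 22, 13, 14, 16, 17, 19)
)

no_legend <- ggplot(health_toy) +
  geom_violin(aes(x = StateAbbr, y = Smoking, fill = StateAbbr)) +
  theme_bw() +                      # complete theme first...
  theme(legend.position = "none")   # ...then tweak it
no_legend
```

The order of the last two lines matters: theme_bw() is a complete theme, so adding it after theme(legend.position = ...) would overwrite the legend setting.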
Now wouldn’t it ALSO be really really cool to add the actual data points to this plot?? Let’s do it!
final_violin +
  geom_point(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking))
In Florida, there are so many cities that it’s a bit hard to distinguish between points. Let’s “jitter” the points. Jittering adds a small amount of random noise to the position of each point to prevent over-plotting. In this case, it nudges the data points to the left and right so we can better see distinct cities.
final_violin +
  geom_jitter(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking))
That’s a bit too much jitter. We can reduce the width of the jitter like so:
final_violin +
  geom_jitter(aes(x = reorder(StateAbbr, desc(Smoking)), y = Smoking), width = 0.05)
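Two things to keep in mind with jitter: the noise is random, so the points land somewhere slightly different every time you redraw the plot, and unless you set height yourself, geom_jitter() also adds a little vertical noise. Here is a sketch of one way to pin both down with position_jitter(). The data frame `health_toy` is an invented stand-in and the seed value is arbitrary.

```r
library(ggplot2)

# Hypothetical stand-in for the filtered `health` data (invented values)
health_toy <- data.frame(
  StateAbbr = rep(c("UT", "FL", "NY"), each = 5),
  Smoking   = c(8, 9, 10, 11, 12, 15, 17, 18, 20, 22, 13, 14, 16, 17, 19)
)

jittered <- ggplot(health_toy, aes(x = StateAbbr, y = Smoking)) +
  geom_violin() +
  geom_point(
    # width: horizontal noise; height = 0: leave the smoking rates untouched;
    # seed: draw the same noise on every render
    position = position_jitter(width = 0.05, height = 0, seed = 123)
  )
jittered
```

Setting height = 0 means each point still sits at its true smoking rate, which matters if readers will read values off the y-axis.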
When working with ggplot2, save figures using the ggsave() function. The file name is the first argument. By default ggsave() saves the last plot you displayed, but it’s safer to store your ggplot in a variable and pass it explicitly with the plot argument:
my_sweet_plot <- ggplot(data = my_data) +
  geom_point(aes(x = YEAR, y = VALUE))
ggsave("./myfigure.png", plot = my_sweet_plot)
You can read more about writing your figures to other types of files here.
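As a rough sketch of what that can look like: ggsave() picks the file type from the extension of the file name, and the width, height, units and dpi arguments control the size and resolution of the output. The plot, the data and the file locations below are made up for illustration (the files are written to a temporary folder).

```r
library(ggplot2)

# Made-up data and plot, just to have something to save
toy_plot <- ggplot(data.frame(YEAR = 2010:2014, VALUE = c(3, 5, 4, 6, 7))) +
  geom_point(aes(x = YEAR, y = VALUE))

# Write to a temporary folder; replace with your own paths
png_path <- file.path(tempdir(), "myfigure.png")
pdf_path <- file.path(tempdir(), "myfigure.pdf")

# The extension decides the format; size is given in inches here
ggsave(png_path, plot = toy_plot, width = 6, height = 4, units = "in", dpi = 300)
ggsave(pdf_path, plot = toy_plot, width = 6, height = 4, units = "in")
```

Setting the width and height explicitly keeps the saved figure the same size no matter how large your plot window happens to be.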