In this tutorial, we’ll learn how to use linear regression (ordinary least squares) in R to predict the values of one numerical variable from the values of other variables. Let’s get started!
You’ll need the following:
library(tidyverse)
Before we dive into linear regression, let’s step back and remember our more general discussion of models last week. Remember that statistical models are ultimately about estimating a function \(f\) that best fits our data. Said differently, models are just tools for extracting patterns out of our data. The ways in which we define \(f\) depend both on the nature of our data and on the question we are asking. This week, we are going to focus on how we can use statistical models to answer or understand inferential questions, that is, models that seek to understand relationships between variables.
Remember that while traditional statistics courses tend to focus on models for inference, or for confirming that a hypothesis is true, in this class we are focusing more on models for exploration than confirmation. This distinction is important. To confirm a hypothesis, you must use data that is independent of the data you used to generate that hypothesis. Think of it this way1: each observation in your data can be used for either exploration or confirmation, but not both. You can use an observation as many times as you like when exploring the data, but only once for confirmation.2 There’s nothing wrong with exploration, like what we’re doing in this class, but you should never claim that an exploratory analysis is confirmatory.
Also, don’t forget the terminology we’ll use throughout the rest of the class as we discuss statistical models: explanatory variables, response variables, parameters, and point estimates.
Say we have the following data, where our explanatory variable (temp) is average seasonal temperature in degrees Celsius and our response variable (yield) is the yield of corn on a farmer’s fields in bushels per acre. We’re interested in understanding how changes in temperature affect agricultural productivity, especially since we know that temperatures are very likely to change as climate changes in the future.3
temp <- c(21.7, 22, 23.2, 22.3, 23.3, 24.7, 25.1, 26.5, 27.8, 27.1, 28.2, 29.7)
yield <- c(122.3, 124.6, 137.8, 130.1, 142.3, 140.2, 134.4, 145.1, 144.2, 143.3, 150.6, 146.1)
df <- data.frame(temp, yield)
# If you haven't already, check out the DT package for fancy interactive tables in your RMarkdown!
library(DT)
datatable(df)
Let’s start by visualizing our two variables, \(temp\) and \(yield\):
ggplot(data = df) +
geom_point(aes(x = temp, y = yield)) +
theme_bw() +
ylab("Corn yields (bushels/acre)") +
xlab("Average seasonal temperature (°C)")
Remember that each point in this scatterplot is a row in our data.frame with coordinates (\(temp, yield\)). At first glance, it looks like there is a positive relationship between \(temp\) and \(yield\). This means that, in general, as \(temp\) increases, \(yield\) also increases, which makes sense… in general, warmer temperatures bring higher corn yields.4 This also means that if we compute the correlation between \(temp\) and \(yield\), we should get a positive number. Try it!
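Here’s one way to do that, using the df we built above:

# correlation between temperature and yield; a positive value
# confirms the upward trend we see in the scatterplot
cor(df$temp, df$yield)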
Like correlation, linear regression measures the linear relationship between two variables, but there are some key differences. Remember that correlation measures the association between two variables (think scatterplot). Regression, on the other hand, fits a function through data to predict how changes in one variable might affect another variable. Linear regression assumes that this function is a straight line. You’ve probably seen the basic formula for linear regression before:
\[ y = \beta_0 + \beta_1 x \]
Here, \(\beta_0\) and \(\beta_1\) are the two model parameters that describe the function (the intercept and slope of the line in this case, though parameters increase in complexity as your model increases in complexity). We write their point estimates as \(b_0\) and \(b_1\). The variables \(x\) and \(y\) are our observations; \(x\) is often referred to as the explanatory variable and \(y\) as the response variable.
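Putting the pieces together, the line we estimate from our data is written using the point estimates:
\[ \hat{y}_i = b_0 + b_1 x_i \]
where \(\hat{y}_i\) is the model’s prediction for observation \(i\).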
If we run a linear regression where we predict how an increase in our explanatory variable \(x\) affects our response variable \(y\), we are essentially finding the line that best fits the scatterplot shown above. We can do this using the lm() function in R. The main inputs to this function are the formula, expressed as y ~ x, and the data.frame:
lm(yield ~ temp, df)
##
## Call:
## lm(formula = yield ~ temp, data = df)
##
## Coefficients:
## (Intercept) temp
## 68.622 2.777
The lm() function has now defined a line that best fits our data. This line has an intercept (\(b_0\)) of 68.62 and a slope (\(b_1\)) of 2.78. We’ll come back to this function and how we interpret its output. First, let’s visualize what this looks like and talk a bit more about what the lm() function is actually doing.
lr1 <- lm(yield ~ temp, df)
df <- transform(df, fitted = fitted(lr1)) # this pulls out the predicted values from the linear regression and adds them to the dataframe
ggplot(data = df, aes(x = temp, y = yield)) +
geom_point(color = "red") +
geom_smooth(method = "lm", se = FALSE) +
theme_bw() +
ggtitle("Linear regression line")
Technically, the lm() function minimizes the sum of the squares of the residuals between the observed and predicted values. What!? Let’s break this down. First, what’s a residual? A residual is the difference between your observation (a row in the data.frame) and the expected value predicted by the model. We write this as follows:
\[ e_i = y_i - \hat{y_i} \]
Here, \(e_i\) is our residual, or the difference between our actual observation (\(y_i\)) and our estimate from our model (\(\hat{y_i}\)). Let’s visualize these residuals.
ggplot(data = df, aes(x = temp, y = yield)) +
geom_point(color = "red") +
geom_smooth(data = df, method = "lm", se = FALSE) +
geom_segment(aes(x = temp, y = yield, xend = temp, yend = fitted)) +
theme_bw() +
ggtitle("Visualizing residuals")
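If you’d like to pull the residuals out directly rather than just plotting them, here’s a quick sketch (the by-hand version should match R’s built-in accessor):

# residuals extracted with the built-in accessor
res_auto <- residuals(lr1)

# residuals computed by hand: observed minus fitted, matching e_i = y_i - y_hat_i
res_manual <- df$yield - df$fitted

# the two approaches should agree (up to floating-point noise)
all.equal(as.numeric(res_auto), res_manual)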
So why the “sum of squares”? Residuals can be positive or negative, but we don’t care about that as much as we care about their distance from the regression line. By squaring the residuals, we eliminate the positive and negative aspect of the residuals, and by minimizing the sum of these squared residuals, we find the line for which the distance between all points and the regression line is the smallest. We can think of this as the line that best fits the data. The key word in the preceding sentence was line. Linear regression forces, well, a linear fit to our data. Sometimes a linear fit isn’t best, as in the example below.
Really, the function that best fits this data is probably more parabolic than linear:
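One way to capture that kind of curvature is to add a squared term to the regression formula. Here’s a minimal sketch on simulated data (the curved example above was generated separately, so the numbers below are purely illustrative):

# simulate a roughly parabolic relationship
set.seed(42)
x <- seq(1, 10, length.out = 50)
y <- 3 + 2 * x - 0.2 * x^2 + rnorm(50, sd = 0.5)
sim <- data.frame(x, y)

# I(x^2) tells lm() to include x squared as an additional predictor
quad_fit <- lm(y ~ x + I(x^2), data = sim)

# overlay the quadratic fit on the points
ggplot(sim, aes(x = x, y = y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x + I(x^2), se = FALSE) +
  theme_bw()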
One more word of warning. Imagine our original data contains a big outlier, or a value that is abnormally far from the other values of a variable. Outliers can significantly change the shape of our regression line. In the figure below, the blue line is our original regression line and the red line is the regression line with the addition of an outlier point (also in red).
temp <- c(21.7, 22, 23.2, 22.3, 23.3, 24.7, 25.1, 26.5, 27.8, 27.1, 28.2, 29.7)
yield <- c(122.3, 124.6, 137.8, 130.1, 142.3, 140.2, 134.4, 145.1, 144.2, 143.3, 150.6, 200.1)
df3 <- data.frame(temp, yield)
pt <- df3 %>% filter(yield == max(yield))
ggplot(data = df3, aes(x = temp, y = yield)) +
geom_point() +
geom_smooth(data = df, method = "lm", se = FALSE) +
geom_smooth(data = df3, method = "lm", se = FALSE, color = "red") +
geom_point(data = pt, color = "red") +
theme_bw() +
ggtitle("Watch out for outliers!")
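You can also see the outlier’s influence directly in the estimated coefficients (a quick check using the two data frames above):

# slope estimated from the original data
coef(lm(yield ~ temp, data = df))

# slope estimated after swapping in the outlier; notice how much it changes
coef(lm(yield ~ temp, data = df3))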
Finally, let’s go back to the lm() function and learn how to interpret the results. Remember, we run a linear regression as follows:
lm(y~x, data = df)
The formula (y ~ x) can be as long and as complicated as you want, depending on the number of variables you have. The data.frame needs to contain each of the variables in the regression formula. When we run the code above, here’s the output we get:
##
## Call:
## lm(formula = yield ~ temp, data = df)
##
## Coefficients:
## (Intercept) temp
## 68.622 2.777
This gives us the basic results from the regression. In this case, it tells us the intercept and the slope of the line estimated by the lm() function. If you want more information than this, store the results of the linear regression in an object and call summary() on this object:
lr1 <- lm(yield~temp, data=df)
summary(lr1)
##
## Call:
## lm(formula = yield ~ temp, data = df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -6.5823 -4.1926 -0.5133 3.1568 8.9745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 68.6216 14.0331 4.89 0.000632 ***
## temp 2.7770 0.5554 5.00 0.000537 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.969 on 10 degrees of freedom
## Multiple R-squared: 0.7143, Adjusted R-squared: 0.6857
## F-statistic: 25 on 1 and 10 DF, p-value: 0.0005375
These results give us a bit more information, including:

- Estimates of the coefficients along with their standard errors (Std. Error) and significance (Pr(>|t|)).
- Measures of overall model fit: Multiple R-squared and Adjusted R-squared.

Given the pre-reqs for this class, I’m assuming you have a sense of how frequentist significance works, so we won’t go into the weeds of that, but I will explain the fit statistics in more detail below.
Final note: if you want to clean up the summary() results of a linear model, you can use the broom package as follows:
library(broom)
tidy(lr1)
## # A tibble: 2 x 5
## term estimate std.error statistic p.value
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 68.6 14.0 4.89 0.000632
## 2 temp 2.78 0.555 5.00 0.000537
This function transforms your results into a handy-dandy data.frame that you can wrangle to your heart’s content.
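For example, you could drop the intercept row or grab overall fit statistics (a small sketch; glance() is another broom function that returns things like R-squared as a one-row tibble):

# keep only the slope term(s)
tidy(lr1) %>% filter(term != "(Intercept)")

# one-row summary of overall model fit (R-squared, F-statistic, etc.)
glance(lr1)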
To use any modeling tool, it’s important to understand the assumptions implicit in the use of that tool. In the case of the lm() function, which applies Ordinary Least Squares (OLS) regression to a set of variables, your data needs to meet the following criteria in order for OLS to be the Best Linear Unbiased Estimator (BLUE):

- The relationship between the response and the predictors is linear in the parameters.
- The errors have a mean of zero given the predictors (no systematic over- or under-prediction).
- The errors have constant variance (homoskedasticity).
- The errors are uncorrelated with one another (no autocorrelation).
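A quick way to eyeball several of these assumptions is base R’s built-in diagnostic plots for lm objects; here’s a sketch using the lr1 model from earlier:

# residuals vs. fitted, normal Q-Q, scale-location, and residuals vs. leverage
par(mfrow = c(2, 2))
plot(lr1)
par(mfrow = c(1, 1))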
There are other big things to consider like collinearity in your explanatory variables (just a fancy word for correlation between variables), normality of your data, and any omitted variables and/or confounding variables. You now know how to look at a variable’s shape (histograms!) and correlation (cor() and corrplot()!). Let’s instead focus on looking at the structure of what we failed to explain with our model: the residuals! Then we’ll talk a bit about how you can use some of the visualization tools you’ve already mastered to think about one of the biggest challenges in regression: omitted variable bias!
Residuals first. OK, how do you pull residuals (the difference between what you actually observed and what the model predicts) from a regression? Let’s explore residuals from a slightly fancier dataset than the toy data we’ve been working with thus far.
corn <- readRDS("./data/corn.RDS") %>% filter(STATE_NAME %in% c("Georgia",
"Idaho",
"Iowa"),
YEAR == max(YEAR))
It’s corn, yet again! Let’s say we’re interested in how FERTILIZER_EXP (per-acre expenditures on fertilizer) affects CORN_YIELD. We could start by computing the correlation between these two variables:
cor(corn$CORN_YIELD, corn$FERTILIZER_EXP)
## [1] 0.3941222
We could also plot the data:
ggplot() +
geom_point(data = corn, aes(x = FERTILIZER_EXP, y = CORN_YIELD)) +
theme_minimal()
Taken together, these suggest a weak positive association, which makes sense based on what we know about reality.
Let’s try out a regression:
lm1 <- lm(CORN_YIELD ~ FERTILIZER_EXP, data = corn)
summary(lm1)
##
## Call:
## lm(formula = CORN_YIELD ~ FERTILIZER_EXP, data = corn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -88.196 -6.699 6.131 16.799 66.233
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 155.2764 8.0234 19.353 < 2e-16 ***
## FERTILIZER_EXP 0.5960 0.1274 4.678 7.72e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 27.12 on 119 degrees of freedom
## Multiple R-squared: 0.1553, Adjusted R-squared: 0.1482
## F-statistic: 21.88 on 1 and 119 DF, p-value: 7.722e-06
Pretty low R², so we’re not doing a great job of explaining yield variability. But fertilizer use seems to have a positive and significant effect on yields. We’d interpret the coefficient above as follows: for a one dollar increase in fertilizer expenditure per acre, yields increase by 0.596 bushels per acre. Be sure to always think about the units of your predictors. In this case, we might think of this effect as small, but given that farmers tend to spend an average of $59.93 per acre on fertilizer, this is actually a pretty large effect.
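If you want to sanity check that average, and the effect implied by the model when moving from zero spending to average spending, here’s a quick sketch:

# average per-acre fertilizer expenditure in this subset of counties
mean(corn$FERTILIZER_EXP)

# predicted yield difference between spending nothing and spending the average,
# according to the simple model above
coef(lm1)["FERTILIZER_EXP"] * mean(corn$FERTILIZER_EXP)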
As with correlation, remember that when applying any regression, the way in which you slice your data can drastically alter your results. Consider this simple visualization of the relationship between fertilizer and yields across the states in our sample:
ggplot(data = corn) +
geom_point(aes(x = FERTILIZER_EXP, y = CORN_YIELD), alpha = 0.2) +
geom_smooth(aes(x = FERTILIZER_EXP, y = CORN_YIELD), method = "lm", se=F) +
theme_bw() +
labs(x = "Fertilizer application ($/operated acre)",
y = "Corn yields (bu/ac)",
title = "Inputs and corn yields in the coterminous U.S.") +
facet_wrap(~ STATE_NAME, ncol = 3) +
theme(legend.position = "bottom")
What other factors might affect yields? Soil? Water? Sun? Yes. So a multiple linear regression (more than one predictor) is probably a better option:
lm2 <- lm(CORN_YIELD ~ FERTILIZER_EXP + SOIL_SUITABILITY + GDD + TP, data = corn)
summary(lm2)
##
## Call:
## lm(formula = CORN_YIELD ~ FERTILIZER_EXP + SOIL_SUITABILITY +
## GDD + TP, data = corn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -78.410 -8.437 1.215 14.505 52.717
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 219.59756 22.06114 9.954 < 2e-16 ***
## FERTILIZER_EXP 0.46214 0.11101 4.163 6.06e-05 ***
## SOIL_SUITABILITY 12.50697 13.91887 0.899 0.3707
## GDD -0.01034 0.00749 -1.381 0.1701
## TP -0.04609 0.01777 -2.593 0.0107 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.86 on 116 degrees of freedom
## Multiple R-squared: 0.4148, Adjusted R-squared: 0.3946
## F-statistic: 20.55 on 4 and 116 DF, p-value: 8.013e-13
Our R² is higher, which makes sense because we’ve added new significant predictors, but it’s still not stellar, suggesting we’re missing some pretty big drivers of yield. WOWZA, look at the size of the coefficient on SOIL_SUITABILITY!! But wait. This is largely due to the difference in units across explanatory variables. SOIL_SUITABILITY is an index with a very different range from GDD (growing degree days in a season) or TP (total precipitation). We’d interpret the coefficient on SOIL_SUITABILITY as “holding the other predictors constant, a one unit increase in the soil suitability index is associated with a yield gain of about 12.5 bushels/acre,” whereas for growing degree days, we’d say “holding the other predictors constant, a one degree increase in seasonal growing degree days is associated with a yield change of about -0.01 bushels/acre” (note, though, that neither of these effects is significant!). To address this interpretation issue, you can always standardize all predictors prior to running your regression. Then you can compare coefficients in terms of standard deviations rather than absolute values. Also, check out the intercept. This gives us an estimate of baseline yields if all predictor values are set to zero.
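Here’s a minimal sketch of that standardization approach, using scale() directly in the formula so that each coefficient reflects the effect of a one-standard-deviation increase in that predictor:

# standardize each predictor (mean 0, sd 1) inside the formula
lm2_std <- lm(CORN_YIELD ~ scale(FERTILIZER_EXP) + scale(SOIL_SUITABILITY) +
                scale(GDD) + scale(TP), data = corn)

# coefficients are now directly comparable across predictors:
# each is the expected change in yield (bu/ac) per one-sd increase
summary(lm2_std)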
OK, back to the main objective here: how do we pull residuals out of our model results? How do we inspect these residuals? An easy way is using the handy-dandy residuals() function:
corn$my_residuals <- residuals(lm2)
This code adds a new column to my data.frame called my_residuals. This column shows us, for each observation, the difference between observed yields (the actual values in the data) and the yields predicted by the model. It’s kind of an assessment of how wrong our model was for each observation. We can check a number of the BLUE assumptions by looking at these residuals, for example:
hist(corn$my_residuals)
They are not entirely normal, but the mean of the residuals is about \(-8.96 \times 10^{-16}\), very close to zero.
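You can verify that yourself with:

# mean of the residuals should be (essentially) zero for an OLS fit with an intercept
mean(corn$my_residuals)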
My favorite thing to do with residuals is to map them! This lets us detect spatial patterns lurking in the residuals and gives us clues about important variables we’ve omitted from our analysis. Here’s how (and don’t stress, you don’t need to do this for a grade, just for fun!). To do this, you’ll need to load the cb_2018_us_county_500k.shp file you used during the Spatial Data Science module, which you can re-download here:
library(sf)
cty <- st_read("./data/cb_2018_us_county_500k.shp", quiet=T) %>%
filter(GEOID %in% unique(corn$GEOID)) # pull out only counties in our corn data
# now merge our data to this shapefile
corn_sf <- merge(cty, corn, by = "GEOID", all = T)
# and visualize!
ggplot() +
geom_sf(data = corn_sf, aes(fill = my_residuals)) +
theme_bw() +
scale_fill_gradient(low = "white", high = "dark green")
Admittedly a weird map, but if you were looking at the entire country, you could detect interesting patterns in the variance unexplained by your model! Here’s how you’d do that:
corn_full <- readRDS("./data/corn.RDS")
lm2 <- lm(CORN_YIELD ~ FERTILIZER_EXP + SOIL_SUITABILITY + GDD + TP, data = corn_full)
corn_full$my_residuals <- residuals(lm2)
cty <- st_read("./data/cb_2018_us_county_500k.shp", quiet=T)
# now merge our data to this shapefile
corn_sf <- merge(cty, corn_full, by = "GEOID")
# and visualize!
ggplot() +
geom_sf(data = corn_sf, aes(fill = my_residuals)) +
theme_bw() +
scale_fill_gradient2(low = "dark blue", mid = "white", high = "dark green") # diverging color palette
The counties you can’t see are just counties with no corn data. See any interesting patterns?
I go over a few ways to assess model performance in the video for this week. I also emphasize the approach of holding out some independent “test” data to assess how well your model performs at predicting real values. Let’s try this programmatically. First, we need to split our corn data randomly into testing and training data. We can do this like so:
set.seed(1) # this ensures you generate the same random row numbers each time
# hold out 25% of the data
random_rn <- sample(nrow(corn), ceiling(nrow(corn)*.25))
train <- corn[-random_rn,] # remove the random_rn row numbers
test <- corn[random_rn,] # keep the random_rn row numbers
nrow(train)
## [1] 90
nrow(test)
## [1] 31
Say we run a bunch of models and settle on one we really like, lm3:
lm3 <- lm(CORN_YIELD ~ GDD + TP + SOIL_SUITABILITY + PERC_IRRIGATED + FERTILIZER_EXP, data = corn)
summary(lm3)
##
## Call:
## lm(formula = CORN_YIELD ~ GDD + TP + SOIL_SUITABILITY + PERC_IRRIGATED +
## FERTILIZER_EXP, data = corn)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.575 -9.079 2.228 13.732 57.682
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 213.990717 20.758832 10.308 < 2e-16 ***
## GDD -0.017920 0.007274 -2.463 0.0152 *
## TP -0.024591 0.017501 -1.405 0.1627
## SOIL_SUITABILITY 33.345680 14.034191 2.376 0.0192 *
## PERC_IRRIGATED 73.271958 17.989442 4.073 8.56e-05 ***
## FERTILIZER_EXP 0.238531 0.117803 2.025 0.0452 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 21.47 on 115 degrees of freedom
## Multiple R-squared: 0.4886, Adjusted R-squared: 0.4663
## F-statistic: 21.97 on 5 and 115 DF, p-value: 2.142e-15
We can assess how well this model does at predicting CORN_YIELD by using it to predict yields from the explanatory variables in the test data. To do this, we feed the predict() function our model and our test data:
predictions <- predict(lm3, test) # aka, use the model we've built to predict our outcome, y, given the values of X in the test data
predictions
## 121 68 39 1 34 87 43 14
## 207.5790 203.5043 178.7257 163.2099 170.8295 190.1427 219.3042 174.1179
## 82 59 51 97 85 21 106 54
## 208.5304 219.1489 210.9672 214.8245 197.3881 147.2898 193.3993 204.1085
## 74 7 73 79 109 37 89 100
## 202.8729 166.6056 198.2635 207.2776 197.6214 172.7589 186.3649 213.3997
## 117 99 44 102 33 84 35
## 200.0173 209.7871 205.2322 194.3187 145.8687 184.6390 162.0201
This returns a vector of predicted yields for each observation (i.e., for each set of explanatory variables). We can then compute the mean squared error, or the mean of the squared differences between the vector of predictions and the vector of observed values:
mean((test$CORN_YIELD - predictions)^2)
## [1] 296.0894
Our goal is to find the model that minimizes our mean squared error (or how far off our predictions are from our actual observed values).
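As a sketch of how you might tighten this up: strictly speaking, you’d fit the model on the training rows only and then compute the error on the held-out rows, and taking the square root (the RMSE) puts the error back into bushels per acre:

# refit the same specification using only the training data
lm3_train <- lm(CORN_YIELD ~ GDD + TP + SOIL_SUITABILITY + PERC_IRRIGATED + FERTILIZER_EXP,
                data = train)

# mean squared error and root mean squared error on the held-out test data
test_mse  <- mean((test$CORN_YIELD - predict(lm3_train, test))^2)
test_rmse <- sqrt(test_mse)

test_mse
test_rmse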
Dig this new RMD style?? Thanks to a student in our class for pointing me in this direction! If you want your RMD to look like this, simply add the following to your header:
---
title: "My title"
author: "Dr. Burchfield"
output:
html_document:
theme: flatly
toc: yes
toc_float: true
---
And for something really cool, try this:
---
title: "My title"
author: "Dr. Burchfield"
output:
html_document:
theme: flatly
toc: yes
toc_float: true
code_folding: hide
---
Thanks to Dr. Mine Cetinkaya-Rundel for her awesome summary of this important distinction.↩
One option for confirmatory analysis is to split your data into testing and training data, which we’ll come back to later in the class. This approach allows you to explore the training data, and test your final model on the test data.↩
So much to read here, but check out some of the data visualizations created by the IPCC summarizing likely impacts of climate change on our food systems here.↩
Up to a certain point, see for example this paper by Schlenker and Roberts that describes how temperatures above 29°C can be very harmful for corn yields. They use a nonlinear approach to capture this relationship.↩