In this tutorial we'll learn how to use for loops in R to quickly complete repetitive tasks, and how to build functions in R tailor-made to your research interests and questions (aka become a programming superman/woman). We'll also touch on the apply() functions in R. We don't need much today since we'll try to build our own tools rather than rely on functions stored in packages.
library(tidyverse)
For loops are great when you have a repetitive task you want to apply in the same way to many things. Imagine, for example, that you want to load a bunch of rasters, reproject them, add ten to all pixels, and then save the rasters… or imagine you have a big data frame and you want to systematically subset it by some category (by state, say).
Let’s start with a conceptual overview of what a for loop does:
for (value in sequence) {
do some stuff
}
So say we have a vector of values, say years, and we want to print out every year in the vector:
years <- c(1997, 2007, 2017)
for (y in years){ # for value in sequence
print(y)
}
## [1] 1997
## [1] 2007
## [1] 2017
This for loop goes through the sequence vector years
and prints out each value (here coded as y
, but we could call it anything we wanted). We can add some additional complexity to the for loop:
for (y in years){ # for value in sequence
print(paste("The year is", y))
}
## [1] "The year is 1997"
## [1] "The year is 2007"
## [1] "The year is 2017"
The y in years
basically tells R
to go through each element in the years
vector. What if we do this?
for (y in 1:length(years)) {
print(paste("The year is", y))
}
## [1] "The year is 1"
## [1] "The year is 2"
## [1] "The year is 3"
This is very different. Here we assign the values of 1,2,3
to the variable y
by calling 1:length(years)
rather than referring directly to the years
vector. We would have to do the following to generate the same output as the first for loop:
for (y in 1:length(years)) {
print(paste("The year is", years[y]))
}
## [1] "The year is 1997"
## [1] "The year is 2007"
## [1] "The year is 2017"
Now our for loop assigns the values 1, 2, 3 to y as it moves through iterations of the loop, and uses these values to index the years vector, for example years[1], which is 1997. This way of creating for loops can be really useful for larger, more complex datasets, for example if you wanted to loop through UNIQUE values in a particular vector.
I often use this approach to loop through FILES on my machine, perform some operation on them, then save the updated version… here’s a toy example:
library(raster) # raster() and writeRaster() below come from the raster package
flist <- Sys.glob("./data/*.tif") # creates a vector of all of the files with the .tif extension in my data folder
for (i in 1:length(flist)) {
  r <- raster(flist[i]) # loads the ith raster in flist
  r <- r + 10 # apply some arbitrary function to the raster, or build your own (see below) and apply!
  writeRaster(r, paste0("./out/", i, ".tif")) # save each new raster in a new folder, named using the iterator i
}
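A small aside: 1:length(flist) misbehaves if flist happens to be empty (it becomes the sequence 1, 0), so seq_along() is often a safer iterator. Here's a sketch of the same toy loop that also keeps the original file names in the output (assuming the same flist and ./out/ folder as above):
for (i in seq_along(flist)) { # seq_along(flist) is empty when flist is empty, so the loop simply doesn't run
  r <- raster(flist[i])
  r <- r + 10
  writeRaster(r, paste0("./out/", basename(flist[i])), overwrite = TRUE) # reuse the original file name
}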
Ok, now here is where R
gets extremely powerful. Once you understand how to build and use functions to execute major tasks in your projects, you can (1) build a repository of your own functions that you can reuse across projects, (2) write code that works in many different situations (generalizable code), and (3) write code that's much easier for you and others to read.
Let’s start with a really basic example. Say we want to create a function that squares all values in a vector:
myvector <- c(1, 3, 10)
myfunction <- function(x) {
out <- x^2
return(out)
}
myfunction(x = myvector)
## [1] 1 9 100
Here, we have to name our function (myfunction
in this case), then use the function()
function (that’s meta) to create a function. In the parentheses of the function()
function, we need to list any arguments we want the function to have, in this case the input vector on which we want to perform some computations. Then we use squiggly brackets ({}
) before and after the meat of the function, the code that actually does the work we want to do. It's important to return() something from the function; this is the output the function will generate. If the last thing we do inside the function is an assignment, R returns that value invisibly, so nothing shows up in the console:
myvector <- c(1, 3, 10)
myfunction <- function(x) {
out <- x^2
}
myfunction(x = myvector) # nada
If, on the other hand, we don't dump the result of x^2 into the out variable, the function will return the result of that expression. In general, a function returns the value of the last expression evaluated inside the {} brackets.
myvector <- c(1, 3, 10)
myfunction <- function(x) {
x^2
}
myfunction(x = myvector) # prints out results
## [1] 1 9 100
Best practice is to use the return()
function because you could imagine a more complex function that does many things…
myvector <- c(1, 3, 10)
myfunction <- function(x) {
v1 <- x^2
v2 <- v1/3
v3 <- v2 + 100
return(v3)
}
myfunction(x = myvector) # prints out results
## [1] 100.3333 103.0000 133.3333
What if you want to return more than one thing (say v2
and v3
from the example above)? Consider using a list:
myvector <- c(1, 3, 10)
myfunction <- function(x) {
v1 <- x^2
v2 <- v1/3
v3 <- v2 + 100
out <- list(v2, v3)
return(out)
}
out <- myfunction(x = myvector) # store the list of results
out[[1]]
## [1] 0.3333333 3.0000000 33.3333333
out[[2]]
## [1] 100.3333 103.0000 133.3333
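If you name the list elements, you can also pull results out by name rather than by position. Here's a small variant of the function above (the names divided and shifted are just illustrative):
myfunction <- function(x) {
  v1 <- x^2
  v2 <- v1/3
  v3 <- v2 + 100
  return(list(divided = v2, shifted = v3)) # a named list
}
out <- myfunction(x = myvector)
out$shifted # same values as out[[2]] above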
Now let's work through a real-world example. Say you're working on a big research project where you need to complete the following tasks: (1) load a set of raw datasets, (2) merge them into a single analysis-ready dataset, and (3) visualize the results.
It's helpful to start a project by writing out the major tasks you'll need to complete. This discretizes the project into smaller parts. The idea is to build a FUNCTION to execute each task, for example load_data(), merge_data(), and create_viz(). Then, to implement your entire project, you'd basically just run the following code (with all the scripts to build the data behind the scenes; see the "File management" section below for tips on how to organize all of this):
load_data()
merge_data()
create_viz()
Let’s work through an example so this makes more sense. Say we have three spreadsheets with yields for corn, soy, and winter wheat. Luckily this data is stored in a consistent format since it comes from the same place (USDA), but each dataset needs a bit of processing and tidying before you merge them together… Here I’ve loaded one of the datasets so you can get a sense of what it looks like:
corn <- read.csv("./data/corn.csv")
glimpse(corn)
## Observations: 45,181
## Variables: 21
## $ Program <fct> SURVEY, SURVEY, SURVEY, SURVEY, SURVEY, SURVE...
## $ Year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 201...
## $ Period <fct> YEAR, YEAR, YEAR, YEAR, YEAR, YEAR, YEAR, YEA...
## $ Week.Ending <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Geo.Level <fct> COUNTY, COUNTY, COUNTY, COUNTY, COUNTY, COUNT...
## $ State <fct> ALABAMA, ALABAMA, ALABAMA, ALABAMA, ALABAMA, ...
## $ State.ANSI <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
## $ Ag.District <fct> BLACK BELT, BLACK BELT, BLACK BELT, BLACK BEL...
## $ Ag.District.Code <int> 40, 40, 40, 40, 40, 40, 40, 50, 50, 50, 50, 5...
## $ County <fct> AUTAUGA, BULLOCK, DALLAS, ELMORE, MACON, OTHE...
## $ County.ANSI <int> 1, 11, 47, 51, 87, NA, 105, 3, 35, 53, 99, NA...
## $ Zip.Code <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Region <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ watershed_code <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ Watershed <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
## $ Commodity <fct> CORN, CORN, CORN, CORN, CORN, CORN, CORN, COR...
## $ Data.Item <fct> CORN, GRAIN - YIELD, MEASURED IN BU / ACRE, C...
## $ Domain <fct> TOTAL, TOTAL, TOTAL, TOTAL, TOTAL, TOTAL, TOT...
## $ Domain.Category <fct> NOT SPECIFIED, NOT SPECIFIED, NOT SPECIFIED, ...
## $ Value <dbl> 170.8, 135.0, 149.6, 153.0, 151.3, 150.8, 122...
## $ CV.... <lgl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, N...
Gross. USDA data is not always in the best format. What we want to do in this example is (1) make a unique identifier for each county by merging the County.ANSI
and State.ANSI
columns (these are the state and county FIPS codes used by many federal and state agencies in the US). (2) Then we want to drop anything not at the county scale (County.ANSI == NA
), (3) then we want to select only the columns we care about, say Year
, our new county ID column (GEOID
), and Value
, which is the actual value of yield in bushels/acre. Finally, we want to add a new column to each data.frame
indicating the crop it represents, so we can keep track of this when we merge the three data.frames
. We want to do this for each crop, and then merge the three datasets. Let’s start by thinking about the cleaning of the data. I tend to start by writing comments indicating each step I need to take:
# load data
# drop weird counties
# create GEOID column
# select important variables
Then I populate the area below the comments with the code needed to complete each task. Once we get this code working, we’ll wrap it into a function. The trick/art of writing functions is to write code so that it will work in the same way on multiple inputs (e.g. corn.csv
, wwheat.csv
, soy.csv
). This can cause some frustration initially, but once you get the hang of this, you’ll be a function-writing-machine:
# load data
df <- read.csv("./data/corn.csv")
# drop non-counties
df <- df %>%
filter(!is.na(County.ANSI))
# create GEOID column
df <- df %>% # since I want GEOID to be five digits SSCCC (state then county), I need to add pads to the strings, so that state 1 becomes 01.
mutate(State.ANSI = str_pad(State.ANSI, width = 2, side = "left", pad = "0"),
County.ANSI = str_pad(County.ANSI, width = 3, side = "left", pad = "0"),
GEOID = paste0(State.ANSI, County.ANSI))
# select important variables
df <- df %>%
select(Year, GEOID, Value)
Now, when you see code with multiple separate pipes like this, consider combining them into one big pipe, like so:
# load data
df <- read.csv("./data/corn.csv")
# create GEOID column AND select key columns
df <- df %>% # since I want GEOID to be five digits SSCCC (state then county), I need to add pads to the strings, so that state 1 becomes 01.
filter(!is.na(County.ANSI)) %>%
mutate(State.ANSI = str_pad(State.ANSI, width = 2, side = "left", pad = "0"),
County.ANSI = str_pad(County.ANSI, width = 3, side = "left", pad = "0"),
GEOID = paste0(State.ANSI, County.ANSI)) %>%
select(Year, GEOID, Value)
glimpse(df)
## Observations: 42,005
## Variables: 3
## $ Year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20...
## $ GEOID <chr> "01001", "01011", "01047", "01051", "01087", "01105", "0...
## $ Value <dbl> 170.8, 135.0, 149.6, 153.0, 151.3, 122.3, 162.0, 120.7, ...
Now, to make this a FUNCTION, we need to think about the function arguments, or the inputs to the function. For example, if we call the function mean(x)
, the value of x
is the parameter that will change. In our case, the thing that will change is the FILE NAME, i.e. corn.csv
or wwheat.csv
. Therefore, this will be the main input to the function. Note that all of the spreadsheets have the same column names and structure, which is why this functional approach works well.
load_data <- function(file_name) {
# copy and paste the code from above
df <- read.csv(file_name) # but replace the actual filename with the file_name argument
df <- df %>%
filter(!is.na(County.ANSI)) %>%
mutate(State.ANSI = str_pad(State.ANSI, width = 2, side = "left", pad = "0"),
County.ANSI = str_pad(County.ANSI, width = 3, side = "left", pad = "0"),
GEOID = paste0(State.ANSI, County.ANSI)) %>%
select(Year, GEOID, Value) %>%
mutate(CROP = sub(".*data/ *(.*?) *.csv.*", "\\1", file_name)) # some gross regex so we can extract the crop from the filename and add a "CROP" column to our dataset
# don't forget to return something!
return(df)
}
corn <- load_data("./data/corn.csv")
soy <- load_data("./data/soy.csv")
wwheat <- load_data("./data/wwheat.csv")
glimpse(wwheat) # always inspect the results of the function to make sure it's working as planned
## Observations: 7,597
## Variables: 4
## $ Year <int> 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 2017, 20...
## $ GEOID <chr> "05037", "05077", "05095", "05107", "05147", "05111", "0...
## $ Value <dbl> 65.6, 47.9, 63.1, 51.5, 68.8, 66.6, 63.3, 44.4, 45.8, 72...
## $ CROP <chr> "wwheat", "wwheat", "wwheat", "wwheat", "wwheat", "wwhea...
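Since load_data() takes a single file name as its argument, you could also load all three files in one shot with lapply() rather than calling the function three times. A sketch, assuming the same three files:
files <- c("./data/corn.csv", "./data/soy.csv", "./data/wwheat.csv")
crop_list <- lapply(files, load_data) # a list with one cleaned data.frame per file
# crop_list[[1]] is corn, crop_list[[2]] is soy, crop_list[[3]] is wwheat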
Awesome! Now we have three datasets, corn
, soy
, and wwheat
that all have the same format. The next step is to merge the data. Here we want to build a function that takes as an argument a list of data.frames
and squishes them into a single data.frame
, in our case, merging by Year
and GEOID
:
df_list <- list(corn, soy, wwheat)
merge_data <- function(df_list) {
out <- bind_rows(df_list) # bind the rows of all three data.frames together
out <- spread(out, CROP, Value) # tidy the data so each crop has its own column
return(out)
}
final_df <- merge_data(df_list = df_list)
glimpse(final_df)
## Observations: 91,534
## Variables: 5
## $ Year <int> 1927, 1927, 1927, 1927, 1927, 1927, 1927, 1927, 1927, 1...
## $ GEOID <chr> "17001", "17005", "17007", "17009", "17011", "17013", "...
## $ corn <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
## $ soy <dbl> 12, 9, 13, 15, 14, 10, 11, 13, 15, 10, 11, 7, 10, 11, 7...
## $ wwheat <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,...
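A side note: spread() works fine here, but newer versions of tidyr favor pivot_wider(). A roughly equivalent sketch of the same merge step, in case you prefer the newer function:
merge_data <- function(df_list) {
  out <- bind_rows(df_list) # bind the rows
  out <- pivot_wider(out, names_from = CROP, values_from = Value) # one column per crop
  return(out)
}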
Now let’s create a tool that plots yield through time for these crops. Note that there’s a LOT of missing yield data in the dataset:
plot_yield <- function(df, crop) {
df_sub <- df %>%
select(Year, GEOID, crop) %>% # select only yield for the crop of interest
drop_na() # drop rows with missing yield data, don't wanna plot that
colnames(df_sub) <- c("Year", "GEOID", "Yield") # to help with plotting
myplot <- ggplot(data = df_sub) + # you'll need to save the plot in a variable so we can return that variable
geom_boxplot(aes(x = Year, y = Yield, group=Year)) +
geom_smooth(aes(x = Year, y = Yield), method="lm")
return(myplot)
}
plot_yield(final_df, crop = "corn")
plot_yield(final_df, crop = "soy")
plot_yield(final_df, crop = "wwheat")
Here’s why this is cool. Now we can run the ENTIRE analysis from scratch calling only the following:
corn <- load_data("./data/corn.csv")
soy <- load_data("./data/soy.csv")
wwheat <- load_data("./data/wwheat.csv")
final_df <- merge_data(df_list = list(corn, soy, wwheat))
plot_yield(final_df, crop = "corn")
plot_yield(final_df, crop = "soy")
plot_yield(final_df, crop = "wwheat")
Reads like a book. And if you have questions about what a particular function is actually doing, you just have to go find that function and inspect how it was built. This makes it much easier to build far more complex chunks of code without miles of code in your file. Ideally, you’d store all of the functions you build in a func.R
file (see below), and then call source("func.R")
at the top of your script. This loads all of the functions you’ve built for use in your current working environment! Try it!
For most of us, it takes a few iterations of getting things wrong to figure out how to get things right (life lesson for you). I’ll share what I do to organize my data and code. I encourage you to think through a structure that works for you.
On my computer, I have a big folder of "Working Folders". In this folder, I have a separate folder for each project I'm currently working on, e.g. "Cool project on agricultural adaptation" or "That cool project in Sri Lanka". In each working folder, I organize things in the following folders:
data - ALL data goes here, ideally in read-only form so I can't mess up my original data.
lit - any relevant literature/references go here, though I try to store most of this in a reference manager (Zotero, Mendeley).
figs - visualization outputs from analyses are saved out here.
writing - article drafts, methods documentation, etc. all go here, and often also in the RMarkdown doc I describe below.
archive - old things I don't think I need anymore get dumped here. Don't delete! You never know when you may need things again! (Note that if you use something like Git for version control, this isn't really necessary.)
out - results, like summaries of models or key data structures I want to save.
I keep my code organized in four files for each big project (taken from this fantastic post):
load.R
clean.R
func.R
do.R
load.R
: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project at this point I’ll either write out the workspace using save() or just keep things in memory for the next step.
clean.R
: This is where all the ugly stuff lives - taking care of missing values, merging data frames, handling outliers.
func.R
: Contains all of the functions needed to perform the actual analysis. source()
ing it loads all of the functions for use in other scripts. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
do.R
: Calls the functions defined in func.R
to perform the analysis and produce charts and tables.
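To make that concrete, a do.R for the crop-yield example above might look something like this (a sketch; the exact file names and figure paths are just suggestions):
# do.R -- run the whole analysis from scratch
source("./func.R") # load load_data(), merge_data(), plot_yield(), etc.
corn <- load_data("./data/corn.csv")
soy <- load_data("./data/soy.csv")
wwheat <- load_data("./data/wwheat.csv")
final_df <- merge_data(df_list = list(corn, soy, wwheat))
p <- plot_yield(final_df, crop = "corn")
ggsave("./figs/corn_yield.png", plot = p) # save the figure to the figs folder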
This will make more sense as we move through the semester.
Then, most importantly, for each project I maintain an RMarkdown
document that explores my data and documents key analyses. This is a great way to keep track of how and why you do specific analyses, how you clean and transform your data, and what your results look like as you move through your project. What’s RMarkdown
you say? Read on, young samurai. Sometimes one RMarkdown
document will do. Other times, for big projects, I create several documents. Think of these documents as a way to record your decision process. The final code can be stored in the main R
scripts, but there’s a lot of thinking and playing that goes into these final scripts, and it’s important you document this somewhere!
CITE your data! Cite R too! Keep all of your data files in the same directory as your code, then use relative paths (.
) to access your data, e.g.
my_cool_data <- read.csv(file = "./data/my_awesome_data.csv")
Rather than something that looks like this:
my_cool_data <- read.csv(file = "/Users/Emily/Nowyouknow/what/my/filestructure/lookslike/mydata.csv")
You can learn more about managing directories and your RStudio
workspace here. You can read about some of the most commonly used functions to import data here.
When I start a coding project, I normally get a pencil and paper and draw out the process… this normally culminates in a wonky flow chart documenting each major step. I then write this out as "pseudocode" in an R
script… this could look as simple as this:
# clean datasets
# run interpolation procedure
# calibrate models
# project values
# summarize and visualize results
…though often it's a bit more detailed. Once you outline the process, you can start to write functions and code snippets that actually implement the analysis. If you use my code organization system, all of your functions will be stored in func.R
and all of your data wrangling in clean.R
. It's very helpful to start with the end in mind so you can organize scripts and avoid overlooking major parts of the analysis. As you write your code, COMMENT, COMMENT, COMMENT! If I write a chunk of code that looks like this:
library(sp)
house1.building <- Polygon(rbind(c(1, 1), c(2, 1), c(2, 0), c(1, 0)))
house1.roof <- Polygon(rbind(c(1, 1), c(1.5, 2), c(2, 1)))
house2.building <- Polygon(rbind(c(3, 1), c(4, 1), c(4, 0), c(3, 0)))
house2.roof <- Polygon(rbind(c(3, 1), c(3.5, 2), c(4, 1)))
house2.door <- Polygon(rbind(c(3.25, 0.75), c(3.75, 0.75), c(3.75, 0), c(3.25, 0)), hole=T)
h1 <- Polygons(list(house1.building, house1.roof), "H1")
h2 <- Polygons(list(house2.building, house2.roof, house2.door), "H2")
houses <- SpatialPolygons(list(h1, h2))
plot(houses)
Do you have any clue what’s going on? Or does this make you feel a bit sick? Yeah. If you add COMMENTS to your code (done in R
with the #
sign), then things are much easier to follow. Combine this with embedding code in NARRATIVE using RMarkdown
, and your code will be easy enough for a five-year-old to understand (OK, a really really smart one). Even if you don't understand the details, you can get the general sense of what I'm doing below (and in a few weeks, do this yourself):
# load the sp package
library(sp)
# create a list of points to draw a house polygon
house1.building <- Polygon(rbind(c(1, 1), c(2, 1), c(2, 0), c(1, 0)))
# create a list of points to draw a roof polygon
house1.roof <- Polygon(rbind(c(1, 1), c(1.5, 2), c(2, 1)))
# repeat for a different house/roof combination
house2.building <- Polygon(rbind(c(3, 1), c(4, 1), c(4, 0), c(3, 0)))
house2.roof <- Polygon(rbind(c(3, 1), c(3.5, 2), c(4, 1)))
# add a door to house.2
house2.door <- Polygon(rbind(c(3.25, 0.75), c(3.75, 0.75), c(3.75, 0), c(3.25, 0)), hole=T)
# connect the roof and building of house 1 into a single polygon
h1 <- Polygons(list(house1.building, house1.roof), "H1")
# connect the roof, building, and door of house 2 into a single polygon
h2 <- Polygons(list(house2.building, house2.roof, house2.door), "H2")
# combine both houses into a single SpatialPolygons object called "houses"
houses <- SpatialPolygons(list(h1, h2))
# plot our new polygon
plot(houses)
I always start my code by being explicit about the packages and dependencies the code needs. This means I list the packages the script needs at the very top of the file.
library(ggplot2)
library(dplyr)
library(sp)
I also list the data input files at the top of my scripts, e.g.
setwd(".")
infile <- "./data/myfile.csv"
indata <- read.csv(infile)
Also be careful how you name variables: infile
and inFile
are NOT the same. Read more here.
Finally, BACK UP YOUR DATA and consider using version control resources like GitHub. You can find great GitHub tutorials here and here. More info/recommendations on version control here.
R
can run into memory issues. It is a common problem to run out of memory after running R
scripts for a long time. To inspect the objects in your current R
environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long lines of computationally intensive code is to remove temporary objects after they have served their purpose (rm()
). However, sometimes, R
will not clean up unused memory for a while after you delete objects. You can force R
to tidy up its memory by using gc()
. You can figure out which objects are in your current workspace by calling ls()
. You can programmatically delete all objects in your workspace by calling rm(list = ls())
, but be sure you’re ready to really remove all objects before you use this approach. You can read more about memory management here.
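A minimal sketch of that cleanup workflow (big_temp is just a placeholder object):
ls() # list the objects currently in your workspace
big_temp <- rnorm(1e7) # ...some large temporary object you no longer need
rm(big_temp) # remove it
gc() # ask R to release the freed memory back to the system
# rm(list = ls()) # the nuclear option: removes EVERYTHING in the workspace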
Wherever possible, keep track of sessionInfo()
somewhere in your project folder. Session information is invaluable because it captures all of the packages used in the current project. If a newer version of a package changes the way a function behaves, you can always go back and re-install the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived).
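One easy way to do this is to write the session information out to a text file somewhere in your project (the file name here is just a suggestion):
writeLines(capture.output(sessionInfo()), "./out/session_info.txt") # snapshot of your R version and loaded packages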
Other recommendations on coding best practices here and here. This page provides lots of links to recommendations from the R
community. This is a more detailed overview and this text covers Advanced R and project management.