For most of us, it takes a few iterations of getting things wrong to figure out how to get things right (life lesson for you). I’ll share what I do to organize my data and code. I encourage you to think through a structure that works for you.
On my computer, I have a big folder called “Working Folders”. In this folder, I have a separate folder for each project I’m currently working on, e.g. “Cool project on agricultural adaptation” and “That cool project in Sri Lanka”. In each working folder, I organize things in the following folders:
data
- ALL data goes here, ideally in read-only form so I can’t mess up my original data.
lit
- any relevant literature/references go here, though I try to store most of this in a reference manager (Zotero, Mendeley).
figs
- visualization outputs from analyses are saved out here.
writing
- article drafts, methods documentation, etc., all go here, and often also go in my RMarkdown doc that I describe below.
archive
- old things I don’t think I need anymore, I dump here. Don’t delete! You never know when you may need things again! (Note that if you use something like Git for version control, this isn’t really necessary.)
out
- for results, like summaries of models or key data structures I want to save.
I keep my code organized in four files for each big project (taken from this fantastic post):
load.R
clean.R
func.R
do.R
load.R: Takes care of loading in all the data required. Typically this is a short file, reading in data from files, URLs and/or ODBC. Depending on the project, at this point I’ll either write out the workspace using save() or just keep things in memory for the next step.
clean.R: This is where all the ugly stuff lives: taking care of missing values, merging data frames, handling outliers.
func.R: Contains all of the functions needed to perform the actual analysis. source()ing this file loads all of the functions for use in other scripts. This means that you can modify this file and reload it without having to go back and repeat steps 1 & 2, which can take a long time to run for large data sets.
do.R: Calls the functions defined in func.R to perform the analysis and produce charts and tables.
This will make more sense as we move through the semester.
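As a rough illustration of the layout described above, here is a minimal sketch that builds the folders and the four empty scripts. The project name `cool_project` is made up, and the sketch writes into `tempdir()` (an assumption, so that it cannot touch real files); in practice you would point `project_dir` at your own working folder.

```r
# Hypothetical project root; tempdir() keeps this sketch away from real files.
project_dir <- file.path(tempdir(), "cool_project")

# Folder names follow the layout described above.
folders <- c("data", "lit", "figs", "writing", "archive", "out")
for (folder in folders) {
  dir.create(file.path(project_dir, folder), recursive = TRUE, showWarnings = FALSE)
}

# The four scripts, created empty if they do not exist yet.
scripts <- file.path(project_dir, c("load.R", "clean.R", "func.R", "do.R"))
for (script in scripts) {
  if (!file.exists(script)) file.create(script)
}

# List what was created.
list.files(project_dir)
```

One possible convention (not the only one) is for do.R to begin by source()ing load.R, clean.R, and func.R in that order, so that running do.R reproduces the whole analysis.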
Then, most importantly, for each project I maintain an RMarkdown document that explores my data and documents key analyses. This is a great way to keep track of how and why you do specific analyses, how you clean and transform your data, and what your results look like as you move through your project. What’s RMarkdown, you say? Read on, young samurai. Sometimes one RMarkdown document will do. Other times, for big projects, I create several documents. Think of these documents as a way to record your decision process. The final code can be stored in the main R scripts, but there’s a lot of thinking and playing that goes into these final scripts, and it’s important you document this somewhere!
CITE your data! Cite R too! Keep all of your data files in the same project directory as your code, then use relative paths (starting with .) to access your data, e.g.
my_cool_data <- read.csv(file = "./data/my_awesome_data.csv")
Rather than something that looks like this:
my_cool_data <- read.csv(file = "/Users/Emily/Nowyouknow/what/my/filestructure/lookslike/mydata.csv")
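To make the relative-path idea concrete, here is a small self-contained sketch. It assumes a throwaway working directory (`tempdir()`) and writes a tiny made-up CSV so there is something to read; the column names and values are invented for illustration. `file.path()` is used to build the path from its pieces, which avoids typing separators by hand.

```r
# Work in a throwaway directory so the sketch is self-contained.
old_wd <- setwd(tempdir())
dir.create("data", showWarnings = FALSE)

# Write a tiny example file (made-up values) so there is something to read.
write.csv(data.frame(id = 1:3, value = c(10, 20, 30)),
          file = file.path("data", "my_awesome_data.csv"),
          row.names = FALSE)

# file.path() joins the pieces with a separator that works on every platform.
infile <- file.path(".", "data", "my_awesome_data.csv")
my_cool_data <- read.csv(file = infile)

# normalizePath() shows the absolute path that the relative one resolves to.
normalizePath(infile)

# Restore the original working directory.
setwd(old_wd)
```

Because the path is relative to the project folder, the same script should keep working if the whole folder is moved or shared with a collaborator.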
You can learn more about managing directories and your RStudio workspace here. You can read about some of the most commonly used functions to import data here.
When I start a coding project, I normally get a pencil and paper and draw out the process… this normally culminates in a wonky flow chart documenting each major step. I then write this in “pseudocode” in an R script… this could look as simple as this:
# clean datasets
# run interpolation procedure
# calibrate models
# project values
# summarize and visualize results
…though often it’s a bit more detailed. Once you outline the process, you can start to write functions and code snippets that actually implement the analysis. If you use my code organization system, all of your functions will be stored in func.R and all of your data wrangling in clean.R. It’s very helpful to start with the end in mind so you can organize scripts and avoid overlooking major parts of the analysis. As you write your code, COMMENT, COMMENT, COMMENT! If I write a chunk of code that looks like this:
library(sp)
house1.building <- Polygon(rbind(c(1, 1), c(2, 1), c(2, 0), c(1, 0)))
house1.roof <- Polygon(rbind(c(1, 1), c(1.5, 2), c(2, 1)))
house2.building <- Polygon(rbind(c(3, 1), c(4, 1), c(4, 0), c(3, 0)))
house2.roof <- Polygon(rbind(c(3, 1), c(3.5, 2), c(4, 1)))
house2.door <- Polygon(rbind(c(3.25, 0.75), c(3.75, 0.75), c(3.75, 0), c(3.25, 0)), hole = TRUE)
h1 <- Polygons(list(house1.building, house1.roof), "H1")
h2 <- Polygons(list(house2.building, house2.roof, house2.door), "H2")
houses <- SpatialPolygons(list(h1, h2))
plot(houses)
Do you have any clue what’s going on? Or does this make you feel a bit sick? Yeah. If you add COMMENTS to your code (done in R with the # sign), then things are much easier to follow. Combine this with embedding code in NARRATIVE using RMarkdown, and your code will be easy enough for a five-year-old to understand (OK, a really really smart one). Even if you don’t understand the details, you can get the general sense of what I’m doing below (and in a few weeks, do this yourself):
# load the sp package
library(sp)
# create a list of points to draw a house polygon
house1.building <- Polygon(rbind(c(1, 1), c(2, 1), c(2, 0), c(1, 0)))
# create a list of points to draw a roof polygon
house1.roof <- Polygon(rbind(c(1, 1), c(1.5, 2), c(2, 1)))
# repeat for a different house/roof combination
house2.building <- Polygon(rbind(c(3, 1), c(4, 1), c(4, 0), c(3, 0)))
house2.roof <- Polygon(rbind(c(3, 1), c(3.5, 2), c(4, 1)))
# add a door to house 2 (hole = TRUE marks it as a hole in the polygon)
house2.door <- Polygon(rbind(c(3.25, 0.75), c(3.75, 0.75), c(3.75, 0), c(3.25, 0)), hole = TRUE)
# connect the roof and building of house 1 into a single Polygons object
h1 <- Polygons(list(house1.building, house1.roof), "H1")
# connect the roof, building, and door of house 2 into a single Polygons object
h2 <- Polygons(list(house2.building, house2.roof, house2.door), "H2")
# combine both houses into a single "houses" SpatialPolygons object
houses <- SpatialPolygons(list(h1, h2))
# plot our new polygon
plot(houses)
I always start my code by being explicit about the packages and dependencies the code needs. This means I list the packages I need to use in the script at the very top of the page.
library(ggplot2)
library(dplyr)
library(sp)
I also list the data input files at the top of my scripts, e.g.
setwd(".")
infile <- "./data/myfile.csv"
indata <- read.csv(infile)
Also be careful how you name variables: R is case-sensitive, so infile and inFile are NOT the same. Read more here.
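A quick way to see the case-sensitivity point for yourself is the sketch below. It assumes a fresh R session in which nothing called `inFile` has been defined.

```r
# Define the lower-case name only.
infile <- "./data/myfile.csv"

# exists() checks whether an object with exactly this name is defined.
exists("infile")
exists("inFile")  # FALSE in a fresh session, because only `infile` was defined
```

Picking one naming style (for example all lower case with underscores) and sticking to it makes this kind of mix-up much less likely.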
Finally, BACK UP YOUR DATA and consider using version control resources like GitHub. You can find great GitHub tutorials here and here. More info/recommendations on version control here.
R can run into memory issues. It is a common problem to run out of memory after running R scripts for a long time. To inspect the objects in your current R environment, you can list the objects, search current packages, and remove objects that are currently not in use. A good practice when running long, computationally intensive code is to remove temporary objects after they have served their purpose (rm()). However, R will sometimes not clean up unused memory for a while after you delete objects. You can force R to tidy up its memory by using gc(). You can figure out which objects are in your current workspace by calling ls(). You can programmatically delete all objects in your workspace by calling rm(list = ls()), but be sure you’re ready to really remove all objects before you use this approach. You can read more about memory management here.
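The memory housekeeping described above might look something like the following sketch. The object name `big_tmp` and its size are made up for illustration; `object.size()` is a base R helper (not mentioned above) for checking how much memory a single object takes.

```r
# Create a large temporary object (one million doubles, roughly 8 MB).
big_tmp <- numeric(1e6)
print(object.size(big_tmp), units = "MB")

# List the objects currently in the workspace.
ls()

# Remove the temporary object once it has served its purpose.
rm(big_tmp)

# Ask R to run garbage collection now; gc() also reports memory in use.
invisible(gc())

# Confirm the object is gone.
exists("big_tmp")
```

Removing only the objects you name, as here, is usually safer than rm(list = ls()), which clears everything in the workspace.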
Wherever possible, keep a record of the output of sessionInfo() somewhere in your project folder. Session information is invaluable because it captures the R version and the versions of the packages loaded in the current session. If a newer version of a package changes the way a function behaves, you can always go back and re-install the version that worked (Note: At least on CRAN, all older versions of packages are permanently archived).
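One simple way to keep that record is to write the printed session information to a text file. In this sketch the file name `session_info.txt` and the use of `tempdir()` are assumptions made so the example is self-contained; in a real project you might write to the out folder instead.

```r
# Hypothetical output location; in a real project this might be the "out" folder.
session_file <- file.path(tempdir(), "session_info.txt")

# capture.output() turns the printed session information into a character vector.
writeLines(capture.output(sessionInfo()), session_file)

# Peek at the first few lines of what was saved.
head(readLines(session_file))
```

Re-running this at the end of each major analysis gives you a dated trail of the package versions that produced your results.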
Other recommendations on coding best practices here and here. This page provides lots of links to recommendations from the R community. This is a more detailed overview and this text covers Advanced R and project management.