When I first open a .R or .Rmd file in RStudio, my very first step is to tell R where I am on my machine. I do this by clicking Session in the menu at the top of my screen, then Set Working Directory, then To Source File Location. This tells RStudio to set the “working directory”, i.e. the directory where RStudio will look to find files or to save files, to the directory in which my script is located. In this directory, I tend to have a data sub-folder that stores my data for the project. My R script sits outside of this data folder in the main folder, so my directory for each project (or each week of work in your case) looks something like this: a main project folder that holds my script, with a data sub-folder inside it that holds the data files (such as precip.txt).
With this file structure, once I set my working directory to Source File Location (which just calls the setwd() function), I can easily load files stored in my data folder without typing out the long, annoying full path to the data. For example, I can load data from the data folder described above using the following:
my_data <- read.table("./data/precip.txt", sep = ",", header=T)
Here, instead of writing the full directory, e.g. "C:/Users/...", the period (.) stands for the current working directory, so R starts from the directory of my script, then goes into the data folder and opens the file called precip.txt. I use this file structure for all of my projects and recommend you do the same. It makes life much easier. If you’re curious about what R sees as your working directory, call getwd() in the console.
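Here is a minimal sketch of how a relative path resolves against the working directory. It builds a throwaway folder so it can run anywhere; the folder name demo_project, the file toy_precip.txt, and its values are invented for illustration and are not the course data.

```r
# Build a throwaway project folder that mirrors the layout described above.
# All names and values here are invented for illustration.
project_dir <- file.path(tempdir(), "demo_project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a tiny comma-separated file into the data sub-folder.
writeLines(c("ID,JAN,FEB", "A1,7.4,9.5", "A2,9.2,6.9"),
           file.path(project_dir, "data", "toy_precip.txt"))

# setwd() returns the previous working directory, so we can restore it later.
old_wd <- setwd(project_dir)

# "." means "start from the working directory".
toy_data <- read.table("./data/toy_precip.txt", sep = ",", header = TRUE)
print(toy_data)

# normalizePath() shows the full path that the relative path expands to.
full_path <- normalizePath("./data/toy_precip.txt")
print(full_path)

# Put the working directory back the way we found it.
setwd(old_wd)
```

In day-to-day work you would only need the read.table() line; the rest is scaffolding so the example does not depend on files on your machine.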
read.table() is a function. Functions are always followed by parentheses, (), because they take arguments. In the case of read.table(), the arguments include the file path, "./data/precip.txt", and a header argument, which tells the read.table() function whether the precip.txt data has a header, i.e. a row with column names. Since precip.txt does have a header, we tell read.table() that header=T, i.e. header is TRUE. The sep argument tells the read.table() function that this file is a comma-separated file, i.e. each entry is separated with commas rather than spaces or some other symbol.
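To see what the header argument changes, here is a small sketch that reads the same text twice. It uses read.table()'s text argument so no file is needed; the station names and numbers are made up for illustration.

```r
# A small comma-separated table held in a string (values invented for illustration).
toy_text <- "NAME,JAN,FEB
STATION A,7.4,9.5
STATION B,9.2,6.9"

# header = TRUE: the first row supplies the column names.
with_header <- read.table(text = toy_text, sep = ",", header = TRUE)

# header = FALSE: the first row is treated as data, and R invents names V1, V2, V3.
without_header <- read.table(text = toy_text, sep = ",", header = FALSE)

names(with_header)
names(without_header)
nrow(with_header)
nrow(without_header)
```

Without the header, the column names end up inside the data, so the table has one extra row and the number columns are no longer read as numbers.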
If you don’t know what arguments a function requires, you can ask R by typing ? and the name of the function in the Console, i.e. ?read.table. This pops up documentation for the function in the Help pane (bottom right of RStudio) that includes a list of arguments to put in the function as well as some examples of the function in use. Note that you don’t have to put ALL the arguments in a function, e.g. in the case of read.table() we don’t specify stringsAsFactors=F, etc. Any arguments you don’t specify revert to the default values described in the function documentation. Only add arguments when you want to override these defaults.
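If you'd rather check defaults from code than from the Help pane, base R's args() and formals() can help. This is just a quick sketch of one way to peek at them.

```r
# args() prints a function's arguments together with their default values.
args(read.table)

# formals() returns the same information as a list we can pick values from.
defaults <- formals(read.table)
defaults$header   # FALSE by default, which is why we had to ask for a header
defaults$sep      # "" by default, meaning "split on white space"
```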
Base R, or the original R interface you installed on your machine, includes many functions such as read.table(), mean(), head() and others. Much of the power of R comes from using functions built by other users for specific purposes. These functions live in packages, which have to be installed manually on your machine. To install a package, use the install.packages() function. For example, if you want to install the tidyverse suite of packages we’ll use a lot in this class, you’d type the following in your console:
install.packages("tidyverse") # be sure to use quotes
This installs the tidyverse on your machine. To use a package in a session in R, you need to load that package into the current session using the library() function. For example, if you want to access functions in the tidyverse package, you’d need to load that package at the head of your RMarkdown or R script:
library(tidyverse) # no quotes needed
Did you notice the # no quotes needed comment? You can add comments anywhere in your code by placing a hashtag, #, in front of text:
# this is a comment, it will not run as code
# anything without a leading hashtag will run as code, i.e.
print("HEY GUYS")
## [1] "HEY GUYS"
The simplest data form in R is a variable. A variable stores a single value (though in the future we’ll refer to columns in a spreadsheet as variables). Let’s make some variables:
x <- 1
y <- 2
We’ve made two variables, x and y. When you run this code, you’ll see these variables pop up in your Global Environment pane in the top right of your screen.
Variable names are case sensitive (so X is different from x) and cannot start with a number. The <- symbol is meant to be an arrow pointing to the left, from the value toward the name it is assigned to. You can also use an equals sign, so x <- 1 is the same as x = 1. The <- is used by many folks in the R community (including myself) for historical and habitual reasons.
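A quick sketch of these naming and assignment rules. The variable names total, Total, a and b are made up for the example.

```r
total <- 1      # lower-case name
Total <- 100    # capitalised name is a completely separate variable
total == Total  # FALSE: the two names hold different values

a <- 5          # arrow assignment
b = 5           # equals-sign assignment does the same thing here
a == b          # TRUE

# make.names() shows how R repairs a name that is not allowed:
# a name starting with a digit gets an "X" put in front of it.
make.names("2fast")
```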
What happens if you run the following:
x <- y
print(x)
## [1] 2
We have replaced the original value of x, 1, with the value of y, 2. When you assign new values to existing variables, you overwrite the original value, so be careful!
Vectors are multiple elements stored in a single object. For example:
x <- c(1, 1, 3, 5)
print(x)
## [1] 1 1 3 5
x is a vector of numbers. We know this first because we can see it’s composed of numbers but also because the class() (or data type) of the new vector is numeric:
class(x)
## [1] "numeric"
We can also make vectors that contain character data, or strings:
y <- c("Hey", "Hola", "Bonjour", "Yo")
print(y)
## [1] "Hey" "Hola" "Bonjour" "Yo"
class(y)
## [1] "character"
The vector is of class character, a class used to indicate text.
We can index vectors using square brackets and numbers indicating the location of an element. For example, if we want to know the third element in the y vector, we’d do the following:
y[3]
## [1] "Bonjour"
What if you want to create a vector of both character and numeric data? Short answer is you can’t: a vector holds a single data type, so R will quietly convert the numbers to text. This is where lists come in. A list can store lots of different types of data, e.g.:
my_list <- list("Hey", 1, 2, "Hola")
print(my_list)
## [[1]]
## [1] "Hey"
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] "Hola"
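To see the difference for yourself, here is a small sketch that puts the same four values into a vector and into a list. The names mixed_vector and mixed_list are just for this example.

```r
# In a vector, everything is forced into one type.
mixed_vector <- c("Hey", 1, 2, "Hola")
print(mixed_vector)
class(mixed_vector)      # "character": the numbers were turned into text

# In a list, each element keeps its own type.
mixed_list <- list("Hey", 1, 2, "Hola")
class(mixed_list[[1]])   # "character"
class(mixed_list[[2]])   # "numeric"
```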
We can even make lists of lists:
my_list2 <- list(1, 3, "Dog")
big_list <- list(my_list, my_list2)
print(big_list)
## [[1]]
## [[1]][[1]]
## [1] "Hey"
##
## [[1]][[2]]
## [1] 1
##
## [[1]][[3]]
## [1] 2
##
## [[1]][[4]]
## [1] "Hola"
##
##
## [[2]]
## [[2]][[1]]
## [1] 1
##
## [[2]][[2]]
## [1] 3
##
## [[2]][[3]]
## [1] "Dog"
If we want to extract the first list from our big_list, we’d call:
big_list[[1]]
## [[1]]
## [1] "Hey"
##
## [[2]]
## [1] 1
##
## [[3]]
## [1] 2
##
## [[4]]
## [1] "Hola"
If we want the second element of the first list, we’d call:
big_list[[1]][[2]]
## [1] 1
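One thing that trips people up is the difference between single and double square brackets on a list. Here is a short sketch; the named_list at the end uses made-up names to show that list elements can also be labelled.

```r
my_list <- list("Hey", 1, 2, "Hola")

single <- my_list[2]    # single brackets: a shorter list that still wraps the value
double <- my_list[[2]]  # double brackets: the value itself

class(single)  # "list"
class(double)  # "numeric"

double + 1     # works, because double is a plain number
# single + 1   # would fail: arithmetic is not defined for a list

# Lists can also carry names, which allows indexing with $ (names invented here).
named_list <- list(greeting = "Hey", number = 1)
named_list$greeting
```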
We won’t work a lot with lists in the class, but they are great to know about. Instead we’ll work quite a bit with data.frames, which I’ll talk about next.
data.frame
data.frames are another fantastic tool to organize lots of different types of data, and in my opinion are more intuitive than lists. They work a bit like an Excel spreadsheet, with columns indicating different variables and rows indicating observations. Columns can be different data types, so we can zip a vector of string data and a vector of numeric data into a single data.frame. Let’s squish our x and y vectors into a data.frame. The easiest way to do this is using the data.frame() function:
df <- data.frame(x, y)
print(df)
## x y
## 1 1 Hey
## 2 1 Hola
## 3 3 Bonjour
## 4 5 Yo
Cool! You can also view df in a separate window by typing View(df) in the Console. If you want more control over the shape of your new data.frame, check out cbind.data.frame() and rbind.data.frame().
You can index a data.frame using square brackets. Say you want the element in row 1, column 2. We can index this as follows:
df[1,2] # python users, note that this is row, column NOT column, row indexing
## [1] Hey
## Levels: Bonjour Hey Hola Yo
We can also index using the dollar sign:
df$x
## [1] 1 1 3 5
This selects the entire column x.
Finally, we can index using the actual column names rather than the index number:
df[,"x"]
## [1] 1 1 3 5
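A few more indexing patterns, sketched on the same small data.frame (rebuilt here so the block runs on its own):

```r
x <- c(1, 1, 3, 5)
y <- c("Hey", "Hola", "Bonjour", "Yo")
df <- data.frame(x, y)

df[2, ]           # leave the column slot empty to get the whole second row
df[, 2]           # leave the row slot empty to get the whole second column
df[["y"]]         # double brackets with a name also return a single column
df[c(1, 3), "x"]  # several rows at once: rows 1 and 3 of column x
```

The empty slot before or after the comma reads as "everything", which is why df[,"x"] returns the full column.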
Nice! Here are a few other useful data.frame hacks. To get the list of unique elements in a column (or vector):
unique(df$x)
## [1] 1 3 5
To get the dimensions of a data.frame, try these:
length(df$x)
## [1] 4
nrow(df)
## [1] 4
ncol(df)
## [1] 2
dim(df) # row, column
## [1] 4 2
data.frames are rad and we’ll spend most of the class working with these. We’ll use dplyr to do most of our indexing of data.frames, but subset() is a useful function to know about. Say you want to find the rows in df where x is equal to one:
subset(df, x == 1)
## x y
## 1 1 Hey
## 2 1 Hola
Why the ==? The == is like a TRUE/FALSE statement, for example:
z <- 1
z == 1
## [1] TRUE
We can read z == 1 like the question “Is z equal to one?”. It is, so R returns TRUE. On the other hand:
z == 2
## [1] FALSE
When using subset() (and dplyr further down the line), we’ll use the == for filtering. Just be sure to remember that == tests for relationships, normally returning TRUE or FALSE, while a single = assigns values, just like <-.
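Under the hood, filtering is just a TRUE/FALSE value for every row. Here is a rough sketch using the same small data.frame (rebuilt so the block runs on its own); the variable name keep is made up.

```r
df <- data.frame(x = c(1, 1, 3, 5),
                 y = c("Hey", "Hola", "Bonjour", "Yo"))

keep <- df$x == 1   # one TRUE/FALSE per row
print(keep)

df[keep, ]          # same rows as subset(df, x == 1)
df[df$x != 1, ]     # != means "not equal to"
df[df$x >= 3, ]     # >= means "greater than or equal to"
sum(keep)           # TRUE counts as 1, so this counts the matching rows
```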
Let’s go back to our precipitation data and use our new data.frame skillz to explore this dataset:
my_data <- read.table("./data/precip.txt", sep = ",", header=T)
We can look at the full data.frame by typing View(my_data) in the console. We can look at the first few rows using the head() function and the final rows with the tail() function:
head(my_data)
## ID NAME LAT LONG ALT JAN FEB MAR APR MAY JUN JUL
## 1 ID741 DEATH VALLEY 36.47 -116.87 -59 7.4 9.5 7.5 3.4 1.7 1.0 3.7
## 2 ID743 THERMAL/FAA AIRPORT 33.63 -116.17 -34 9.2 6.9 7.9 1.8 1.6 0.4 1.9
## 3 ID744 BRAWLEY 2SW 32.96 -115.55 -31 11.3 8.3 7.6 2.0 0.8 0.1 1.9
## 4 ID753 IMPERIAL/FAA AIRPORT 32.83 -115.57 -18 10.6 7.0 6.1 2.5 0.2 0.0 2.4
## 5 ID754 NILAND 33.28 -115.51 -18 9.0 8.0 9.0 3.0 0.0 1.0 8.0
## 6 ID758 EL CENTRO/NAF 32.82 -115.67 -13 9.8 1.6 3.7 3.0 0.4 0.0 3.0
## AUG SEP OCT NOV DEC
## 1 2.8 4.3 2.2 4.7 3.9
## 2 3.4 5.3 2.0 6.3 5.5
## 3 9.2 6.5 5.0 4.8 9.7
## 4 2.6 8.3 5.4 7.7 7.3
## 5 9.0 7.0 8.0 7.0 9.0
## 6 10.8 0.2 0.0 3.3 1.4
tail(my_data)
## ID NAME LAT LONG ALT JAN FEB MAR APR MAY JUN
## 451 ID42173 HUNTINGTON-LAKE 37.23 -119.21 2140 165 160 151 84 32 9
## 452 ID42992 TWIN-LAKES 38.70 -120.03 2438 210 173 169 104 52 29
## 453 ID43093 BISHOP-CREEK-INTAKE-2 37.25 -118.58 2485 57 42 33 21 15 11
## 454 ID43574 GEM-LAKE 37.75 -119.13 2734 86 70 71 41 19 15
## 455 ID43616 LAKE-SABRINA 37.21 -118.61 2763 75 59 51 33 16 10
## 456 ID43770 ELLERY-LAKE 37.93 -119.23 2940 110 85 77 40 23 16
## JUL AUG SEP OCT NOV DEC
## 451 5 6 28 48 117 132
## 452 15 24 35 72 183 191
## 453 11 15 17 12 37 45
## 454 16 19 23 24 70 77
## 455 11 12 20 18 50 58
## 456 21 20 23 35 91 96
We can index a column using the $, i.e. my_data$ID.
If you look at my_data in the Global Environment and click on the blue arrow, you’ll see lots of information about each column in the dataset, including the class (e.g. Factor, num, chr), the number of elements (rows) in each vector, and some examples of what the content of each vector looks like. Wait, what the heck is a factor?
Factors are frequently used to store categorical data like identifiers and groupings. For example, imagine you’re working with Census data and have a data.frame with county-level data grouped into states. The State column should be stored as a factor since it groups the county-level data. If there is an error in the data set and North Carolina is spelled as Norht Carolina one time, the incorrectly spelled NC will be a whole separate factor level. I think of factors as calling the unique() function on a vector and storing each unique entry as a level. The utility of factors will become more apparent as we move through the class. In our dataset, ID and NAME are stored as factors (this is the default in versions of R before 4.0.0; newer versions read text columns as character vectors unless you set stringsAsFactors=T). If we didn’t want them to be factors and instead wanted them to be character vectors, we could add the stringsAsFactors=F argument to read.table().
my_data <- read.table("./data/precip.txt", sep = ",", header=T, stringsAsFactors = F)
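Here is a small sketch of the misspelling problem described above. The state values are invented for the example, and the typo is deliberate.

```r
# Invented example values; "Norht Carolina" is a deliberate typo.
state <- c("North Carolina", "Virginia", "North Carolina", "Norht Carolina")

state_factor <- factor(state)
levels(state_factor)   # the typo shows up as its own level
table(state_factor)    # how many entries fall in each level
n_levels_before <- nlevels(state_factor)

# Fix the typo in the underlying text, then rebuild the factor.
state[state == "Norht Carolina"] <- "North Carolina"
state_factor <- factor(state)
levels(state_factor)   # the extra level is gone

# as.character() turns a factor back into plain text.
class(as.character(state_factor))
```

Calling levels() or table() on a grouping column is a cheap way to spot this kind of typo before it splits your groups.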
You can change and/or add columns to a data.frame as follows:
my_data$ALT10 <- my_data$ALT + 10 # creates a new column called ALT10 that is ALT plus 10
my_data$LAT <- NA # replaces ALL values of the LAT column with NA, be careful when you overwrite columns!!
my_data$JUNJUL <- my_data$JUN + my_data$JUL # new column that sums precipitation in June and July
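If you want to try these column operations without touching the real dataset, here is a sketch on a two-row toy data.frame. The values are copied from the first and last stations printed above; the column name ALT_backup is made up.

```r
toy <- data.frame(ALT = c(-59, 2140), JUN = c(1.0, 9), JUL = c(3.7, 5))

toy$JUNJUL <- toy$JUN + toy$JUL   # element-wise sum, one result per row
toy$ALT_backup <- toy$ALT         # keep a copy before overwriting a column
toy$ALT <- toy$ALT + 10           # overwrite ALT in place
toy$JUL <- NULL                   # assigning NULL removes a column entirely
print(toy)
```

Making a backup column first is one simple way to protect yourself from the overwriting problem mentioned in the comments above.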
R also makes it VERY easy to quickly visualize your data. We’ll learn how to make visualizations with ggplot2 in this class, but you can also make pretty sweet visualizations with base R.
# histogram of precip in June
hist(my_data$JUN, main = "June precipitation")
# visualization of the relationship between June precip and altitude
plot(my_data$JUN, my_data$ALT, xlab = "June precipitation", ylab = "Altitude", main = "Precipitation and elevation")
Many of the errors you encounter as you code will be associated with a mismatch between the data type you’re working with and the data type that the function or tool you’re using expects. If you feed a character dataset to a function like mean() that’s expecting a vector of numbers, R will become confused:
my_data <- c("Hey", "there", "dude")
mean(my_data)
## Warning in mean.default(my_data): argument is not numeric or logical: returning
## NA
## [1] NA
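A common version of this problem is numbers that were read in as text. Here is a minimal sketch of checking and converting them; the values and the names rain_text and rain_num are invented for illustration.

```r
# Numbers that were read in as text (values invented for illustration).
rain_text <- c("7.4", "9.5", "oops")

is.numeric(rain_text)   # FALSE, so mean() would not work on it

# as.numeric() converts what it can; anything it cannot convert becomes NA
# (R also prints a warning about that, silenced here to keep the output tidy).
rain_num <- suppressWarnings(as.numeric(rain_text))
print(rain_num)

# na.rm = TRUE tells mean() to skip the NA entries.
mean(rain_num, na.rm = TRUE)
```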
It’s good practice to check to see how R understands your data - how does R understand the my_data vector I just made?
class(my_data)
## [1] "character"
The class() function returns information about how R is understanding the my_data vector. It says, “Hey, I think this vector is a bunch of character entries”, which sounds right, since the character class is indicative of letters, words, and text, also called strings. You can also ask R what the mode of the vector is:
mode(my_data)
## [1] "character"
In this case, mode() also returns character. The difference between the mode and class of an object in R is subtle and not essential to this class, but here’s a basic explanation, taken from this great overview:
mode reflects the basic structure of an object in R. Modes include numeric, character (like the vector above), complex, and logical. Modes can also be lists or functions. Each object only has one mode.
class is an object property that determines how the object “plays” with other functions. A data.frame is a class. When we get to spatial data, you’ll see that S4 objects are classes (i.e. a specific data configuration required for many spatial functions to work).
Let’s look at a few examples to really understand data types in R. OK, let’s start by making a variable called x that is equal to 5:
x <- 5
print(x)
## [1] 5
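As a quick aside on the mode and class distinction described above, here is a sketch comparing the two on a few objects. The names toy_df and m are made up for the example.

```r
x <- 5
class(x)        # "numeric"
mode(x)         # "numeric"

toy_df <- data.frame(a = 1:2, b = c("u", "v"))
class(toy_df)   # "data.frame"
mode(toy_df)    # "list": underneath, a data.frame is a list of columns

class(mean)     # "function"
mode(mean)      # "function"

m <- matrix(1:4, nrow = 2)
class(m)        # includes "matrix"
mode(m)         # "numeric"
```

For simple vectors the two usually agree; they differ for richer objects like data.frames, where the class tells functions how to treat the object and the mode tells you what it is built from.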
Now let’s create a vector called v that is equal to three numbers, 5, 10, 20:
v <- c(5, 10, 20)
print(v)
## [1] 5 10 20
c(), short for concatenate, basically zips the three numbers into a vector. We could also do this:
x <- 5
y <- 10
z <- 20
v2 <- c(x,y,z)
print(v2)
## [1] 5 10 20
Are the two vectors the same?
v == v2
## [1] TRUE TRUE TRUE
Each pair of elements in these vectors is equal, so R returns TRUE for each one. What happens if we only use ONE equals sign?
v = v2
CAREFUL. This replaces the values in v with the values of v2. Remember that == tests for relationships, normally returning TRUE or FALSE. = assigns values, in this case replacing v with v2.
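Because == compares vectors element by element, you often want a single TRUE or FALSE answer instead. Here is a short sketch using a few base R helpers; the vector v3 is made up so that one element differs.

```r
v  <- c(5, 10, 20)
v2 <- c(5, 10, 20)
v3 <- c(5, 10, 99)

v == v3            # element by element: TRUE TRUE FALSE
all(v == v2)       # TRUE: every pair matches
all(v == v3)       # FALSE: at least one pair differs
any(v == v3)       # TRUE: at least one pair matches
identical(v, v2)   # TRUE: same type, length and values
which(v != v3)     # 3: the position where the two vectors differ
```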