R
opens, stores, and organizes shapefiles.R
handles projection.dplyr
a shapefile.R
.Hooray! We’ve finally made it to my favorite type of data: spatial data! Well, actually nearly all of the data we’ve already played contains some sort of spatial information (country names, state names, etc…), it just did not contain the coordinate and projection information required to make maps and conduct spatial analyses. This tutorial will focus on giving you the tools required to import, project, and create spatial data. There are several packages used by the R
community to bring spatial data into R
. These are:
sp
rgdal
sf
tidyverse
Make sure you have these packages installed and loaded on your machine.
library(sp) # old school - don't really need this package, just loading for reference
library(sf)
library(spData) # a bonus package that has some cool shapefiles
library(ggspatial) # if you want to boost your ggplot mapping skillz
# plus, as always:
library(tidyverse)
R
First, some history…
sf
(simple features) is the new go-to package for spatial analysis in R
, replacing the sp
package of yore.1 sf
is a descendant of the sp
package, made by the same folks, and has good inter-operability with sp
. You can look through the sf
and sp
documentation to get a sense of all the things you can do with these powerful packages. The team behind the sp
package initially created a set of new data classes that integrate our best friend the data.frame
with some friends who at times can be very difficult to hang out with: point, polygon, and raster. These new data classes include the following:
getClass("Spatial")
## Class "Spatial" [package "sp"]
##
## Slots:
##
## Name: bbox proj4string
## Class: matrix CRS
##
## Known Subclasses:
## Class "SpatialPoints", directly
## Class "SpatialMultiPoints", directly
## Class "SpatialGrid", directly
## Class "SpatialLines", directly
## Class "SpatialPolygons", directly
## Class "SpatialPointsDataFrame", by class "SpatialPoints", distance 2
## Class "SpatialPixels", by class "SpatialPoints", distance 2
## Class "SpatialMultiPointsDataFrame", by class "SpatialMultiPoints", distance 2
## Class "SpatialGridDataFrame", by class "SpatialGrid", distance 2
## Class "SpatialLinesDataFrame", by class "SpatialLines", distance 2
## Class "SpatialPixelsDataFrame", by class "SpatialPoints", distance 3
## Class "SpatialPolygonsDataFrame", by class "SpatialPolygons", distance 2
sf
on the other hand has a single sf
class that incorporates a special geometry column into a dataframe. This geometry column can include different types of spatial data (e.g. points, polygons, lines, or multiple types). This is convenient because all sf
objects can basically be treated like data.frames
(PARTY!).
In sp
world, the sub-classes followed by DataFrame
like SpatialPointsDataFrame
, include additional attributes in the form of a data.frame
. For example, if you have a SpatialPolygon
object of US Census blocks, then this object would only include the spatial information required to plot the Census block polygons. With a SpatialPolygonDataFrame
, you can store this spatial information and additional data such as population, race, and income associated with each polygon. This may seem trivial, but, as I’ll show you below, this makes for an incredibly powerful way to organize, visualize, and analyze spatial data. In this class, we’ll focus on using sf
objects since this is quickly becoming the norm, but I wanted you to also know about sf
’s cool grandma, sp
.
Enough about the history of spatial packages in R… let’s open some shapefiles! If you’re like me, you have folders full of shapefiles. A single shapefile is often split into many files (e.g. .cpg
, .dbf
, .prj
, etc.), the most important of which is the sacred .SHP
file. To get all of this info into an organized spatial object in R
is easy. As I mentioned before, you can use lots of packages to do this (rgdal
, sf
, and sp
notably), but in this tutorial, we’ll focus on sf
:
states <- st_read("./data/states.shp")
Here, I use the st_read()
function to open my .shp
file. Note that this function automatically finds all of the ancillary files associated with our shapefile (.dbf
, .prj
, etc). If for some reason these files are missing or broken, st_read()
will throw an error, so make sure your shapefile contains all necessary pieces prior to loading in R
!
To make things easier, in this tutorial we’ll mostly use data from the spData
package. It contains some generic shapefiles for use in R
, already formatted in sf
or sp
format and easy to load. Check out the full list here. Today, we’ll work with the us_states
object:
# make sure you've installed and loaded spData first!
data("us_states")
class(us_states)
## [1] "sf" "data.frame"
Excellent. Note that this us_states
object is listed as both a simple feature (sf
) and as a data.frame
. Let’s inspect our data:
glimpse(us_states)
## Rows: 49
## Columns: 7
## $ GEOID <chr> "01", "04", "08", "09", "12", "13", "16", "18", "20", ...
## $ NAME <chr> "Alabama", "Arizona", "Colorado", "Connecticut", "Flor...
## $ REGION <fct> South, West, West, Norteast, South, South, West, Midwe...
## $ AREA [km^2] 133709.27 [km^2], 295281.25 [km^2], 269573.06 [km^2],...
## $ total_pop_10 <dbl> 4712651, 6246816, 4887061, 3545837, 18511620, 9468815,...
## $ total_pop_15 <dbl> 4830620, 6641928, 5278906, 3593222, 19645772, 10006693...
## $ geometry <MULTIPOLYGON [°]> MULTIPOLYGON (((-88.20006 3..., MULTIPOLY...
Note that if you love sp
(full disclosure, I still use sp
from time to time), it’s really easy to coerce this back to a sp
object.
states_sp <- as_Spatial(us_states)
class(us_states)
## [1] "sf" "data.frame"
But for now we’ll stick with sf
. Let’s take a peak at the other info stored in our states
object. We can also access information about the bounding box of the spatial object:
st_bbox(us_states)
## xmin ymin xmax ymax
## -124.70415 24.55868 -66.98240 49.38436
This is a matrix of bounding box coordinates with column names c(min,max)
with the first row showing the longitude (x-axis) and the second row showing the latitude (y-axis).
We can also access projection information:
st_crs(us_states)
## Coordinate Reference System:
## User input: EPSG:4269
## wkt:
## GEOGCS["NAD83",
## DATUM["North_American_Datum_1983",
## SPHEROID["GRS 1980",6378137,298.257222101,
## AUTHORITY["EPSG","7019"]],
## TOWGS84[0,0,0,0,0,0,0],
## AUTHORITY["EPSG","6269"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4269"]]
This slot holds a string that describes the projection of the data, including both the EPSG code and the proj4string format of the projection. Wait, what? Hang on, I’ll go through the projection stuff in more detail below.
Finally, the moment we’ve all been waiting for… let’s plot this baby! For now, we will use base R
to plot this object, but in a future class we’ll learn about more sophisticated mapping tools.
If you just run plot(us_states)
, you’ll get several maps, one for each attribute in the sf
object:
plot(us_states)
To just plot the geometry, you use the st_geometry()
function:
plot(st_geometry(us_states))
And to plot only a single attribute, try this:
plot(us_states["total_pop_15"], main = "Population in 2015")
SO… what the heck is a projection?
Have you ever thought about how a map is two-dimensional but the world is, in fact, three-dimensional? Anytime we visualize our three-dimensional reality2 in two-dimensions, we have to distort (squish, stretch) the real shape of things so they can fit on a piece of paper or a screen. Maps are two-dimensional objects; the Earth is not. The way we portray (pseudo-)spherical Earth on a flat surface is by using a projection. Each projection is associated with a particular coordinate system. A coordinate system is a way to locate everything on Earth’s surface in x and y space. There are a bazillion coordinate systems, and each coordination system comes with strong assumptions and distortions (and fan-base):
12N
;Each coordinate system is tied to a datum. This is essentially a reference point, the starting point from which coordinates are measured (i.e. lat and lon are based on the equator and prime meridian). In R
we can access the projection information for a sf
object using the st_crs()
function you saw above:
st_crs(us_states)
## Coordinate Reference System:
## User input: EPSG:4269
## wkt:
## GEOGCS["NAD83",
## DATUM["North_American_Datum_1983",
## SPHEROID["GRS 1980",6378137,298.257222101,
## AUTHORITY["EPSG","7019"]],
## TOWGS84[0,0,0,0,0,0,0],
## AUTHORITY["EPSG","6269"]],
## PRIMEM["Greenwich",0,
## AUTHORITY["EPSG","8901"]],
## UNIT["degree",0.0174532925199433,
## AUTHORITY["EPSG","9122"]],
## AUTHORITY["EPSG","4269"]]
Excellent. Our data is currently in the NAD83 Geographic Coordinate System which plots locations on a sphere using longitude and latitude with the North American Datum (NAD) 1983 as a reference line (but does not project these coordinates into two-dimensional space). This is fine if we’re only interested in visualizing our data spatially, but if we have any plans to use the spatial data to compute distance or area, then we need to project our data (from 3D to 2D). This is fairly straightforward in R
. Say we want to reproject this data into USA Contiguous Albers Equal Area Conic:
states <- st_transform(us_states, CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs"))
I got that crazy long string by searching for the projection on spatialreference.org and clicking the Proj4 link. R
expects projection information to come in this proj4
format.
If you happen to have a shapefile that has an undefined projection in R
, but you happen to know what projection the data is in, you can set the projection as follows:
shp <- st_read("./path/to/shp.shp")
st_crs(shp) <- CRS("+proj=aea +lat_1=29.5 +lat_2=45.5 +lat_0=37.5 +lon_0=-96 +x_0=0 +y_0=0 +ellps=GRS80 +datum=NAD83 +units=m +no_defs")
People tend to hate projections3 and generally don’t understand them very well. It is, however, VERY important that you project your data correctly. If you fail to do this, it’s highly likely that your spatial analyses will be wrong. So let’s learn how this stuff works once and for all.
Finally, don’t forget that the map projection you use strongly impacts the way you visualize and analyze your data (see here or here).
R
Let’s say I’m a volcanologist and I find a really cool data describing all eruptions on Earth during the Holocene.4 The data is stored in a .csv
document. We now know how to open the .csv
in R
and inspect its contents:
volcano <- read.csv('./data/volcano.csv')
glimpse(volcano)
## Rows: 1,496
## Columns: 13
## $ ï..Volcano.Number <int> 210010, 210020, 210030, 210040, 211001, 211003...
## $ Volcano.Name <chr> "West Eifel Volcanic Field", "Chaine des Puys"...
## $ Country <chr> "Germany", "France", "Spain", "Spain", "Italy"...
## $ Primary.Volcano.Type <chr> "Maar(s)", "Lava dome(s)", "Pyroclastic cone(s...
## $ Activity.Evidence <chr> "Eruption Dated", "Eruption Dated", "Evidence ...
## $ Last.Known.Eruption <chr> "8300 BCE", "4040 BCE", "Unknown", "3600 BCE",...
## $ Region <chr> "Mediterranean and Western Asia", "Mediterrane...
## $ Subregion <chr> "Western Europe", "Western Europe", "Western E...
## $ Latitude <dbl> 50.170, 45.775, 42.170, 38.870, 43.250, 42.600...
## $ Longitude <dbl> 6.850, 2.970, 2.530, -4.020, 10.870, 11.930, 1...
## $ Elevation..m. <int> 600, 1464, 893, 1117, 500, 800, 949, 458, 1281...
## $ Dominant.Rock.Type <chr> "Foidite", "Basalt / Picro-Basalt", "Trachybas...
## $ Tectonic.Setting <chr> "Rift zone / Continental crust (>25 km)", "Rif...
But we’re not here to learn how to work with .csv
data… we have now graduated to spatial data! So how do we turn our .csv
into a sf
object?
volcano_sf <- st_as_sf(volcano, coords=c("Longitude", "Latitude"))
st_crs(volcano_sf) <- "+proj=longlat +ellps=WGS84 +datum=WGS84 +no_defs"
plot(st_geometry(volcano_sf))
What’s happening here? First, we feed the st_as_sf()
function the two things it needs to turn this .csv
into an sf
object: data and coordinates. The latitude and longitude are stored in original .csv
. Coordinates are always listed in format x,y
, so take a second to think about why longitude
is associated with x
(vertical lines, moving along x-axis of a graph) and latitude
with y
(horizontal lines, moving along the y-axis of a graph).
Notice here that I had to define the projection of my data using the prj
variable. I happen to know that this data was unprojected using the WGS84
datum. For your data, you’ll need to rummage through the metadata to figure out what the projection is.
Ohhh how I want to transform this ugly base R
plot into something better. That’s coming. For now, here’s a plot with base R
and some data from the maptools
package:
data("world") # from spData
plot(st_geometry(world))
plot(volcano_sf["Elevation..m."], pch=17, main = "How high those 'canos", add=T) # all the dots from R reading in the csv poorly
We’ll learn how to use ggplot
for sf
plotting magic soon. For now, note how you can plot a single attribute by calling plot(sf["attribute_name"])
. So in this plot, the color of the point represents the elevation of the volcano! The pch
argument changes the shape of the points. I went with volcanic triangles, but there are many other options. Also added a title using the main
argument.
ggplot2
Loading spatial data in R is cool, but spatial data becomes extremely powerful when you map it. Though it wasn’t originally designed for mapping, you can do quite a bit of mapping with our friend ggplot2
. Check this out:
ggplot() +
geom_sf(data = us_states)
From there, you can do all of the usual ggplot tuning to make things look a bit better:
ggplot() +
geom_sf(data = us_states, color = "blue", fill = "beige") +
labs(title = "The US of A") +
theme_minimal()
What if we want to make a new map visualizing the total population in each major region in the US?
us_states %>%
group_by(REGION) %>%
summarize(POP_TOTAL = sum(total_pop_15)) %>%
ggplot() +
geom_sf(aes(fill = POP_TOTAL)) +
labs(fill = "Population") +
theme_bw()
That’s right guys. I just piped a spatial object into a ggplot
to make a map. One of the best things about the sf
package is that it allows us to apply all of the amazing tools we’ve learned to date (hellooooo tidyverse
) to wrangle and explore data.frames
to spatial data.
So, say we want to highlight the fine state of Georgia. We could do something like this:
us_states %>%
mutate(IS_IT_GA = ifelse(NAME == "Georgia", "Georgia", "Just not Georgia")) %>%
ggplot() +
geom_sf(aes(fill = IS_IT_GA)) + # drop data = us_states; the data is now what you pipe into the plot
# also, move the fill into the aes() b/c it applies differently to each shape
labs(title = "The US of A",
fill = "Which state is it??") +
theme_minimal() +
scale_fill_manual(values = c("dark red", "white"))
And say we have a new shapefile that shows the roads across the US? Then we’d simply add a new geom_sf()
call to the map with this new shapefile, like so:
ggplot() +
geom_sf(data = us_states) +
geom_sf(data = us_roads)
Note that to visualize multiple shapefiles on the same plot, they need to be in the same projection.
Finally, this post has some other useful tricks including using the ggspatial
package to add a scale bar and compass:
library(ggspatial)
us_states %>%
mutate(IS_IT_GA = ifelse(NAME == "Georgia", "Georgia", "Just not Georgia")) %>%
ggplot() +
geom_sf(aes(fill = IS_IT_GA)) +
labs(title = "The US of A",
fill = "Which state is it??") +
theme_minimal() +
scale_fill_manual(values = c("dark red", "white")) +
annotation_scale(location = "bl", width_hint = 0.5) +
annotation_north_arrow(location = "br", which_north = "true",
style = north_arrow_fancy_orienteering())
There are many things we can do with spatial data I haven’t mentioned here. One of the most useful is linking multiple spatial datasets so they can “talk” to one another. You can do this by extracting information from one dataset to another (so, for example, extracting to points values of a shapefile or raster) or by merging data spatially or using columns in the dataset. We’ll talk more about extraction when we talk about rasters. To merge, check out the base R
merge()
function, which works like this:
shp1 <- st_read("./data/shp1.shp")
shp2 <- st_read("./data/shp2.shp")
# assume both shapefiles represent US states and have a column that lists the abbreviation for each state called ST_ABBR
# we can then do a merge like this!
shp_final <- merge(shp1, shp1, by = "ST_ABBR", all = T)
When we merge like this, the new shapefile will include all of the columns in both shp1
and shp2
. Careful, for this to work, you need to be sure the state abbreviations are formatted the same. This is why I add the all = T
argument which includes even rows for which the merge didn’t work. This lets you inspect your merged data to see where the merge may have failed.
…that we won’t cover, but that are very cool and that I encourage you to explore:
sp
package.ggplot2
maps
package.R
and spplot()
can be found here.tmap
capabilities.R
.I’ve got a whole tutorial on how to work with this package here for those who are interested!↩
Ok, ok, I’m not getting into the dimensionality of reality, but you get the idea↩
The only person who was ever pumped about projections was an old Flemish dude named Mercator.↩
Volcanologists rejoice! This data exists and I encourage you to use it!↩