We'll wrangle sp objects (first with base R, then with spdplyr) and use rgeos to compute geometries. Yeah, lots of packages this week. I tried to find the most efficient/elegant tools in the universe of spatial packages in R, which led me to several different packages.
library(maptools)
library(maps)
library(rgdal)
library(sp)
library(spdplyr)
library(raster)
library(rgeos)
We'll visualize everything with the base plot() function while we wait (with great anticipation) for the mapping extravaganza that will happen next week. Throughout, I'll link to other R resources so that you can explore and find more information relevant to your particular analyses.

Today we'll be joining data collected at a hydrological scale (watershed) to data collected at an administrative scale (county). We'll be working with a county-level shapefile of the entire US and with the Watershed Boundary Dataset (WBD). Our objective is to join counties to the larger hydrologic unit of which they are a part. The US Geological Survey (USGS) has tons of hydrological data available for you to play with. The WBD contains the boundaries of USGS "hydrologic units" (HU) at multiple scales, including region (HU2), sub-region (HU4), basin (HU6), sub-basin (HU8), watershed (HU10), and sub-watershed (HU12). Each HU has an HU Code, or HUC, that uniquely identifies that water entity. It's neat to think about your location in the world not only as a member of a town, county, state, and nation, but also as a member of a region, sub-region, basin, watershed, etc. For example, here in Logan, Utah, we are members of:
If you're not in Logan, you can find out your hydro-citizenry here. If you want an even more detailed version of your hydro-citizenry, check out this amazing (but 3-gigabyte!) dataset.
Ok, back to the objective. I have a map of national sub-regions (HU4). I want to determine which counties are located within each region (HU2). Luckily, some of the HU2 data is contained in the HU4 dataset. Let's begin by looking at our data:
cty <- readRDS("./data/county.RDS")
ws <- readRDS("./data/wdb_sub.RDS")
To help facilitate data transfer, I originally loaded the shapefile (.shp) versions of each of these files into R and then saved them as RDS files (which preserve the R-specific sp structure of the data) using saveRDS(). This is why you're able to load these files into R using the readRDS() function.
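For the record, here's a minimal sketch of that round trip (the .shp path here is hypothetical; only the .RDS files ship with this post):
# read the shapefile into an sp object, then cache it with its sp structure intact
cty_shp <- readOGR("./data/raw/county.shp")
saveRDS(cty_shp, "./data/county.RDS")
cty <- readRDS("./data/county.RDS")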
First, always be sure to check the projections of your data:
cty@proj4string
## CRS arguments:
## +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
ws@proj4string
## CRS arguments:
## +proj=longlat +datum=NAD83 +no_defs +ellps=GRS80 +towgs84=0,0,0
Both datasets are in the same projection - we should be good to go!
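If they hadn't matched, one spTransform() call would line them up. A sketch (a no-op here, since the two CRSs already agree):
ws <- spTransform(ws, cty@proj4string)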
sp objects

Before we begin, I want to review a few useful things you can do with sp objects in R. First, it's fairly easy to create a new shapefile that is a subset of the original dataset. Let's say I want a shapefile containing only the HU4 watersheds located in our sub-region of the world, the Bear River sub-region. To find these, I need to know that the HU4 ID for the Bear River sub-region is 01, and then I can simply call:
great_basin <- ws[ws$HUC4 == "01",]
class(great_basin)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
head(great_basin@data)
## OBJECTID TNMID METASOURCE SOURCEDATA
## 176 176 {7492C028-E858-42FB-848C-A49CB81CEE4B} <NA> <NA>
## SOURCEORIG SOURCEFEAT LOADDATE GNIS_ID AREAACRES AREASQKM STATES
## 176 <NA> <NA> 2012/06/11 0 4809301 19462.57 ID,UT,WY
## HUC4 NAME Shape_Leng Shape_Area HUC2
## 176 01 Bear 12.23277 2.111604 16
plot(great_basin)
Or say we want to make a shapefile with only one of the attributes in our larger ws object:
ws_sub <- ws[, 'HUC4']
class(ws_sub)
## [1] "SpatialPolygonsDataFrame"
## attr(,"package")
## [1] "sp"
Again, what's nice here is that you can apply simple indexing to a fairly complex spatial object to subset based on row and column values. You can also use base R to add a variable to a spatial object:
ws$ID <- seq(1, nrow(ws))
head(ws$ID)
## [1] 1 2 3 4 5 6
The next step is to dissolve the smaller HUC4 polygons into larger HUC2 polygons. To do this, we can use a nifty tool from the maptools package called unionSpatialPolygons() to create a coarser-resolution HUC2 shapefile from the higher-resolution HUC4 dataset.
# note that the union may take a second to run
library(maptools)
huc2 <- unionSpatialPolygons(SpP = ws, IDs = ws$HUC2)
plot(huc2, border="red")
Wow. This is cool. Notice, however, that the new object we built lost the attributes in our original higher-resolution data and is now just a SpatialPolygons object. R can dissolve polygons, but it doesn't know how to dissolve the values in the attribute table unless we tell it how (i.e. sum, mean, or some other function). This is where the aggregate() function from the sp package comes in. I like this function because it allows you to apply a function to the objects you're aggregating. This would be useful if you wanted to, say, compute the total area of the new HUC2 watershed by summing the areas of the HUC4 watersheds dissolved into the larger HUC2 area. Note that whatever function you use in aggregate() must return a single value, which will be associated with the larger polygon.
watershed_size <- aggregate(ws["AREAACRES"], huc2, FUN = sum)
watershed_size@data$AREAACRES
## [1] 90699794
Finally, there's a nifty tool in the rgeos package that lets you dissolve all polygons in a shapefile into a single boundary (i.e. all counties in the US into the boundary of the US):
library(rgeos)
boundary <- gUnaryUnion(cty)
plot(boundary)
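gUnaryUnion() also accepts an id argument if you want to dissolve by group rather than all at once. A sketch, assuming cty carries the state FIPS column STATEFP (it appears later in this post):
state_bounds <- gUnaryUnion(cty, id = cty$STATEFP) # one polygon per state
plot(state_bounds)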
In many cases, you have several spatial datasets you want to connect in some way. If you're lucky, you have unique identifiers in each dataset that are formatted exactly the same and can be used to join the two datasets with nothing more than an attribute join and the merge() function (see the Attribute Join section below). In many cases, however, you'll only have the spatial locations of the two datasets. Maybe you want to link a point shapefile documenting Bigfoot sightings to a raster describing land use. Maybe you want to do something a bit more mundane, like compute the amount of roads within a county. Or perhaps you want to extract remotely sensed metrics of temperature or vegetation health to points in space. You can do this using spatial joins. I'll walk you through a few examples below, but know that each spatial join is special, i.e. full of its own challenges and particularities.

In general, we'll be working with the over() function for spatial joins. over(x, y) does the following: if y has no attributes, it returns the indices of the features of y corresponding to each feature of x, or NA where there is no correspondence; if y has attributes, it returns a data.frame with the attributes of y corresponding to the locations of x, again with NA where there is no correspondence. Here, correspondence means that the two spatial features intersect (touch, overlap, cover, include, etc.). If a feature of x matches multiple features (e.g. points) in y, then over(x, y) returns an arbitrary first match. If you want ALL features of y that fall within each shape of x, use over(x, y, returnList = TRUE). You can also use over() to apply a function to all instances of y within each instance of x, e.g. to compute the mean population of cities within each county of a county shapefile, by passing a function to over(): over(x, y, fn = mean). Below, I'll work through some examples.
Suppose you have point data indicating the locations of cities in the US. You're interested in identifying watersheds with a high number of cities; you hypothesize that human activities in these watersheds may impact water quality and want to test this hypothesis. You start by finding a shapefile of US cities (R and the maps package make this easy):
library(maps)
data(us.cities) #loads dataset from maps package
head(us.cities)
## name country.etc pop lat long capital
## 1 Abilene TX TX 113888 32.45 -99.74 0
## 2 Akron OH OH 206634 41.08 -81.52 0
## 3 Alameda CA CA 70069 37.77 -122.26 0
## 4 Albany GA GA 75510 31.58 -84.18 0
## 5 Albany NY NY 93576 42.67 -73.80 2
## 6 Albany OR OR 45535 44.62 -123.09 0
Notice that this is a data.frame and not an sp object. To sp this thing, we need to use the SpatialPointsDataFrame() function and supply coordinates, data, and projection information.
coords <- cbind(us.cities$long, us.cities$lat)
us.cities <- SpatialPointsDataFrame(coords = coords, data = us.cities, proj4string = huc2@proj4string)
# note that I was lazy and did NOT check the datum for us.cities. You'd want to do this to correctly set the proj4string argument
plot(us.cities)
Let's crop this dataset to only include the cities within the boundaries of our original ws watershed object. First check the projections (identical() is a cool function for doing that quickly), reproject as necessary, and then crop!
identical(huc2@proj4string, us.cities@proj4string)
## [1] TRUE
city_crop <- crop(us.cities, huc2)
plot(city_crop)
plot(ws, add=T)
A lot of nothing in the middle of this watershed, huh? But what appears to be the Wasatch Front definitely pops out! It would be helpful to know which cities are in each smaller watershed. To return a list of the large cities in each watershed, let's use our new over() function:
city_list <- over(ws, us.cities, returnList=T)
city_list$`26`
## name country.etc pop lat long capital
## 144 Carson City NV NV 58350 39.15 -119.74 2
## 746 Reno NV NV 206626 39.54 -119.82 0
## 858 Sparks NV NV 88518 39.54 -119.74 0
This returns a list of the cities in each polygon of ws. Even cooler, you can apply functions to all points within a polygon using the fn argument. Want to know the average city population within each watershed?
avg_pop <- over(ws, us.cities[,"pop"], fn = mean)
avg_pop
## pop
## 26 117831.33
## 71 44614.00
## 88 NA
## 143 NA
## 174 89596.55
## 175 45262.00
What about the total population?
total_pop <- over(ws, us.cities[,"pop"], fn = sum)
total_pop
## pop
## 26 353494
## 71 44614
## 88 NA
## 143 NA
## 174 985562
## 175 45262
You can also write your own crazy functions and apply them to the data!
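For example, here's a sketch of a homemade function applied through fn: the share of each watershed's cities with more than 100,000 residents (the cutoff is arbitrary):
big_share <- function(x) mean(x > 1e5) # proportion of city populations above 100k
over(ws, us.cities[, "pop"], fn = big_share)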
Another useful thing to know is how to extract data from a raster to a polygon. Let's say you're working with some CropScape data from Oconee County, South Carolina (a lovely place, y'all):
oconee <- raster("./data/CDL_2016_45073.tif")
plot(oconee)
…and imagine you'd like to compute the percent of land in the county covered in Open Water (code 111). Using the extract() function in the raster package, you can pull this information out to the county level (impressed?):
extract <- raster::extract #tell R that when I call extract, I'm talking about the raster function
# extract OC shapefile from the larger county dataset
oconee_shp <- cty[cty$GEOID == "45073",]
# reproject the oconee shape to the projection of the CDL data
oconee_shp <- spTransform(oconee_shp, oconee@crs)
# extract all pixels values for this county
all_pixels <- extract(oconee, oconee_shp, na.rm=T, df=T) # may take a second
So after prepping the data (all functions you should be familiar with by now), I used extract() to pull out every pixel of the oconee raster that falls within the oconee_shp polygon. I added the na.rm=T argument to drop NA pixels and the df=T argument to return the pixels as a data.frame rather than as a list, which allows me to apply the length() function and indexing to the extracted data. In the case of this raster, extract() returned a data.frame with two columns: an ID column and a column with the actual data called CDL_2016_45073. The ID column identifies which polygon each pixel came from, so it becomes useful when you extract to several polygons at once; likewise, if you extract from a multiband raster, each band comes back as its own column. Let's find the percent of Oconee County covered in water (code 111 in the CropScape data). First, we have to pull the data out of the extracted data.frame - in this case we don't need to worry about the ID column since we extracted to a single polygon, and since we have only one band of data we can grab the data column directly:
ras_data <- all_pixels$CDL_2016_45073
# find the percent of pixels == 111
length(ras_data[ras_data == 111])/length(ras_data)
## [1] 0.06894612
Another, more elegant option here is to create a function that computes the share of water in a polygon and feed that to the extract() function. This returns only the percent water, rather than all the pixel values for the county boundary.
# a function to compute the share of pixels equal to 111 (open water);
# extract() passes na.rm through, so the function needs an na.rm argument
water <- function(x, na.rm = TRUE) {
  length(x[x == 111]) / length(x)
}
extent_water <- extract(oconee, oconee_shp, fun = water, na.rm=T, df=T)
extent_water
## ID CDL_2016_45073
## 1 1 0.06894612
You can also extract based on a buffer around a point, which can come in handy. More info on how to do this can be found here.
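A minimal sketch, assuming a hypothetical SpatialPoints object pts in the raster's CRS (the buffer is in map units, meters here):
# share of open water within 1 km of each point, reusing the water() function from above
buffered_water <- extract(oconee, pts, buffer = 1000, fun = water, na.rm = TRUE)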
A tip here. As you're extracting, it's much less computationally intensive to reproject the vector data you're extracting to than to reproject the raster... rasters are often much bigger and take much longer to reproject.
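In code, the cheap route versus the expensive one:
# cheap: reproject the polygon to the raster's CRS (what we did above)
oconee_shp <- spTransform(oconee_shp, oconee@crs)
# expensive: reproject the whole raster instead (method = "ngb" preserves categorical codes)
# oconee_ll <- projectRaster(oconee, crs = cty@proj4string, method = "ngb")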
If you think you’ll be extracting raster values to points or polygons, I strongly recommend this tutorial. Finally, for the very bold folks who have to extract massive amounts of high resolution raster data, you will definitely want to check out this post.
Extracting from a raster to a series of points follows the same pattern. If you have a set of points in Oconee County (alas, my home county has little GIS data so I was unable to find any), simply do the following:
value <- extract(oconee, mypoints)
This will return the value of the CropScape raster at each point in mypoints.
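Since I don't have a real mypoints to share, here's a sketch with made-up coordinates (arbitrary spots, not actual Oconee landmarks):
# two hypothetical points entered as lon/lat, then reprojected to match the raster
mypoints <- SpatialPoints(cbind(c(-83.05, -82.95), c(34.75, 34.68)),
                          proj4string = CRS("+proj=longlat +datum=NAD83"))
mypoints <- spTransform(mypoints, oconee@crs)
value <- extract(oconee, mypoints)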
We've already learned how to crop shapefiles and rasters, but here are a few reminders and a new cropping tool from the rgeos package, just to broaden your toolkit. Say I want to extract the us.cities located in the state of California. As you now know, we can do this using the crop() function from the raster package.
library(raster)
states <- shapefile("./data/states.shp")
ca <- states[states@data$STATE_NAME == "California",]
ca_cities <- crop(us.cities, extent(ca))
plot(ca)
plot(ca_cities, add=T)
Don't forget to check the projections of both shapefiles before you crop! Notice too that the crop() function crops to the extent of the ca shapefile, which is a square (bounding box). This is why some cities outside of California are included. If you want to crop to the shape itself, use the gIntersection() tool from rgeos or some of the tools we saw during the raster module:
library(rgeos)
ca_cities_only <- gIntersection(us.cities, ca)
plot(ca)
plot(ca_cities_only, add=T)
The best way to merge an existing data.frame with a spatial object is with the merge() function. Remember that for this function to work, you'll need a column in each dataset with a unique identifier that can be used to link the two datasets. These two columns can have different names, but their content must be the same (i.e. if the unique ID for Logan, UT in one dataset is "10012", then the other dataset's ID for Logan needs to be exactly the same, "10012"). This can be tricky at times and can involve some string manipulation to get the IDs into the same format. Here's a quick example to give you a sense of how the merge() function works. Imagine you want to link our us.cities data to a shapefile of US states.
states <- readOGR("./data/states.shp")
## OGR data source with driver: ESRI Shapefile
## Source: "./data/states.shp", layer: "states"
## with 51 features
## It has 5 fields
plot(states)
plot(us.cities, add=T, col = "red")
Merging these datasets is pretty easy since they both contain a column that holds the abbreviations of US states. We’ll use these columns to merge the datasets. I always like to start by inspecting the two columns I’ll use to perform the merge to check for any differences or weird values:
unique(states@data$STATE_ABBR)
## [1] HI WA MT ME ND SD WY WI ID VT MN OR NH IA MA NE NY PA CT RI NJ IN NV
## [24] UT CA OH IL DC DE WV MD CO KY KS VA MO AZ OK NC TN TX NM AL MS GA SC
## [47] AR LA FL MI AK
## 51 Levels: AK AL AR AZ CA CO CT DC DE FL GA HI IA ID IL IN KS KY LA ... WY
unique(us.cities@data$country.etc)
## [1] "TX" "OH" "CA" "GA" "NY" "OR" "NM" "LA" "VA" "PA" "FL" "IA" "AK" "IN"
## [15] "MA" "MI" "MD" "MN" "WI" "IL" "CO" "NC" "NJ" "AL" "WA" "ME" "AZ" "TN"
## [29] "NE" "MT" "MS" "ND" "MO" "ID" "UT" "KY" "CT" "OK" "NV" "WY" "SC" "WV"
## [43] "NH" "AR" "RI" "DE" "HI" "KS" "VT" "SD" "DC"
mode(states@data$STATE_ABBR)
## [1] "numeric"
mode(us.cities@data$country.etc)
## [1] "character"
The first issue is that the states column is a factor (mode() reports the numeric codes underneath - notice the 51 Levels in the output above) while the us.cities column is a character vector. Let's fix this:
states@data$STATE_ABBR <- as.character(states@data$STATE_ABBR)
unique(states@data$STATE_ABBR)
## [1] "HI" "WA" "MT" "ME" "ND" "SD" "WY" "WI" "ID" "VT" "MN" "OR" "NH" "IA"
## [15] "MA" "NE" "NY" "PA" "CT" "RI" "NJ" "IN" "NV" "UT" "CA" "OH" "IL" "DC"
## [29] "DE" "WV" "MD" "CO" "KY" "KS" "VA" "MO" "AZ" "OK" "NC" "TN" "TX" "NM"
## [43] "AL" "MS" "GA" "SC" "AR" "LA" "FL" "MI" "AK"
Nice. Now let's merge these datasets. The by.x and by.y arguments let us specify which column to use in each dataset, even when the column names differ:
city_state <- merge(us.cities@data, states@data, by.x = "country.etc", by.y = "STATE_ABBR")
head(city_state)
## country.etc name pop lat long capital STATE_NAME
## 1 AK Juneau AK 31187 58.30 -134.42 2 Alaska
## 2 AK Anchorage AK 279428 61.18 -149.19 0 Alaska
## 3 AL Dothan AL 62449 31.24 -85.41 0 Alabama
## 4 AL Birmingham AL 229300 33.53 -86.80 0 Alabama
## 5 AL Mobile AL 188467 30.68 -88.09 0 Alabama
## 6 AL Huntsville AL 169155 34.71 -86.63 0 Alabama
## DRAWSEQ STATE_FIPS SUB_REGION
## 1 51 02 Pacific
## 2 51 02 Pacific
## 3 43 01 East South Central
## 4 43 01 East South Central
## 5 43 01 East South Central
## 6 43 01 East South Central
Our final dataset links every entry in the us.cities shapefile (1005 in total) to its state and appends data from the states shapefile to the us.cities information. Notice that I'm referencing the data.frame in the spatial objects using @data for the merge. This is because merge() will not bring together two shapefiles with different geometries. This makes sense... it's impossible for R to create a spatial object with both points and polygons held in the same place. If you need to join spatial objects, remember the over() function we discussed above! If, however, you want to merge a data.frame to your spatial object (say, data describing characteristics of the us.cities or states), then you can simply use the merge() function as shown above.
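Here's a minimal sketch of that last case, using a made-up data.frame of state-level statistics; sp supplies a merge() method that keeps the result spatial:
# hypothetical attribute table keyed by state abbreviation
state_stats <- data.frame(abbr = c("CA", "UT", "NV"), parks = c(9, 5, 2))
states <- merge(states, state_stats, by.x = "STATE_ABBR", by.y = "abbr", all.x = TRUE)
class(states) # still a SpatialPolygonsDataFrame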
ATTENTION: Don’t even think about computing geometries if your data is NOT projected!!
Oftentimes, we want to do something like compute the distance between two objects or the area of an object. We can do this in R! Let's say I've got data on cemeteries in Utah (weird, but this data exists). Say I want to find the average distance between cemeteries in the state:
library(rgeos)
cem <- shapefile("./data/Cemeteries.shp")
dist_matrix <- gDistance(spgeom1 = cem, spgeom2 = NULL, byid=T)
dist_matrix[1:10, 1:10]
## 1 2 3 4 5 6 7
## 1 0.00 259099.244 315823.76 95417.1 251323.499 81156.52 338058.50
## 2 259099.24 0.000 125252.03 341570.0 7812.565 195608.38 169224.33
## 3 315823.76 125252.034 0.00 375628.8 126793.686 236327.68 44596.34
## 4 95417.10 341570.008 375628.81 0.0 333774.765 146882.00 388880.24
## 5 251323.50 7812.565 126793.69 333774.8 0.000 187863.80 170313.75
## 6 81156.52 195608.382 236327.68 146882.0 187863.799 0.00 257072.46
## 7 338058.50 169224.330 44596.34 388880.2 170313.752 257072.46 0.00
## 8 376135.18 131112.787 206926.27 464773.1 138104.619 322029.01 247787.01
## 9 398451.36 151137.782 217565.53 486778.6 158354.598 343563.16 256707.15
## 10 346127.99 96206.562 175803.40 433140.8 103394.623 289131.68 218057.53
## 8 9 10
## 1 376135.18 398451.36 346127.99
## 2 131112.79 151137.78 96206.56
## 3 206926.27 217565.53 175803.40
## 4 464773.06 486778.64 433140.79
## 5 138104.62 158354.60 103394.62
## 6 322029.01 343563.16 289131.68
## 7 247787.01 256707.15 218057.53
## 8 0.00 22491.96 35497.02
## 9 22491.96 0.00 54962.35
## 10 35497.02 54962.35 0.00
gDistance() from the rgeos package computes the distance between points either in two separate SPDF objects (you can specify spgeom1 and spgeom2) or between the points within a single shapefile (spgeom2 = NULL, as above). The function returns a matrix of distances between points. Note that the diagonal is zero, i.e. the distance between a point and itself is zero. To avoid skewing our average distance, I'll set the diagonal of this matrix to NA and then compute the average distance between cemeteries:
diag(dist_matrix) <- NA
mean(dist_matrix, na.rm=T)
## [1] 216501
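The same matrix, with its diagonal blanked out, also gives you each cemetery's nearest neighbor:
# distance from each cemetery to its closest neighbor (map units, meters here)
nearest <- apply(dist_matrix, 1, min, na.rm = TRUE)
head(nearest)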
We can also use rgeos to compute the area of polygons. Let's compute the surface area of all counties in Utah:
ut <- cty[cty@data$STATEFP == 49,]
ut_areas <- gArea(ut, byid=T)
HAH! Gotcha! Be sure to project your data!! I will project my Utah dataset to the projection frequently used by folks in the state of Utah (UTM zone 12N):
crs <- "+proj=utm +zone=12 +datum=NAD83 +units=m +no_defs +ellps=GRS80 +towgs84=0,0,0"
ut <- spTransform(ut, crs)
ut_areas <- gArea(ut, byid=T)
head(ut_areas)
## 252 253 254 255 256 257
## 6697859970 17422718654 10632734968 1580948776 2085507623 4869604901
These are huge numbers because our projection is in meters, so the areas come back in square meters. If you'll be working a lot with distance, area, or other geometries, check out the rgeos package. There are other great functions, like gIntersection() and gOverlaps(), that may be of interest to you.
Getting dplyr to talk to sp

You may have noticed a few annoying things about how dplyr talks to objects from the sp package. If I want to take a shapefile, say my us.cities shapefile, and use dplyr to extract cities with a population of more than 500,000 people, then I do the following:
big_cities <- us.cities@data %>% filter(pop > 500000) %>% arrange(desc(pop))
head(big_cities)
## name country.etc pop lat long capital
## 1 New York NY NY 8124427 40.67 -73.94 0
## 2 Los Angeles CA CA 3911500 34.11 -118.41 0
## 3 Chicago IL IL 2830144 41.84 -87.68 0
## 4 Houston TX TX 2043005 29.77 -95.39 0
## 5 Phoenix AZ AZ 1450884 33.54 -112.07 2
## 6 Philadelphia PA PA 1439814 40.01 -75.13 0
That's easy enough. But did you notice something about our data? We started with a nice SpatialPointsDataFrame (us.cities) and ended up with a data.frame (big_cities):
class(us.cities)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
class(big_cities)
## [1] "data.frame"
The big_cities data.frame still has the coordinate information in it, so it's easy to transform it back into a SPDF:
big_cities_SPATIAL <- SpatialPointsDataFrame(coords = cbind(big_cities$long, big_cities$lat), data = big_cities)
plot(big_cities_SPATIAL)
…but this is annoying, right? There's a fairly new package called spdplyr which, as you may have guessed, integrates the sp package with dplyr. If you haven't already installed this package, be sure to do so. It's pretty useful.
library(spdplyr)
big_cities <- us.cities %>% filter(pop > 500000) %>% arrange(desc(pop))
class(big_cities)
## [1] "SpatialPointsDataFrame"
## attr(,"package")
## [1] "sp"
Hooray! And just in case you don’t believe it’s spatial:
plot(big_cities)
Keep the spdplyr package in mind as you wrangle spatial data. It'll save you the annoying step of having to transform your data between a spatial object and a data.frame.
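Other verbs carry the Spatial* class through as well. A quick sketch using the capital flag from the maps dataset (2 marks state capitals):
state_caps <- us.cities %>%
  filter(capital == 2) %>% # keep state capitals
  select(name, pop) # keep just two attribute columns
class(state_caps) # still a SpatialPointsDataFrame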
Check out the rgeos package should your project involve geometry computation. Pages 132-142 are especially helpful.