This tutorial draws heavily on Spatial Point Patterns: Methodology and Applications with R by Adrian Baddeley. This is an immensely helpful overview of point pattern theory and its implementation in R. If you plan on applying point pattern analysis to your data, I strongly recommend you download and read this text, particularly pages 6 - 17 and 178 - 180.
We will use base R to plot the results of our analyses. You can generate similar plots using ggplot2, but since the focus of this lab is understanding point pattern analysis and the spatstat package, we will only use the simple plot() function.
library(sp)
library(tidyverse)
library(spatstat)
library(raster)
NOTE: Point pattern analysis depends on DISTANCES so be absolutely sure that your data is in a PROJECTED COORDINATE SYSTEM.
Today, I’ll work with a few data sets. Many come standard with the spatstat package (lansing, longleaf, etc.). One set of data you will have to download from Chapter 8 of the R Spatial online textbook. At the top of Chapter 8, you will see hyperlinked crime, city, and census .rds files, which you should download and store locally on your computer before beginning the tutorial. These .rds files contain spatial (sp) objects describing census data, a city boundary, and incidences of crime.
city <- readRDS("./data/city.rds")
census <- readRDS("./data/census2000.rds")
crime <- readRDS("./data/crime.rds")
plot(city)
plot(crime, add=T, col = "red" , cex = 0.5, pch = "*")
The city SpatialPolygonsDataFrame delineates the boundary of a city. The census data set contains lots of data on the demographics of census blocks within the city. The crime data set contains information about crimes that occurred in the city as well as their locations. The types of crimes and the number of observations within each crime category are listed below:
crime@data %>% group_by(CATEGORY) %>% summarize(n = n()) %>% arrange(desc(n))
## # A tibble: 15 x 2
## CATEGORY n
## <chr> <int>
## 1 Petty Theft 665
## 2 Vandalism 355
## 3 Drunk in Public 232
## 4 Vehicle Burglary 221
## 5 Residential Burglary 219
## 6 DUI 212
## 7 Assaults 172
## 8 Commercial Burglary 143
## 9 Grand Theft 143
## 10 Drugs or Narcotics 134
## 11 Auto Theft 86
## 12 Robbery 49
## 13 Weapons 15
## 14 Arson 9
## 15 Hate Crimes 6
Point data give the location of objects (trees) or events (crime) within an area. Often, in addition to coordinates, points have other data attached to them called marks. Marks are essentially attributes of the point (i.e. species of tree, type of crime). Marks can be categorical (type of crime) or continuous (diameter of tree). Marks are different from covariates: marks are attributes associated with the point, while covariates are any kind of data that can be linked to the points as explanatory variables (climate data, elevation, socioeconomic data, land use, etc.). We can answer a number of important questions with point-pattern data, such as whether points are distributed randomly, uniformly, or in clusters, and whether their intensity depends on marks or covariates.
We can think of a point process as a black box that generates a random observed spatial point pattern according to some rules. Point processes are assumed to be (1) random (the locations and number of points are random) and (2) without a serial order (unless there are marks attached). Point processes are located within a sampling window, which is important to identify and define.
spatstat
In this tutorial, we’ll be working with the spatstat package, so make sure you’ve installed it on your machine. spatstat stores data a bit differently than the sp package, so I’ll begin by explaining the peculiarities of this package. spatstat classifies objects in the following way:
- ppp: planar point pattern
- owin: spatial region or observation window
- im: pixel image
- psp: a pattern of line segments (we won’t cover this in our class, but if you are working with polylines and are interested, check out this text)
- tess: tessellations, i.e. tilings using shapes (think shapefile)
There are a series of methods associated with each class. To get this list, type:
methods(class = "ppp") # replace ppp with class name of interest
## [1] [ [<- affine anyDuplicated
## [5] as.data.frame as.im as.layered as.owin
## [9] as.ppp auc berman.test boundingbox
## [13] boundingcentre boundingcircle boundingradius by
## [17] cdf.test circumradius closepairs closing
## [21] coerce connected coords coords<-
## [25] crossdist crosspairs cut density
## [29] dilation distfun distmap domain
## [33] duplicated edit envelope erosion
## [37] fardist flipxy Frame<- has.close
## [41] head identify initialize intensity
## [45] iplot is.connected is.empty is.marked
## [49] is.multitype kppm markformat marks
## [53] marks<- multiplicity nnclean nncross
## [57] nndensity nndist nnfun nnwhich
## [61] nobjects npoints opening pairdist
## [65] pcf periodify pixellate plot
## [69] ppm print quadrat.test quadratcount
## [73] quantess rebound relevel relrisk
## [77] rescale rhohat roc rotate
## [81] round rounding rshift scalardilate
## [85] scanmeasure segregation.test sharpen shift
## [89] show slotsFromS3 Smooth Smoothfun
## [93] split split<- subset summary
## [97] superimpose tail unique unitname
## [101] unitname<- unmark unstack Window
## [105] Window<-
## see '?methods' for accessing help and source code
It’s fairly easy to transform data between SpatialPoints objects from the sp package and the ppp objects used in spatstat. Let’s start by loading data from the spatstat package:
data("swedishpines")
x <- swedishpines
plot(x)
This sample data set shows the occurrence of Swedish pines within a small window.
class(x)
## [1] "ppp"
summary(x)
## Planar point pattern: 71 points
## Average intensity 0.007395833 points per square unit (one unit = 0.1
## metres)
##
## Coordinates are integers
## i.e. rounded to the nearest unit (one unit = 0.1 metres)
##
## Window: rectangle = [0, 96] x [0, 100] units
## Window area = 9600 square units
## Unit of length: 0.1 metres
Notice that the x variable we created is of class ppp. We see that it includes 71 points and that the average intensity (points per unit area) of these points is 0.0074 points per square unit, or in this case 0.74 trees per square meter (since one unit is 0.1 meters, one square unit is 0.01 square meters).
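If you’d rather have spatstat handle the unit conversion, here’s a minimal sketch using intensity() and rescale(); the factor of 10 below assumes the 0.1-metre unit reported in the summary above:
intensity(x)                     # points per square unit (units of 0.1 metres)
x_m <- rescale(x, 10, "metres")  # one new unit = 10 old units = 1 metre
intensity(x_m)                   # roughly 0.74 trees per square metre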
Unless you acquire magical data, your data will not come to you in ppp format. Working things into ppp (and later into sp for advanced plotting) will help you easily apply the complex analyses and functions listed below to your data, so we’re going to work through an example of loading a .csv-style file of coordinates and attributes into R, transforming this to a ppp object, and then to an sp SpatialPoints object. Let’s look at the location of dumps in the western US:
lf <- read.table("./data/landfill.txt", sep=",", stringsAsFactors = F, skip = 1)
colnames(lf) <- c("id", "lat", "lon")
# 1. store x and y coords in two vectors
lon <- lf$lon
lat <- lf$lat
# 2. create two vectors xrange and yrange with the dimensions of the rectangle that contains all points
xrange <- range(lon, na.rm=T)
yrange <- range(lat, na.rm=T)
# 3. create ppp
lf <- ppp(lon, lat, xrange, yrange)
## Warning: data contain duplicated points
plot(lf)
We can add marks to the ppp object by specifying them when we make the object. Let’s say we have an additional attribute that measures the size of each landfill in square kilometers, stored in a vector called size (simulated below for illustration):
size <- runif(length(lon), 1, 10) # hypothetical landfill sizes (sq. km), simulated for illustration
lf <- ppp(lon, lat, xrange, yrange, marks = data.frame(size = size))
plot(lf, which.marks = "size")
It’s a good idea, as usual, to check your data for missing or duplicated observations. You can do this by looking for repetitions in a two-column data.frame of your coordinates:
coords <- cbind.data.frame(lon, lat)
any(duplicated(coords))
## [1] TRUE
To remove duplicates from a ppp object, simply do:
lf <- unique(lf)
But just like missing data, it’s important to understand why there are duplicates, so be sure to interrogate your data.
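Before dropping anything, it can help to see how widespread the duplication is; a quick sketch using spatstat’s duplicated() and multiplicity() methods for ppp objects:
sum(duplicated(lf))      # number of points that repeat an earlier location
table(multiplicity(lf))  # how many times each point location occurs in the pattern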
To convert this ppp object to a SpatialPoints object, we can use the coercion methods provided by the maptools package:
library(maptools) # provides the coercion methods between spatstat and sp classes
lfsp <- as(lf, "SpatialPoints")
class(lfsp)
## [1] "SpatialPoints"
## attr(,"package")
## [1] "sp"
plot(lfsp)
spatstat functions
Each ppp object contains the following, indexed by a dollar sign:
- n, the number of points
- x and y, the x and y coordinates of each point
- marks, any attributes of the points
- window, which describes the observational window for the data
There are a number of functions that work well with the spatstat package that allow you to extract useful info from your ppp object, x, such as:
npoints(x)
marks(x)
coords(x)
as.owin(x)
as.data.frame(x)
For the FULL list of things you can do with these objects, see the Baddeley text, particularly p. 62.
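For instance, applied to the swedishpines pattern x we created above:
npoints(x)          # 71 points
head(coords(x))     # x and y coordinates as a data.frame
marks(x)            # NULL here, since swedishpines carries no marks
as.owin(x)          # the rectangular observation window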
It’s common for researchers to be interested in how points are distributed on a two-dimensional surface. Points may represent sample locations, species occurrences, or instances of disease. For example, as a researcher collecting samples or establishing control points, you may be concerned about the uniformity of these points across space. The patterns of points on maps can be classified as random, uniform (regularly spaced), or clustered.
We can informally inspect the distribution of our point process by plotting the point pattern and observing whether points appear to be random, uniform, or clustered. The quickest way to visually inspect our data is to plot() it, as we did above. Another useful visualization tool is to divide our data into quadrats and count the number of points within each quadrat, like so:
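(A minimal sketch using the swedishpines pattern x from above; quadratcount() is applied to the crime data in more detail below.)
q_sp <- quadratcount(x, nx = 4, ny = 4)  # divide the window into a 4 x 4 grid of quadrats
plot(x, pch = "+", main = "Quadrat counts, Swedish pines")
plot(q_sp, add = TRUE, cex = 1.5)        # overlay the count in each quadrat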
Let’s learn a bit more about how to visually inspect the spatial patterns in our point data by computing and visualizing a point process’s intensity.
We can think of the intensity of a point process as similar to the sample mean. The intensity is the average density of points (i.e. the number of points per unit area). Intensity can be constant (uniform or homogeneous) or may vary across space. We can compute the intensity manually by dividing the number of points in a shapefile by the total area of the shapefile:
dens <- nrow(crime)/area(city) # area() from the raster package
print(paste("The density is:", dens))
## [1] "The density is: 9.61537319061061e-06"
We can visualize how the intensity of crime varies across space by creating a grid (often called quadrats) and counting the number of crimes in each grid cell:
# create raster with the extent of the city
r <- raster(city)
# set raster resolution to 1000 feet
res(r) <- 1000
# the r raster fills the entire bounding box of the city shapefile. To clip only to the city boundaries, we can do the following:
r <- rasterize(city, r)
# create quadrats by turning raster pixels into polygons
quads <- as(r, 'SpatialPolygons')
# add points dataset
plot(quads)
plot(crime, add=T, col="red", cex=0.2)
We can count the number of crimes in each quadrat and turn this count data into a raster using rasterize():
:
nc <- rasterize(coordinates(crime), r, fun = 'count', background = 0)
plot(nc)
plot(city, add=T)
The areas in green and yellow are areas with a higher intensity of crime. To mask out areas of the nc raster outside of the city, use the mask() function and the r raster we made above with the extent of the city shapefile. If we fail to mask areas outside of the main region, we will have lots of zeros in our crime frequency data, which will skew our results:
ncrimes <- mask(nc, r)
plot(ncrimes)
plot(city, add=T)
Let’s look at the distribution of crime frequencies in the city. Notice how I index the data in the raster object. You could also use getValues():
hist(ncrimes@data@values, 100, main = "Distribution of crimes in quadrats")
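The same histogram using getValues(), which is a slightly cleaner way to pull the cell values out of a raster:
hist(getValues(ncrimes), 100, main = "Distribution of crimes in quadrats")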
This makes sense given our visualizations of the data. Most quadrats have little crime and a few quadrats have exceptionally high rates of crime. Let’s compute the average number of crimes per quadrat:
mean(ncrimes@data@values, na.rm=T)
## [1] 9.261484
These visualizations and descriptions are cool, but what if we want to estimate a continuous crime density surface for the city? We can use the spatstat package to do this, but first, we need to convert our SpatialPointsDataFrame crime to a ppp object.
lon <- crime@coords[,1]
lat <- crime@coords[,2]
# 2. create two vectors xrange and yrange with the dimensions of the rectangle that contains all points
xrange <- range(lon, na.rm=T)
yrange <- range(lat, na.rm=T)
# 3. create ppp
crime_ppp <- ppp(lon, lat, xrange, yrange, marks = as.factor(crime$CATEGORY)) #as factor piece important for splitting to plot later
## Warning: data contain duplicated points
plot(crime_ppp, main = "Our crime ppp object")
The plot is a mess, but hooray! We’ve successfully created a marked ppp object. Now let’s create quadrats and plot intensity using spatstat:
q <- quadratcount(crime_ppp, nx = 7, ny = 3)
plot(crime_ppp, cex = 0.5, pch = "+")
plot(q, add=T, cex=2, main = "Groovy quadrat plot")
Quadrats are cool and all, but kernel density is way cooler. The density() function creates a kernel density surface, which is stored in R as the spatstat class im, or image. This is how spatstat handles rasters.
ds <- density(crime_ppp)
class(ds)
## [1] "im"
Now let’s plot our density surface:
plot(ds, main = "Crime density")
Wait, what’s happening here? Kernel density estimation (KDE) is an approach to estimating the probability density function of a random variable. Think of KDE this way: for each event (crime), there’s a mini-function that spreads the effect of this crime across space. A KDE essentially sums all of these mini-functions to create a surface indicative of the density of events across space. The KDE here uses an isotropic Gaussian kernel [1]. To a certain extent, you can think of a KDE as a 3D, smoothed histogram; however, the actual function used in KDE is more complex than binning for a histogram.
One of the most important considerations when using KDE is the selection of the function’s bandwidth. The bandwidth of a KDE is essentially a parameter in the KDE function that indicates how smooth you want the data to be. Too smooth (high bandwidth) can mask local variations in intensity that are important, and too jagged (low bandwidth) can highlight local anomalies that don’t really reflect the overall spatial pattern. Put another way, a large bandwidth results in a very generalized surface of the event density while a small bandwidth focuses on smaller-scale, event-level variation. Kernels are typically smoothed such that the standard deviation of the smoothing kernel is used as the bandwidth (see ?density.ppp). You can specify the bandwidth easily in the density() function. Look at ?bw.diggle, ?bw.ppl, and related bandwidth-selection functions for the options currently available.
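To see what this looks like in practice, here’s a minimal sketch with two arbitrary bandwidths; the sigma values are illustrative guesses in the units of our crime_ppp coordinates (feet):
par(mfrow = c(1, 2))
plot(density(crime_ppp, sigma = 500), main = "sigma = 500 ft (jagged)")
plot(density(crime_ppp, sigma = 2000), main = "sigma = 2000 ft (smooth)")
par(mfrow = c(1, 1))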
Visualization is important, but does not stand up to the rigors of science. It’s important to know how to implement statistical tests to determine whether your data is randomly distributed through space.
In the tests that follow, we can think of our “null” hypothesis for the spatial distribution of our points as one of complete spatial randomness (CSR). By comparing our data to this null of randomness, we can determine whether our point process is significantly clustered in space. A CSR process has three basic properties:
- the number of points falling in any region follows a Poisson distribution, with a mean proportional to the region’s area;
- points are equally likely to occur anywhere in the sampling window (homogeneous intensity);
- points in disjoint regions are independent of one another.
Said in plain English: under the assumption of CSR, points are independent of each other and have the same likelihood of being found at any location. These assumptions are almost never met; it would be pretty boring to be a researcher if they were! Instead, the location of points is driven by point attributes (marks), other covariates, and random noise.
So I just dropped some stats jargon without explanation. Shame on me. You’re probably wondering what this Poisson thing is about. No, francophiles, I am not talking about fish. I’m talking about a discrete probability distribution named after Monsieur Siméon Denis Poisson in the early 1800s. Poisson was pretty prolific, as many scholars were back in the day. Today, he is best known for developing a distribution that captures the probability of a given number of events occurring within a fixed interval of time or space. Poisson distributions are often used for rare-event data (meteors, extreme weather, crime) and for the occurrence of events in space, i.e. our point process data sets. Curious to see what a homogeneous Poisson point process (i.e. a CSR process) looks like? Me too. With the rpoispp() function in spatstat, I can generate a number of points following CSR:
plot(rpoispp(100), main = "CSR 1")
plot(rpoispp(100), main = "CSR 2")
Each time you run the rpoispp() function, it generates a new CSR process. Notice that it’s hard to detect any real patterns in the point process; no clear clustering, no uniformity - i.e. CSR!
We care less about CSR on its own than about comparing our data to CSR models to say, with statistical confidence, whether or not our data is CSR. Non-randomness, of course, justifies fun exploration of covariates, marks, etc. Yeah, believe it or not, this stuff can be fun. The classic way we compare our data to a CSR process is to compute a \(\chi^2\) test based on quadrat counts. If you forgot how a \(\chi^2\) test works, go peruse your intro stats textbook or watch this. We can compare our data to the null of CSR (i.e. a fairly uniform quadrat count) using quadrat.test():
quadrat.test(crime_ppp, nx = 10, ny = 15)
##
## Chi-squared test of CSR using quadrat counts
## Pearson X2 statistic
##
## data: crime_ppp
## X2 = 9267.5, df = 149, p-value < 2.2e-16
## alternative hypothesis: two.sided
##
## Quadrats: 10 by 15 grid of tiles
Hey, look, we have a reallllyyy low p-value. This means we can reject the null hypothesis of CSR, i.e. our crime data has some funky spatial stuff going on! HOORAY! Well, kind of. This is a pretty generic test. It basically just tells us that our data is not a homogeneous Poisson process, but not much else. Also, the power of the quadrat test drops to zero as your quadrats become excessively small or large… so choose wisely and interpret with caution.
A better way to test for CSR is the Kolmogorov-Smirnov test (no, vodka connoisseurs, not that Smirnov). You may have already encountered this test, as it is pretty widely used to assess goodness-of-fit, i.e. how well sample data from some unknown population fit some hypothetical model of a specific population. All of the tests you’ll see in the next sections use this basic approach: compare our empirical (sampled) data to a theoretical (modeled) ideal and see if the two are significantly different. In our case, the theoretical model is a CSR process. We express both the empirical and theoretical distributions in cumulative form (i.e. as cumulative probability distributions, which rise from 0 to 1) and plot them together. We then look for the greatest difference between the two. That maximum distance is the Kolmogorov-Smirnov statistic, i.e. \(D = \max|CDF - EDF|\), where \(CDF\) is the theoretical cumulative distribution function and \(EDF\) is the empirical cumulative distribution function.
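To make the statistic concrete, here’s a simplified, hand-rolled version of \(D\) for the x coordinates of our crimes, compared against a uniform distribution (which is what CSR implies for a rectangular window). This only evaluates the difference at the data points and uses the observed range rather than the true window, so treat it as an illustration; cdf.test() below does this properly:
xs <- sort(crime_ppp$x)
edf <- seq_along(xs) / length(xs)            # empirical CDF evaluated at each data point
cdf <- (xs - min(xs)) / (max(xs) - min(xs))  # uniform (CSR) CDF over the observed x range
max(abs(edf - cdf))                          # the Kolmogorov-Smirnov distance D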
With point data, we specify a real function \(T(x,y)\) defined at all locations \((x,y)\) in our sampling window. We evaluate this function at each of the data points and compare the empirical values of \(T\) with the predicted distribution of \(T\) under the CSR assumptions. spatstat makes this painless:
ks <- cdf.test(crime_ppp, "x")
plot(ks)
pval <- ks$p.value
pval
## [1] 0
Here, I’m just evaluating my sample data using the x coordinate. You can also input a density function (see below) as a covariate to assess overall spatial randomness:
ds <- density(crime_ppp)
k <- cdf.test(crime_ppp, ds)
plot(k)
In all cases, our low (zero) p-values suggest that our crime data were not drawn from a population distributed according to the assumptions of CSR. We can also see this in our plots - the observed and expected (if our data were CSR) cumulative distributions are pretty different.
Testing for the complete spatial randomness of our data will tell us whether or not our point process is random. It won’t, however, tell us how our data is non-random, i.e. whether points are closer together than a random process would produce or more structured than a random process. It’s useful to be able to test whether our non-random data is regular (i.e. points tend to avoid each other) or clustered (points like to hang together). The main statistical techniques for answering this question compare theoretical and empirical models of distance between points in a point pattern process. We can think of distance between points in several ways, each of which has its own associated statistic of spatial dependence. Distance can be thought of in terms of:
- pairwise distances, the distances between all distinct pairs of points;
- nearest neighbor distances, the distance from each point to its nearest neighboring point;
- empty space distances, the distance from a fixed reference location in the window to the nearest point of the pattern.
These are easy to compute in spatstat with the pairdist(), nndist(), and distmap() functions, respectively.
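A quick sketch of these three helpers applied to our crime_ppp object (distances are in the units of the coordinate system, feet in this case):
pd <- pairdist(crime_ppp)    # square matrix of distances between every pair of points
nnd <- nndist(crime_ppp)     # vector: distance from each point to its nearest neighbor
emap <- distmap(crime_ppp)   # pixel image (im): distance from every location to the nearest point
summary(nnd)
plot(emap, main = "Empty space distances")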
The \(G\) function measures the distribution of distances from an arbitrary event to its nearest event (i.e. uses nearest neighbor distances). By plotting our empirically estimated \(G\) function against our theoretical expectation if the process were \(CSR\), we can assess the extent to which the empirical distribution of our data is different from the theoretical \(CSR\) expectation.
gtest <- Gest(crime_ppp)
gtest
## Function value object (class 'fv')
## for the function r -> G(r)
## .....................................................................
## Math.label Description
## r r distance argument r
## theo G[pois](r) theoretical Poisson G(r)
## han hat(G)[han](r) Hanisch estimate of G(r)
## rs hat(G)[bord](r) border corrected estimate of G(r)
## km hat(G)[km](r) Kaplan-Meier estimate of G(r)
## hazard hat(h)[km](r) Kaplan-Meier estimate of hazard function h(r)
## theohaz h[pois](r) theoretical Poisson hazard function h(r)
## .....................................................................
## Default plot formula: .~r
## where "." stands for 'km', 'rs', 'han', 'theo'
## Recommended range of argument r: [0, 243.1]
## Available range of argument r: [0, 797.87]
plot(gtest)
So what’s happening here? Gest estimates our theoretical CDF if our process were a homogeneous Poisson point process (the blue line and theo in the printout). We can compare this with our actual, empirical CDF, here estimated according to three different approaches to account for edge effects (km, bord, han). Accounting for edge effects is crucial, as points at the edge of the sampling window may be close to points just outside of the window, but are counted as far from other points because of the window boundary. Detailed documentation on how these edge assumptions differ can be found by typing ?Gest. As always, test sensitivity to different assumptions!
For all values of r (distances between points), our empirical function is greater than the theoretical one. This suggests that nearest neighbor distances in the point pattern are shorter than for a Poisson process, suggesting a clustered pattern. If our empirical CDF had been lower than the theoretical one, this would suggest a more regular pattern.
The \(F\) function measures the distribution of distances from an arbitrary location in the window to its nearest event (i.e. it uses empty space distances). It is also called the empty space function because it is a measure of the average space left between events.
ftest <- Fest(crime_ppp)
ftest
## Function value object (class 'fv')
## for the function r -> F(r)
## .....................................................................
## Math.label Description
## r r distance argument r
## theo F[pois](r) theoretical Poisson F(r)
## cs hat(F)[cs](r) Chiu-Stoyan estimate of F(r)
## rs hat(F)[bord](r) border corrected estimate of F(r)
## km hat(F)[km](r) Kaplan-Meier estimate of F(r)
## hazard hat(h)[km](r) Kaplan-Meier estimate of hazard function h(r)
## theohaz h[pois](r) theoretical Poisson hazard h(r)
## .....................................................................
## Default plot formula: .~r
## where "." stands for 'km', 'rs', 'cs', 'theo'
## Recommended range of argument r: [0, 814.78]
## Available range of argument r: [0, 814.78]
plot(ftest)
The Fest function also computes several versions of the \(F\) function using different edge-effect estimators (cs, bord, km - review the package documentation for more info). Inversely to the \(G\) function, empirical values above our theoretical values (blue) indicate that empty space distances in our empirical point pattern are shorter than for a Poisson process, suggesting a regularly spaced pattern. The opposite would indicate a clustered pattern. In plain English? Our data appears to be clustered.
The \(K\) function measures the number of events found up to a given distance of any particular event (i.e. it uses pairwise distances). By comparing estimated values of the \(K\) function to the theoretical ones, we can determine what kind of interaction exists between points. Think of this function as the expected number of points of the process within a given distance of a typical point of the process. For a homogeneous Poisson process, this expected number depends only on the distance, not on where the typical point sits. Again, we compare our estimated empirical \(K\) function with the Poisson \(K\) function. Empirical values less than our theoretical ones suggest a regular pattern; empirical values greater than theoretical values suggest clustering.
ktest <- Kest(crime_ppp)
ktest
## Function value object (class 'fv')
## for the function r -> K(r)
## ..............................................................
## Math.label Description
## r r distance argument r
## theo K[pois](r) theoretical Poisson K(r)
## border hat(K)[bord](r) border-corrected estimate of K(r)
## trans hat(K)[trans](r) translation-corrected estimate of K(r)
## iso hat(K)[iso](r) isotropic-corrected estimate of K(r)
## ..............................................................
## Default plot formula: .~r
## where "." stands for 'iso', 'trans', 'border', 'theo'
## Recommended range of argument r: [0, 3476.4]
## Available range of argument r: [0, 3476.4]
plot(ktest)
This test confirms what we saw in our \(G\) and \(F\) test results: our data is clustered.
If you are working with point data with multiple attributes, you’re working with marked data. Our crime data, for example, is marked with a number of attributes, including the type of crime. Below, we’ll be working with a dataset that contains the location of trees as well as tree species, diameter, and height. There are a number of things you can do with marked data, which I’ll review here. Note that for data to be treated as a marked point process, the points should still be random, i.e. looking at temperature variations recorded at the coordinates of weather stations does not count, since the point observations are at a fixed set of locations (if this is confusing, refer to the pre-assignment readings). What about core sampling locations? Nope. Normally these are selected by a scientist, i.e. not random. Crime? Ostensibly random. Tree location in a forest? As long as they weren’t intentionally planted in specific locations, we can assume these are random. Even with seemingly random data, there are different ways to think about the randomness. From p. 179 in the Baddeley text, the possible null hypotheses for a marked point pattern include:
- random labelling: given the locations of the points, the marks are assigned to points independently and at random;
- independence of components: the points of each mark type form a point process that is independent of the others;
- complete spatial randomness and independence: the locations are CSR and the marks are assigned by random labelling.
These null hypotheses are each different. It’s important to think through this before you start playing with marked point data.
An important note before we start: if your marks are categorical, make sure they are stored as factors in your ppp object.
Ok, let’s GO. We’ll start by working with the lansing data set stored in the spatstat package, which includes categorical information on tree locations and species.
data(lansing)
summary(lansing)
## Marked planar point pattern: 2251 points
## Average intensity 2251 points per square unit (one unit = 924 feet)
##
## *Pattern contains duplicated points*
##
## Coordinates are given to 3 decimal places
## i.e. rounded to the nearest multiple of 0.001 units (one unit = 924 feet)
##
## Multitype:
## frequency proportion intensity
## blackoak 135 0.05997335 135
## hickory 703 0.31230560 703
## maple 514 0.22834300 514
## misc 105 0.04664594 105
## redoak 346 0.15370950 346
## whiteoak 448 0.19902270 448
##
## Window: rectangle = [0, 1] x [0, 1] units
## Window area = 1 square unit
## Unit of length: 924 feet
This data includes 2251 trees of six different species.
plot(lansing)
We can easily plot the point process for each species (stored as a factor) using the split() function:
spp <- split(lansing)
plot(spp[1:4], main = "Tree location by type")
To extract one of the patterns:
hick <- spp$hickory
plot(hick)
We can plot the densities of each type of tree by calling:
plot(density(spp[1:4]), main = "Densities of tree type")
Cool. We can also extract information about the marks of points surrounding each point. The argument R indicates the neighborhood radius to use:
m <- marktable(lansing, R = 0.1)
m[1:10,]
## mark
## point blackoak hickory maple misc redoak whiteoak
## 1 1 17 17 8 20 17
## 2 1 9 18 5 14 28
## 3 1 10 19 6 17 23
## 4 1 10 0 0 8 10
## 5 2 18 0 0 4 15
## 6 6 32 1 0 9 19
## 7 9 38 1 0 10 8
## 8 7 34 1 0 10 16
## 9 3 32 0 0 10 15
## 10 4 42 0 0 11 15
Now let’s move to a data set that contains continuous marks. The longleaf data set contains the locations and sizes (diameters) of longleaf pines. markstat() applies a summary function to the marks of the points in a neighborhood of each point in the data set; in this case, the average diameter among the N = 5 nearest trees around each point:
data(longleaf)
md <- markstat(longleaf, mean, N = 5)
md[1:10]
## [1] 43.40 43.40 48.58 21.70 48.38 53.32 40.28 29.82 24.92 21.70
We can also visually inspect the distribution of our continuous marks:
hist(marks(longleaf))
To study the relationship between two or more density plots, consider the pairs() function:
pairs(density(split(lansing)[c(2,3,5)]))
This command divides the lansing data set into each tree species, computes the smoothed kernel intensity estimate for each species, and displays scatterplots of each species pair. The plots look weird because we’re looking at how two density surfaces interact; think of this as a 3D covariance matrix. The shapes that don’t look like stingrays (the hickory and maple interactions) are indicative of species separation, i.e. hickory and maple are rarely found in the same space, since a high density of hickories is associated with a low density of maples.
We can also compare the distances between two point pattern objects \(X\) and \(Y\). crossdist(X,Y) computes a matrix of distances from each point in \(X\) to each point in \(Y\), and nncross(X,Y) finds, for each point in \(X\), the nearest point of \(Y\) and the distance to this point.
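For example, using two of the species patterns we split from lansing above (hickory and maple):
hickory <- spp$hickory
maple <- spp$maple
cd <- crossdist(hickory, maple)  # matrix: distance from each hickory to each maple
nn <- nncross(hickory, maple)    # for each hickory: nearest maple (which) and its distance (dist)
head(nn)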
It’s often quite helpful to determine whether the intensity of our point pattern process is driven by other covariates. Could our crime intensity be driven by socioeconomic variables? Could the location of tree species be driven by topography? In both cases, the covarying data is not stored in the ppp object, but has to be brought in (as a raster or shapefile) from another data source. We can start answering very important and interesting questions once we bring in this additional data. Let’s start with a pre-made example from the spatstat package. Let’s say we have data on tropical tree locations and slope:
data(bei) #in spatstat package
slope <- bei.extra$grad
b <- quantile(slope, probs = (0:4)/4)
slope_break <- cut(slope, breaks = b, labels = 1:4)
v <- tess(image = slope_break)
plot(v)
plot(bei, add=T, pch = "+")
So what are we doing here? After we load our data, we create an object b that stores the sample quantiles of slope corresponding to the given probabilities (probs), in our case 0%, 25%, 50%, 75%, and 100%. We then use cut() (applied here to the slope pixel image, an im object) to reclassify slope into the values 1 to 4 based on which quartile each pixel belongs to. The tess() or tessellation function creates a collection of disjoint spatial regions; in this case, each region is one of the four categories in the image we have created. We can then count the number of points in each tessellation tile. If the point pattern process were uniform, we’d expect a near-equal number of points in each slope category (1, 2, 3, 4). We can use the quadratcount() function to count the number of points in each category of slope:
qb <- quadratcount(bei, tess=v)
qb
## tile
## 1 2 3 4
## 271 984 1028 1321
Looks like there are big differences in tree locations. It appears that trees in this dataset prefer steeper slopes (3 and 4).
This is one way to get a rough understanding of how a covariate relates to a point process. I also like estimating the prevalence of points across a gradient of covariate values. Here’s some quick math to explain this. Imagine the intensity of a point process is, in fact, a function of a covariate (i.e. tree density is a function of our slope). Cool! How do we test this empirically? One quick way is to compute a function that tells us how the intensity of points depends on the value of the covariate. At any spatial location \(u\), let \(\lambda(u)\) describe the intensity of the point process and \(Z(u)\) be the value of the covariate (slope, in our case). Then what we are interested in is this relationship:
\[
\lambda(u) = \rho(Z(u))
\]
where \(\rho\) is the function of interest. spatstat estimates the intensity as a function of a covariate with the rhohat() function:
plot(rhohat(bei, slope))
In ecology, this is often known as a resource selection function if the points are organisms and the covariates are a descriptor of the environment or habitat. If you decide to use this method, then be sure to read about the non-parametric smoothing estimator used to generate this function. The smoother we specify changes the function - as always, it’s crucial you understand what assumptions underlie your analyses and visualizations and that you ensure the approach is suitable for your data set.
Here’s another cool thing we can do. Remember our friend Kolmogorov-Smirnov? We can use this test to determine whether the point-pattern intensity of interest to us depends on a specific covariate. We’ve collected evidence above that suggests that tropical tree location depends on slope. We can formally test this by applying the K-S test using slope as our covariate (instead of the coordinates or density we used above):
ks <- cdf.test(bei, slope)
plot(ks)
NOTE: if you’re working with continuous covariates (slope), then K-S is your guy. If you’ve got categorical variables, you’ll want to consider the classic \(\chi^2\) test. Again, this test shows that the relationship between slope and tree locations is not random, i.e. there is some significant relationship between the two. If you are interested in exploring how your point data relates to spatial covariates in more detail, I recommend you read the section on maximum likelihood for Poisson processes in the CS text, starting on p. 95. Think of this as regression for point processes... and remember, as is always the case, that to do this sort of analysis you have to make fairly strong assumptions about your data.
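As a teaser for that material, here’s a minimal sketch of what such a model might look like for the bei data, fitting a log-linear effect of slope on intensity with ppm(); the formula interface shown here assumes a reasonably recent version of spatstat:
fit <- ppm(bei ~ grad, data = bei.extra)  # intensity modeled as exp(a + b * slope)
fit                                       # prints the fitted coefficients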
Resources: Spatial Point Patterns: Methodology and Applications with R, by Adrian Baddeley. I relied heavily on this text for the creation of this tutorial, as well as your required pre-assignment reading in the RSpatial textbook. This tutorial also uses the spatstat package for R; more information can be found here.
[1] If you really get into point pattern analysis, you can change this. For example, you can estimate an adaptive estimator of intensity which uses a fraction \(f\) of the data to construct a Dirichlet tessellation and estimates a constant intensity in each tile of the tessellation.