Skip to content

Commit ad5fc02

Browse files
author
David O'Sullivan
committed
updated statistical models materials for 2023
1 parent 67aa719 commit ad5fc02

File tree

3 files changed

+40
-6
lines changed

3 files changed

+40
-6
lines changed

labs/statistical-models/README.md

Lines changed: 5 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -1,8 +1,9 @@
1-
#### GISC 422 T1 2021
1+
#### GISC 422 T2 2023
22
# Statistical models
33
Some slides from another class to look at this week
4-
+ [From overlay to regression](https://southosullivan.com/geog315/from-overlay-to-regression/)
5-
+ [Introduction to regression](https://southosullivan.com/geog315/regression/)
6-
+ [More on regression](https://southosullivan.com/geog315/more-on-regression/)
4+
5+
+ [From overlay to regression](https://dosull.github.io/Geog315/slides/from-overlay-to-regression/)
6+
+ [Introduction to regression](https://dosull.github.io/Geog315/slides/regression/)
7+
+ [More on regression](https://dosull.github.io/Geog315/slides/more-on-regression/)
78

89
And then [a notebook to explore](statistical-models.md). If you [download the zip file](statistical-models.zip?raw=true) then you'll find the data and the RMarkdown version of the notebook in there too.

labs/statistical-models/statistical-models.Rmd

Lines changed: 17 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
1-
#### GISC 422 T1 2021
1+
#### GISC 422 T2 2023
22
# Statistical models
3+
34
```{r message = FALSE}
45
library(raster)
56
library(sf)
@@ -20,6 +21,7 @@ In class I will speak briefly to these before diving into the material below.
2021

2122
## Environment layers
2223
There are a bunch of raster data sets in a folder called `layers`. We can read them all into a raster `stack` by listing the directory as follows
24+
2325
```{r message = FALSE, warning = FALSE}
2426
layer_sources <- file.path("layers", dir(path = "layers"))
2527
layers <- stack(layer_sources)
@@ -44,18 +46,21 @@ tseas | Temperature seasonality | degrees C
4446
vpd | Mean October vapor pressure deficit at 9 AM | kPa
4547

4648
We can map any particular layer of interest as follows:
49+
4750
```{r}
4851
tmap_mode("view")
4952
tm_shape(layers$dem) +
5053
tm_raster(palette = "Oranges", style = "cont", alpha = 0.8)
5154
```
5255

5356
An entire stack of layers like this can be mapped in one go with a simple `plot` command:
57+
5458
```{r}
5559
plot(layers)
5660
```
5761

5862
or if you like a bit more cartographic polish use `tmap` (but do this in plot mode, and it might be slow anyway...):
63+
5964
```{r warning = FALSE, message = FALSE}
6065
tmap_mode('plot') # best to do this in plot mode
6166
tm_shape(layers) +
@@ -68,6 +73,7 @@ tm_shape(layers) +
6873

6974
## Plant presence-absence data
7075
I obtained presence-absence data for the mysterious 'nz35' species from the [`disdat` package]() with some more details but not plant identities, unfortunately in [this paper](https://dx.doi.org/10.17161/bi.v15i2.13384).
76+
7177
```{r}
7278
plants <- st_read("nz35-pa.gpkg")
7379
plants.d <- plants %>%
@@ -78,6 +84,7 @@ plants.d <- plants %>%
7884
In this dataset the attribute `nz35` is 1 where the plant has been observed and 0 where it has not (or where synthetic absence data has been generated).
7985

8086
Map these on any chosen environment layer like this
87+
8188
```{r}
8289
tmap_mode("view")
8390
tm_shape(layers$dem) +
@@ -91,6 +98,7 @@ tm_shape(layers$dem) +
9198
If we are interested in the distribution of this species (whether because it is invasive, or because it is endangered!) then this is a classic GIS setting for doing some kind of overlay analysis.
9299

93100
We might for example based on inspection, or expert knowledge, or on some other basis choose cutoff values in each environmental layer and make binary maps for each. For example
101+
94102
```{r}
95103
dem_bin <- layers$dem < 800
96104
tm_shape(dem_bin) +
@@ -110,12 +118,14 @@ plants.d <- plants.d %>%
110118
```
111119

112120
We can then do things like
121+
113122
```{r}
114123
ggplot(plants.d) +
115124
geom_boxplot(aes(x = nz35, y = dem, group = nz35))
116125
```
117126

118127
If we wanted the 'full picture' in this way, we can also do that
128+
119129
```{r}
120130
plants.d.long <- plants.d %>%
121131
pivot_longer(-nz35)
@@ -150,6 +160,7 @@ predicted_presence <- predict(layers, logistic_model, type = "response")
150160
```
151161

152162
And we can map this result in the usual way
163+
153164
```{r}
154165
tm_shape(predicted_presence) +
155166
tm_raster(palette = "YlOrRd")
@@ -158,26 +169,30 @@ tm_shape(predicted_presence) +
158169
Fitting models and chooseing which of the many possible ones we could is a complex process, based on expertise, on the data and on measures of model quality, In the summary above the results suggest that `mas` is more important to the result, than `dem`. Building on this, we might drop `dem` and try again adding in some other factor, or several other factors.
159170

160171
We could also use automated approaches. A base R function for this is `step` which given a base model will try dropping variables to find a best model:
172+
161173
```{r}
162174
step(logistic_model)
163175
```
164176

165177
This approach uses a measure of model quality (AIC, Akaike's Information Criterion *smaller is better*) and in the above example, when the stepping process tries dropping either variable AIC gets worse (AIC gets higher), so it suggests that working with those two variables only, the best model is one that includes both.
166178

167179
This makes it tempting to go all in:
180+
168181
```{r}
169182
logistic_model <- glm(nz35 ~ age + deficit + dem + mas + mat + r2pet + rain + slope + sseas + tseas + vpd, data = plants.d, family = "binomial")
170183
step(logistic_model)
171184
```
172185

173186
There are good reasons not to do this, but just to see what we end up with the resulting model is
187+
174188
```{r}
175189
logistic_model <- glm(formula = nz35 ~ deficit + dem + mat + rain + sseas + tseas,
176190
family = "binomial", data = plants.d)
177191
predicted_presence <- predict(layers, logistic_model, type = "response")
178192
```
179193

180194
And map as before
195+
181196
```{r}
182197
tm_shape(predicted_presence) +
183198
tm_raster(palette = "Greys") +
@@ -186,6 +201,7 @@ tm_shape(predicted_presence) +
186201
```
187202

188203
Measures of how good a model this is in its ability to predict accurately depend on how well, as we change the decision threshold (i.e. what probability value of the predicted result we use to predict 'presence') we do in predicting true positives and false positives. This can be summarised using an 'area under the curve' statistic available in the `pROC` package:
204+
189205
```{r message = FALSE}
190206
library(pROC)
191207
x <- roc(plants.d$nz35, fitted(logistic_model))

labs/statistical-models/statistical-models.md

Lines changed: 18 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,6 @@
1-
#### GISC 422 T1 2021
1+
#### GISC 422 T2 2023
22
# Statistical models
3+
34
```{r message = FALSE}
45
library(raster)
56
library(sf)
@@ -20,6 +21,7 @@ In class I will speak briefly to these before diving into the material below.
2021

2122
## Environment layers
2223
There are a bunch of raster data sets in a folder called `layers`. We can read them all into a raster `stack` by listing the directory as follows
24+
2325
```{r message = FALSE, warning = FALSE}
2426
layer_sources <- file.path("layers", dir(path = "layers"))
2527
layers <- stack(layer_sources)
@@ -44,29 +46,34 @@ tseas | Temperature seasonality | degrees C
4446
vpd | Mean October vapor pressure deficit at 9 AM | kPa
4547

4648
We can map any particular layer of interest as follows:
49+
4750
```{r}
4851
tmap_mode("view")
4952
tm_shape(layers$dem) +
5053
tm_raster(palette = "Oranges", style = "cont", alpha = 0.8)
5154
```
5255

5356
An entire stack of layers like this can be mapped in one go with a simple `plot` command:
57+
5458
```{r}
5559
plot(layers)
5660
```
5761

5862
or if you like a bit more cartographic polish use `tmap` (but do this in plot mode, and it might be slow anyway...):
63+
5964
```{r warning = FALSE, message = FALSE}
6065
tmap_mode('plot') # best to do this in plot mode
6166
tm_shape(layers) +
6267
tm_raster(title = names(layers)) +
68+
tm_layout(legend.position = c("RIGHT", "BOTTOM")) +
6369
tm_facets(free.scales = TRUE) # this allows a different scale for each layer
6470
```
6571

6672
(Note that we shoudln't have to supply the `title = ...` setting, but [there seems to be a bug](https://github.com/mtennekes/tmap/issues/166) and this seems to work around it.)
6773

6874
## Plant presence-absence data
6975
I obtained presence-absence data for the mysterious 'nz35' species from the [`disdat` package]() with some more details but not plant identities, unfortunately in [this paper](https://dx.doi.org/10.17161/bi.v15i2.13384).
76+
7077
```{r}
7178
plants <- st_read("nz35-pa.gpkg")
7279
plants.d <- plants %>%
@@ -77,6 +84,7 @@ plants.d <- plants %>%
7784
In this dataset the attribute `nz35` is 1 where the plant has been observed and 0 where it has not (or where synthetic absence data has been generated).
7885

7986
Map these on any chosen environment layer like this
87+
8088
```{r}
8189
tmap_mode("view")
8290
tm_shape(layers$dem) +
@@ -90,6 +98,7 @@ tm_shape(layers$dem) +
9098
If we are interested in the distribution of this species (whether because it is invasive, or because it is endangered!) then this is a classic GIS setting for doing some kind of overlay analysis.
9199

92100
We might for example based on inspection, or expert knowledge, or on some other basis choose cutoff values in each environmental layer and make binary maps for each. For example
101+
93102
```{r}
94103
dem_bin <- layers$dem < 800
95104
tm_shape(dem_bin) +
@@ -109,12 +118,14 @@ plants.d <- plants.d %>%
109118
```
110119

111120
We can then do things like
121+
112122
```{r}
113123
ggplot(plants.d) +
114124
geom_boxplot(aes(x = nz35, y = dem, group = nz35))
115125
```
116126

117127
If we wanted the 'full picture' in this way, we can also do that
128+
118129
```{r}
119130
plants.d.long <- plants.d %>%
120131
pivot_longer(-nz35)
@@ -149,6 +160,7 @@ predicted_presence <- predict(layers, logistic_model, type = "response")
149160
```
150161

151162
And we can map this result in the usual way
163+
152164
```{r}
153165
tm_shape(predicted_presence) +
154166
tm_raster(palette = "YlOrRd")
@@ -157,26 +169,30 @@ tm_shape(predicted_presence) +
157169
Fitting models and chooseing which of the many possible ones we could is a complex process, based on expertise, on the data and on measures of model quality, In the summary above the results suggest that `mas` is more important to the result, than `dem`. Building on this, we might drop `dem` and try again adding in some other factor, or several other factors.
158170

159171
We could also use automated approaches. A base R function for this is `step` which given a base model will try dropping variables to find a best model:
172+
160173
```{r}
161174
step(logistic_model)
162175
```
163176

164177
This approach uses a measure of model quality (AIC, Akaike's Information Criterion *smaller is better*) and in the above example, when the stepping process tries dropping either variable AIC gets worse (AIC gets higher), so it suggests that working with those two variables only, the best model is one that includes both.
165178

166179
This makes it tempting to go all in:
180+
167181
```{r}
168182
logistic_model <- glm(nz35 ~ age + deficit + dem + mas + mat + r2pet + rain + slope + sseas + tseas + vpd, data = plants.d, family = "binomial")
169183
step(logistic_model)
170184
```
171185

172186
There are good reasons not to do this, but just to see what we end up with the resulting model is
187+
173188
```{r}
174189
logistic_model <- glm(formula = nz35 ~ deficit + dem + mat + rain + sseas + tseas,
175190
family = "binomial", data = plants.d)
176191
predicted_presence <- predict(layers, logistic_model, type = "response")
177192
```
178193

179194
And map as before
195+
180196
```{r}
181197
tm_shape(predicted_presence) +
182198
tm_raster(palette = "Greys") +
@@ -185,6 +201,7 @@ tm_shape(predicted_presence) +
185201
```
186202

187203
Measures of how good a model this is in its ability to predict accurately depend on how well, as we change the decision threshold (i.e. what probability value of the predicted result we use to predict 'presence') we do in predicting true positives and false positives. This can be summarised using an 'area under the curve' statistic available in the `pROC` package:
204+
188205
```{r message = FALSE}
189206
library(pROC)
190207
x <- roc(plants.d$nz35, fitted(logistic_model))

0 commit comments

Comments
 (0)