Some slides from another class to look at this week
- +[From overlay to regression](https://southosullivan.com/geog315/from-overlay-to-regression/)
- +[Introduction to regression](https://southosullivan.com/geog315/regression/)
- +[More on regression](https://southosullivan.com/geog315/more-on-regression/)
+
+ +[From overlay to regression](https://dosull.github.io/Geog315/slides/from-overlay-to-regression/)
+ +[Introduction to regression](https://dosull.github.io/Geog315/slides/regression/)
+ +[More on regression](https://dosull.github.io/Geog315/slides/more-on-regression/)

And then [a notebook to explore](statistical-models.md). If you [download the zip file](statistical-models.zip?raw=true) then you'll find the data and the RMarkdown version of the notebook in there too.
An entire stack of layers like this can be mapped in one go with a simple `plot` command:

```{r}
plot(layers)
```

or if you like a bit more cartographic polish use `tmap` (but do this in plot mode, and it might be slow anyway...):

```{r warning = FALSE, message = FALSE}
tmap_mode('plot') # best to do this in plot mode
tm_shape(layers) +
@@ -68,6 +73,7 @@ tm_shape(layers) +
## Plant presence-absence data

I obtained presence-absence data for the mysterious 'nz35' species from the [`disdat` package](). [This paper](https://dx.doi.org/10.17161/bi.v15i2.13384) gives some more details but, unfortunately, not the plant identities.

```{r}
plants <- st_read("nz35-pa.gpkg")
plants.d <- plants %>%
@@ -78,6 +84,7 @@ plants.d <- plants %>%
In this dataset the attribute `nz35` is 1 where the plant has been observed and 0 where it has not (or where synthetic absence data has been generated).

Map these on any chosen environment layer like this:

```{r}
tmap_mode("view")
tm_shape(layers$dem) +
@@ -91,6 +98,7 @@ tm_shape(layers$dem) +
If we are interested in the distribution of this species (whether because it is invasive, or because it is endangered!) then this is a classic GIS setting for doing some kind of overlay analysis.

Based on inspection, expert knowledge, or some other basis, we might choose cutoff values in each environmental layer and make a binary map for each. For example:

```{r}
dem_bin <- layers$dem < 800
tm_shape(dem_bin) +
@@ -110,12 +118,14 @@ plants.d <- plants.d %>%
```

We can then do things like:

```{r}
ggplot(plants.d) +
  geom_boxplot(aes(x = nz35, y = dem, group = nz35))
```

If we want the 'full picture' in this way, we can also do that:
Fitting models and choosing which of the many possible ones to use is a complex process, based on expertise, on the data, and on measures of model quality. In the summary above, the results suggest that `mas` is more important to the result than `dem`. Building on this, we might drop `dem` and try again, adding in some other factor or several other factors.
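
As a rough illustration of reading such a summary, here is a minimal sketch on made-up data. The data frame `sim`, the predictors `sim_a` and `sim_b`, and the model `toy_model` are all hypothetical and are not part of the notebook's data. It only shows where the estimates and z values live, and that fitted probabilities are the logistic transform of the linear predictor.

```r
set.seed(1)  # arbitrary seed so the made-up data are reproducible
n <- 400

# Hypothetical predictors; sim_a is deliberately given the stronger effect
sim <- data.frame(sim_a = rnorm(n), sim_b = rnorm(n))
sim$present <- rbinom(n, size = 1,
                      prob = plogis(-0.3 + 1.5 * sim$sim_a + 0.3 * sim$sim_b))

toy_model <- glm(present ~ sim_a + sim_b, data = sim, family = "binomial")

# Estimates, standard errors, z values and p values as a matrix
coef_table <- coef(summary(toy_model))
print(round(coef_table, 3))

# Fitted probabilities are the logistic transform of the linear predictor
p_link <- predict(toy_model, type = "link")
p_resp <- predict(toy_model, type = "response")
```

Comparing raw coefficient sizes is only meaningful here because both made-up predictors are on the same scale; with real layers measured in different units, the z values are usually the safer thing to compare.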

We could also use automated approaches. A base R function for this is `step`, which, given a base model, will try dropping variables to find a best model:

```{r}
step(logistic_model)
```

This approach uses a measure of model quality, AIC (Akaike's Information Criterion; *smaller is better*). In the above example, when the stepping process tries dropping either variable, AIC gets worse (higher), which suggests that, working with those two variables only, the best model is the one that includes both.
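
To make the AIC comparison concrete, here is a small sketch on simulated data. The data frame `sim` and the variables `x1`, `x2` and `noise` are invented for illustration. It computes AIC by hand for a binomial model, then uses `AIC()` and `drop1()` to show the kind of "what happens if I drop this term" comparison that `step` makes repeatedly.

```r
set.seed(42)  # arbitrary seed so the made-up data are reproducible
n <- 500

# Hypothetical data: x1 and x2 drive presence, noise does not
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
sim$present <- rbinom(n, size = 1,
                      prob = plogis(-0.5 + 1.2 * sim$x1 - 0.8 * sim$x2))

full    <- glm(present ~ x1 + x2 + noise, data = sim, family = "binomial")
reduced <- glm(present ~ x1 + x2,         data = sim, family = "binomial")

# For a binomial GLM, AIC = 2 * (number of coefficients) - 2 * log-likelihood
aic_by_hand <- 2 * length(coef(full)) - 2 * as.numeric(logLik(full))

# Side-by-side AIC for the two candidate models (smaller is better)
print(AIC(full, reduced))

# AIC of the model obtained by dropping each term in turn
print(drop1(full))
```

The penalty term is what stops AIC from always favouring the bigger model: each extra coefficient has to improve the log-likelihood enough to pay for itself.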

This makes it tempting to go all in:

```{r}
logistic_model <- glm(nz35 ~ age + deficit + dem + mas + mat + r2pet + rain + slope + sseas + tseas + vpd, data = plants.d, family = "binomial")
step(logistic_model)
```

There are good reasons not to do this, but just to see what we end up with, the resulting model is:

```{r}
logistic_model <- glm(formula = nz35 ~ deficit + dem + mat + rain + sseas + tseas,
                      family = "binomial", data = plants.d)
predicted_presence <- predict(layers, logistic_model, type = "response")
How good this model is at predicting accurately depends on how well it balances true positives against false positives as we change the decision threshold (i.e. the predicted probability above which we predict 'presence'). This can be summarised using an 'area under the curve' statistic available in the `pROC` package:
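
As a rough picture of what that statistic measures, the following base R sketch computes it by hand for a handful of made-up values. The vectors `observed` and `predicted` and the helper `rates_at` are hypothetical; this is not the `pROC` implementation. It sweeps the decision threshold, records the true and false positive rates at each step, and takes the area under the resulting curve.

```r
# Hypothetical toy data: observed presence (1) / absence (0) and predicted probabilities
observed  <- c(0, 0, 1, 0, 1, 1, 0, 1)
predicted <- c(0.1, 0.3, 0.35, 0.4, 0.6, 0.7, 0.2, 0.9)

# True and false positive rates when 'presence' means pred >= threshold
rates_at <- function(threshold, obs, pred) {
  called_present <- pred >= threshold
  c(tpr = sum(called_present & obs == 1) / sum(obs == 1),
    fpr = sum(called_present & obs == 0) / sum(obs == 0))
}

# Sweep the threshold from 'nothing is present' down to 'everything is present'
thresholds <- c(Inf, sort(unique(predicted), decreasing = TRUE))
roc_points <- t(sapply(thresholds, rates_at, obs = observed, pred = predicted))

# Area under the ROC curve by the trapezoid rule
tpr <- roc_points[, "tpr"]
fpr <- roc_points[, "fpr"]
auc_trapezoid <- sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)

# Equivalent view: the share of presence/absence pairs ranked correctly (ties count half)
pairs <- outer(predicted[observed == 1], predicted[observed == 0], "-")
auc_pairs <- mean((pairs > 0) + 0.5 * (pairs == 0))

print(c(trapezoid = auc_trapezoid, pairs = auc_pairs))
```

Both calculations give 0.9375 for these toy values: 15 of the 16 presence/absence pairs are ranked correctly. A value of 0.5 would mean the predictions rank presences no better than chance.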
  tm_facets(free.scales = TRUE) # this allows a different scale for each layer
```

(Note that we shouldn't have to supply the `title = ...` setting, but [there seems to be a bug](https://github.com/mtennekes/tmap/issues/166) and this seems to work around it.)