Merge pull request #357 from Robinlovelace/revert-356-copyedit_c14
Revert "Copyedit c14"
jannes-m authored Dec 22, 2018
2 parents ae32484 + 5715786 commit 815d44a
Showing 1 changed file with 18 additions and 20 deletions.
38 changes: 18 additions & 20 deletions 14-eco.Rmd
@@ -3,7 +3,7 @@
## Prerequisites {-}

This chapter assumes you have a strong grasp of geographic data analysis and processing, covered in Chapters \@ref(spatial-class) to \@ref(geometric-operations).
In it you will also make use of R's interfaces to dedicated GIS software, and spatial cross validation, topics covered in Chapters \@ref(gis) and \@ref(spatial-cv), respectively.
In it you will also make use of R's interfaces to dedicated GIS software, and spatial cross-validation, topics covered in Chapters \@ref(gis) and \@ref(spatial-cv), respectively.

The chapter uses the following packages:

@@ -25,7 +25,7 @@ Fog oases are one of the most fascinating vegetation formations we have ever enc
These formations, locally termed *lomas*, develop on mountains along the coastal deserts of Peru and Chile.^[Similar vegetation formations develop also in other parts of the world, e.g., in Namibia and along the coasts of Yemen and Oman [@galletti_land_2016].]
The deserts' extreme conditions and remoteness provide the habitat for a unique ecosystem, including species endemic to the fog oases.
Despite the arid conditions and low levels of precipitation of around 30-50 mm per year on average, fog deposition increases the amount of water available to plants during austral winter.
This results in green southern-facing mountain slopes along the coastal strip of Peru (Fig. \@ref(fig:study-area-mongon)).
This results in green southern-facing mountain slopes along the coastal strip of Peru (Figure \@ref(fig:study-area-mongon)).
This fog, which develops below the temperature inversion caused by the cold Humboldt current in austral winter, provides the name for this habitat.
Every few years, the El Niño phenomenon brings torrential rainfall to this sun-baked environment [@dillon_lomas_2003].
This causes the desert to bloom, and provides tree seedlings a chance to develop roots long enough to survive the following arid conditions.
@@ -37,7 +37,7 @@ To effectively protect the last remnants of this unique vegetation ecosystem, ev
For example, most Peruvians live in the coastal desert, and *lomas* mountains are frequently the closest "green" destination.

In this chapter we will demonstrate ecological applications of some of the techniques learned in the previous chapters.
This case study will involve analyzing the composition and the spatial distribution of the vascular plants on the southern slope of Mt. Mongón, a *lomas* mountain near Casma on the central northern coast of Peru (Fig. \@ref(fig:study-area-mongon)).
This case study will involve analyzing the composition and the spatial distribution of the vascular plants on the southern slope of Mt. Mongón, a *lomas* mountain near Casma on the central northern coast of Peru (Figure \@ref(fig:study-area-mongon)).

```{r study-area-mongon, echo=FALSE, fig.cap="The Mt. Mongón study area, from Muenchow, Schratz, and Brenning (2017).", out.width="60%", fig.scap="The Mt. Mongón study area."}
knitr::include_graphics("https://user-images.githubusercontent.com/1825120/38989956-6eae7c9a-43d0-11e8-8f25-3dd3594f7e74.png")
@@ -68,8 +68,7 @@ data("study_area", "random_points", "comm", "dem", "ndvi", package = "RQGIS")

`study_area` is an `sf` polygon representing the outlines of the study area.
`random_points` is an `sf` object, and contains the 100 randomly chosen sites.
`comm` is a community matrix of the wide data format [@wickham_tidy_2014] where the rows represent the visited sites in the field and the columns the observed species.
^[In statistics, this is also called a contingency table or cross-table.]
`comm` is a community matrix of the wide data format [@wickham_tidy_2014] where the rows represent the visited sites in the field and the columns the observed species.^[In statistics, this is also called a contingency table or cross-table.]

```{r}
# sites 35 to 40 and corresponding occurrences of the first five species in the
@@ -79,7 +78,7 @@ comm[35:40, 1:5]

The values represent species cover per site, and were recorded as the area covered by a species in proportion to the site area in percentage points (%; please note that one site can have >100% due to overlapping cover between individual plants).
The rownames of `comm` correspond to the `id` column of `random_points`.
`dem` is the digital elevation model for the study area, and `ndvi` is the Normalized Difference Vegetation Index (NDVI) computed from the red and near-infrared channels of a Landsat scene (see Section \@ref(local-operations) and `?ndvi`).
`dem` is the digital elevation model (DEM) for the study area, and `ndvi` is the Normalized Difference Vegetation Index (NDVI) computed from the red and near-infrared channels of a Landsat scene (see Section \@ref(local-operations) and `?ndvi`).
Visualizing the data helps to get more familiar with it, as shown in Figure \@ref(fig:sa-mongon) where the `dem` is overplotted by the `random_points` and the `study_area`.
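
As a reminder of what the `ndvi` layer encodes, the index is the normalized difference of the near-infrared and red reflectances, (NIR - red) / (NIR + red).
The following is a minimal sketch with made-up reflectance values; the function name `calc_ndvi()` and the numbers are illustrative assumptions and not part of the chapter's data.

```{r}
# NDVI = (NIR - red) / (NIR + red); possible values range from -1 to 1
calc_ndvi = function(nir, red) {
  (nir - red) / (nir + red)
}
# hypothetical reflectances: dense vegetation vs. sparsely vegetated ground
calc_ndvi(nir = c(0.50, 0.30), red = c(0.10, 0.25))
```

Higher values indicate denser green vegetation, which is why NDVI is a plausible predictor in this setting.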

```{r, eval=FALSE, echo=FALSE}
@@ -120,13 +119,12 @@ grid.text("m asl", x = unit(0.8, "npc"), y = unit(0.75, "npc"),
```

The next step is to compute variables which we will predominantly need for the modeling and predictive mapping (see Section \@ref(predictive-mapping)) but also for aligning the NMDS axes with the main gradient, altitude and humidity, respectively, in the study area (see Section \@ref(nmds)).
The next step is to compute variables which we will not only need for the modeling and predictive mapping (see Section \@ref(predictive-mapping)) but also for aligning the NMDS axes with the main gradient in the study area, altitude and humidity, respectively (see Section \@ref(nmds)).

Specifically, we will compute catchment slope and catchment area from a digital elevation model using R-GIS bridges (see Chapter \@ref(gis)).
Curvatures might also represent valuable predictors, in the exercise section you can find out how they would change the modeling result.
Curvatures might also represent valuable predictors, in the Exercise section you can find out how they would change the modeling result.

To compute catchment area and catchment slope, we will make use of the `saga:sagawetnessindex` function.
^[Admittedly, it is a bit unsatisfying that the only way of knowing that `sagawetnessindex` computes the desired terrain attributes is to be familiar with SAGA.]
To compute catchment area and catchment slope, we will make use of the `saga:sagawetnessindex` function.^[Admittedly, it is a bit unsatisfying that the only way of knowing that `sagawetnessindex` computes the desired terrain attributes is to be familiar with SAGA.]
`get_usage()` returns all function parameters and default values of a specific geoalgorithm.
Here, we present only a selection of the complete output.

@@ -196,7 +194,7 @@ random_points[, names(ep)] = raster::extract(ep, as(random_points, "Spatial"))

Ordinations are a popular tool in vegetation science to extract the main information, frequently corresponding to ecological gradients, from large species-plot matrices mostly filled with 0s.
However, they are also used in remote sensing, the soil sciences, geomarketing and many other fields.
If you are unfamiliar with ordination techniques or in need of a refresher, have a look at Michael W. Palmers [webpage](http://ordination.okstate.edu/overview.htm) for a short introduction to popular ordination techniques in ecology and at @borcard_numerical_2011 for a deeper look on how to apply these techniques in R.
If you are unfamiliar with ordination techniques or in need of a refresher, have a look at Michael W. Palmer's [web page](http://ordination.okstate.edu/overview.htm) for a short introduction to popular ordination techniques in ecology and at @borcard_numerical_2011 for a deeper look on how to apply these techniques in R.
**vegan**'s package documentation is also a very helpful resource (`vignette(package = "vegan")`).

Principal component analysis (PCA) is probably the most famous ordination technique.
@@ -207,7 +205,7 @@ For one, relationships are usually non-linear along environmental gradients.
That means the presence of a plant usually follows a unimodal relationship along a gradient (e.g., humidity, temperature or salinity) with a peak at the most favorable conditions and declining ends towards the unfavorable conditions.

Secondly, the joint absence of a species in two plots is hardly an indication for similarity.
Suppose a plant species is absent from the driest (e.g., an extreme desert) and the most moist locations (e.g., a tree savanna) of our sampling.
Suppose a plant species is absent from the driest (e.g., an extreme desert) and the most moistest locations (e.g., a tree savanna) of our sampling.
Then we really should refrain from counting this as a similarity because it is very likely that the only thing these two completely different environmental settings have in common in terms of floristic composition is the shared absence of species (except for rare ubiquitous species).
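
The double-zero problem can be made concrete with a toy example.
The sketch below uses two hypothetical sites that share no species: a coefficient that counts joint absences (simple matching) still rates them as fairly similar, whereas the Jaccard dissimilarity, which ignores joint absences, rates them as completely different.
The site and species names are invented for illustration.

```{r}
# two hypothetical sites with presence-absence records of five species
pa = rbind(
  desert  = c(sp1 = 1, sp2 = 0, sp3 = 0, sp4 = 0, sp5 = 0),
  savanna = c(sp1 = 0, sp2 = 0, sp3 = 0, sp4 = 0, sp5 = 1)
)
# simple matching treats the three joint absences as agreement
simple_matching = mean(pa["desert", ] == pa["savanna", ])
# dist(method = "binary") ignores joint absences (Jaccard dissimilarity)
jaccard = as.numeric(dist(pa, method = "binary"))
c(simple_matching = simple_matching, jaccard = jaccard)
```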

Non-metric multidimensional scaling (NMDS) is one popular dimension-reducing technique in ecology [@vonwehrden_pluralism_2009].
@@ -217,7 +215,7 @@ The lower the stress value, the better the ordination, i.e., the low-dimensional
Stress values lower than 10 represent an excellent fit, stress values of around 15 are still good, and values greater than 20 represent a poor fit [@mccune_analysis_2002].
In R, `metaMDS()` of the **vegan** package can execute an NMDS.
As input, it expects a community matrix with the sites as rows and the species as columns.
Often ordinations using presence-absence data yield better results (in terms of explained variance) though the price is, of course, a less informative input matrix (see also exercises).
Often ordinations using presence-absence data yield better results (in terms of explained variance) though the price is, of course, a less informative input matrix (see also Exercises).
`decostand()` converts numerical observations into presences and absences with 1 indicating the occurrence of a species and 0 the absence of a species.
Ordination techniques such as NMDS require at least one observation per site.
Hence, we need to dismiss all sites in which no species were found.
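
The two preprocessing steps just described can be sketched in base R with a small made-up community matrix.
`toy_comm` and its values are hypothetical; on the real data, `decostand()` with `method = "pa"` performs the same conversion.

```{r}
# hypothetical cover values (%) for four sites and three species
toy_comm = matrix(c(20,  0,   5,
                     0,  0,   0,
                     0, 60,   0,
                    10,  0, 120),
                  ncol = 3, byrow = TRUE,
                  dimnames = list(paste0("site", 1:4), paste0("sp", 1:3)))
# convert cover values to presence (1) and absence (0)
pa = (toy_comm > 0) * 1
# dismiss sites in which no species were found (here: site2)
pa = pa[rowSums(pa) > 0, ]
pa
```

A matrix prepared this way has at least one observation per site and could then be passed on to `metaMDS()`.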
@@ -396,12 +394,12 @@ for each split of the tree—in other words, that bagging should be done.
### **mlr** building blocks

The code in this section largely follows the steps we have introduced in Section \@ref(svm).
The only differences are:
The only differences are the following:

1. The response variable is numeric, hence a regression task will replace the classification task of Section \@ref(svm).
1. Instead of the AUROC which can only be used for categorical response variables, we will use the root mean squared error (RMSE) as performance measure.
1. We use a random forest model instead of a support vector machine which naturally goes along with different hyperparameters.
1. We are leaving the assessment of a bias-reduced performance measure as an exercise to the reader (see exercises).
1. We are leaving the assessment of a bias-reduced performance measure as an exercise to the reader (see Exercises).
Instead we show how to tune hyperparameters for (spatial) predictions.
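
For reference, the RMSE is the square root of the mean squared difference between observed and predicted values.
A minimal sketch with invented numbers follows; `rmse()` is a hypothetical helper written for illustration, not an **mlr** function.

```{r}
# root mean squared error of predictions against observations
rmse = function(obs, pred) {
  sqrt(mean((obs - pred)^2))
}
# invented observed and predicted values
rmse(obs = c(1, 2, 3), pred = c(1, 2, 5))
```

Because the errors are squared before averaging, a few large deviations weigh more heavily than many small ones.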

Remember that 125,500 models were necessary to retrieve bias-reduced performance estimates when using 100-repeated 5-fold spatial cross-validation and a random search of 50 iterations (see Section \@ref(svm)).
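
The figure of 125,500 can be reproduced with a little arithmetic.
The sketch assumes that the tuning level also used five folds, which is not stated in this section; the variable names are illustrative.

```{r}
reps = 100        # repetitions of the outer cross-validation
outer_folds = 5   # folds of the outer (performance estimation) level
inner_folds = 5   # folds of the inner (tuning) level; an assumption
iters = 50        # random search iterations
# models fitted to tune the hyperparameters
tuning = reps * outer_folds * inner_folds * iters
# one additional model per outer fold for the performance estimate
performance = reps * outer_folds
tuning + performance
```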
@@ -587,7 +585,7 @@ grid.text("NMDS1", x = unit(0.75, "npc"), y = unit(0.75, "npc"),
```

The predictive mapping clearly reveals distinct vegetation belts (Figure \@ref(fig:rf-pred)).
Please refer to @muenchow_soil_2013 for a detailed descriptions of vegetation belts on **lomas** mountains.
Please refer to @muenchow_soil_2013 for a detailed description of vegetation belts on **lomas** mountains.
The blue color tones represent the so-called *Tillandsia*-belt.
*Tillandsia* is a highly adapted genus especially found in high quantities at the sandy and quite desertic foot of *lomas* mountains.
The yellow color tones refer to a herbaceous vegetation belt with a much higher plant cover compared to the *Tillandsia*-belt.
@@ -610,18 +608,18 @@ In terms of methodology, a few additional points could be addressed:
- It would be interesting to also model the second ordination axis, and to subsequently find an innovative way of visualizing jointly the modeled scores of the two axes in one prediction map.
- If we were interested in interpreting the model in an ecologically meaningful way, we should probably use (semi-)parametric models [@muenchow_predictive_2013;@zuur_mixed_2009;@zuur_beginners_2017].
However, there are at least approaches that help to interpret machine learning models such as random forests (see, e.g., [https://mlr-org.github.io/interpretable-machine-learning-iml-and-mlr/](https://mlr-org.github.io/interpretable-machine-learning-iml-and-mlr/)).
- A sequential model-based optimization (SMBO) might be preferable to the here used random search for hyperparameter optimization [@probst_hyperparameters_2018].
- A sequential model-based optimization (SMBO) might be preferable to the random search for hyperparameter optimization used in this chapter [@probst_hyperparameters_2018].

Finally, please note that random forest and other machine-learning models are frequently used in a setting with lots of observations and many predictors, much more than used in this chapter, and where it is unclear which variables and variable interactions contribute to explaining the response.
Additionally, the relationships might be highly non-linear.
In our use case, the relationships between response and predictors are pretty clear, there is only a slight amount of non-linearity and the number of observations and predictors is low.
Hence, it might be worth to try a linear model.
A linear model is much easier to explain and understand than a random forest model, and therefore to be preferred (law of parsimony), additionally it is computationally less demanding (see exercises).
Hence, it might be worth trying a linear model.
A linear model is much easier to explain and understand than a random forest model, and therefore to be preferred (law of parsimony), additionally it is computationally less demanding (see Exercises).
If the linear model cannot cope with the degree of non-linearity present in the data, one could also try a generalized additive model (GAM).
The point here is that the toolbox of a data scientist consists of more than one tool, and it is your responsibility to select the tool best suited for the task or purpose at hand.
Here, we wanted to introduce the reader to random forest modeling and how to use the corresponding results for spatial predictions.
For this purpose, a well-studied dataset with known relationships between response and predictors is appropriate.
However, this does not imply that the random forest model has returned the best result in terms of predictive performance (see exercises).
However, this does not imply that the random forest model has returned the best result in terms of predictive performance (see Exercises).

## Exercises
