
Commit 67aa719

Author: David O'Sullivan
Message: updated multivariate analysis material for 2023
1 parent cd08058

13 files changed: +52 additions, −31 deletions

labs/multivariate-analysis/01-multivariate-analysis-the-problem.Rmd

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 # Multivariate data
 In this document, we look at the general problem of dealing with highly multivariate data, which in later documents we will tackle using tools from the [*R* `tidyverse`](02-the-r-tidyverse.md), and techniques broadly categorised as [dimensional reduction](03-dimensional-reduction.md), [classification](04-classification-and-clustering.md), and (next week) statistical modelling.
````

labs/multivariate-analysis/01-multivariate-analysis-the-problem.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 # Multivariate data
 In this document, we look at the general problem of dealing with highly multivariate data, which in later documents we will tackle using tools from the [*R* `tidyverse`](02-the-r-tidyverse.md), and techniques broadly categorised as [dimensional reduction](03-dimensional-reduction.md), [classification](04-classification-and-clustering.md), and (next week) statistical modelling.
````

labs/multivariate-analysis/02-the-tidyverse.Rmd

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/02-the-tidyverse.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/03-dimensional-reduction.Rmd

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/03-dimensional-reduction.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/04-classification-and-clustering.Rmd

Lines changed: 6 additions & 2 deletions

````diff
@@ -1,5 +1,6 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
+
 ```{r message = FALSE}
 library(sf)
 library(tmap)
@@ -10,6 +11,7 @@ sfd <- st_read('sf_demo.geojson')
 sfd <- drop_na(sfd)
 sfd.d <- st_drop_geometry(sfd)
 ```
+
 ## Clustering
 Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.
@@ -31,6 +33,7 @@ Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/)
 It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.

 So here is how we accomplish this in R.
+
 ```{r}
 km <- kmeans(sfd.d, 7)
 sfd$km7 <- as.factor(km$cluster)
@@ -60,6 +63,7 @@ The algorithm in this case looks something like
 This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction, repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data, where cluster detection is known as *community detection* (more on that next week).

 In *R*, the necessary functions are provided by the `hclust` function
+
 ```{R}
 hc <- hclust(dist(sfd.d))
 plot(hc)
@@ -104,4 +108,4 @@ Although geodemographics is a very visible example of cluster-based classification

 Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine-learning.

-OK... on to [the assignment](05-assignment-multivariate-analysis.md).
+OK... on to [statistical modelling](05-statistical-models.md).
````

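The `kmeans` step in the diff above depends on the course's `sf_demo.geojson` data, which is not part of this commit. As a self-contained sketch of the same technique on synthetic data (all names here are illustrative, not from the course materials), `set.seed` and `nstart` address the non-determinism the text warns about:

```r
# Synthetic stand-in for a multivariate dataset: 100 observations, 4 variables.
set.seed(422)                 # fix the random choice of initial cluster centres
d <- data.frame(matrix(rnorm(400), ncol = 4))

# Standardising first stops any one variable dominating the distance calculation;
# nstart = 25 reruns the algorithm from 25 random starts and keeps the best result.
km <- kmeans(scale(d), centers = 7, nstart = 25)

# Cluster IDs are arbitrary labels, not quantities, so store them as a factor.
clusters <- as.factor(km$cluster)
table(clusters)               # how many observations landed in each cluster
```

With a fixed seed the assignment is reproducible; rerun without `set.seed` and the cluster labels (and possibly the partition itself) will change.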
labs/multivariate-analysis/04-classification-and-clustering.md

Lines changed: 12 additions & 10 deletions

````diff
@@ -1,5 +1,6 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
+
 ```{r message = FALSE}
 library(sf)
 library(tmap)
@@ -10,6 +11,7 @@ sfd <- st_read('sf_demo.geojson')
 sfd <- drop_na(sfd)
 sfd.d <- st_drop_geometry(sfd)
 ```
+
 ## Clustering
 Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.
@@ -31,14 +33,15 @@ Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/)
 It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.

 So here is how we accomplish this in R.
+
 ```{r}
 km <- kmeans(sfd.d, 7)
 sfd$km7 <- as.factor(km$cluster)

 tmap_mode('view')
 tm_shape(sfd) +
   tm_polygons(col = 'km7') +
-  tm_legend(legend.outside = TRUE)
+  tm_legend(legend.outside = T)
 ```

 The `kmeans` function does the work, and requires that we decide in advance how many clusters we want (I picked 7 just because... well... SEVEN). We can retrieve the resulting cluster assignments from the output `km` as `km$cluster`, which we convert to a `factor`. The numerical cluster number is meaningless, so the cluster number is properly speaking a factor, and designating it as such will allow `tmap` and other packages to handle it intelligently. We can then add it to the spatial data and map it like any other variable.
@@ -60,6 +63,7 @@ The algorithm in this case looks something like
 This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction, repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data, where cluster detection is known as *community detection* (more on that next week).

 In *R*, the necessary functions are provided by the `hclust` function
+
 ```{R}
 hc <- hclust(dist(sfd.d))
 plot(hc)
@@ -70,17 +74,17 @@ Blimey! What the heck is that thing? As the title says it is a *cluster dendrogram*
 As you can see, even for this relatively small dataset of only 189 observations, the dendrogram is not easy to read. Again, interactive visualization methods can be used to help with this. However, another option is to 'cut the dendrogram', specifying either the height value to do it at, or the number of clusters desired. In this case, it looks like 6 is not a bad option, so...

 ```{r}
-sfd$hc5 <- as.factor(cutree(hc, k = 5))
+sfd$hc5 <- cutree(hc, k = 5)
 tm_shape(sfd) +
-  tm_polygons(col = 'hc5') +
+  tm_polygons(col = 'hc5', palette = 'Set2', style = "cat") +
   tm_legend(legend.outside = TRUE)
 ```

 It's good to see that there are clear similarities between this output and the k-means one (at least there were the first time I ran the analysis!)

-As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information. Other clustering methods are also available. A recently popular one has been the DBSCAN family of methods ([here is an R package](https://github.com/mhahsler/dbscan)).
+As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information.

-Once clusters have been assigned, we can do further analysis comparing characteristics of different clusters. For example
+Once clusters have been assigned, we can do further analysis comparing characteristics of different clusters. For example

 ```{r}
 boxplot(sfd$Punemployed ~ sfd$hc5, xlab = 'Cluster', ylab = 'Unemployment')
@@ -89,9 +93,7 @@ boxplot(sfd$Punemployed ~ sfd$hc5, xlab = 'Cluster', ylab = 'Unemployment')
 Or we can aggregate the clusters into single areas and assign them values based on the underlying data of all the member units:

 ```{r}
-sfd.c <- sfd %>%
-  group_by(hc5) %>% # group_by is how you do a 'dissolve' with sf data
-  summarise_if(is.numeric, mean) # this is how you apply a function to combine results
+sfd.c <- aggregate(sfd, by = list(sfd$hc5), mean)
 plot(sfd.c, pal = RColorBrewer::brewer.pal(7, "Reds"))
 ```
@@ -106,4 +108,4 @@ Although geodemographics is a very visible example of cluster-based classification

 Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine-learning.

-OK... on to [the assignment](05-assignment-multivariate-analysis.md).
+OK... on to [statistical modelling](05-statistical-models.md).
````
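The hierarchical clustering and cluster-comparison steps shown in this diff can likewise be sketched on synthetic data using only base R; `dist`, `hclust`, `cutree`, and `tapply` here stand in for the course data and the `tmap` mapping, and the variable names are invented:

```r
set.seed(422)
d <- data.frame(v1 = rnorm(50), v2 = rnorm(50), v3 = rnorm(50))

hc <- hclust(dist(d))      # complete linkage on Euclidean distances, the defaults
# plot(hc)                 # would draw the cluster dendrogram

# 'Cut the dendrogram' into 5 groups, as cutree(hc, k = 5) does in the lab text;
# group numbers are labels, so store them as a factor.
grp <- as.factor(cutree(hc, k = 5))
table(grp)                 # group sizes

# Compare a variable across clusters, analogous to the lab's boxplot step.
tapply(d$v1, grp, mean)
```

Swapping the linkage, e.g. `hclust(dist(d), method = "ward.D2")`, or the distance metric in `dist` can change the groups substantially, which is the point the text makes about the `method` option.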
Lines changed: 17 additions & 0 deletions

````diff
@@ -0,0 +1,17 @@
+#### GISC 422 T2 2023
+# Assignment 4 Geodemographics in Wellington
+I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
+
+Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever; in other words, work with the multivariate, multidimensional aspect of the data).
+
+There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
+
+When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!
+
+Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
+
+Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
+
+Prepare your report in R Markdown and run it to produce a final output PDF or Word document (I prefer it if you can convert Word to PDF format) for submission (this means I will see your code as well as the outputs!) Avoid any outputs that are just long lists of data, as these are not very informative (you can do this by prefixing the code chunk that produces the output with ```{r, results = FALSE}).
+
+Submit your report to the dropbox provided on Blackboard by **24 May**.
````
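The standardisation step the assignment asks for (counts rescaled to a similar numerical range, using the total population column) can be sketched in base R; the data frame and column names below are invented for illustration and will not match the census table:

```r
# Toy stand-in for the census counts, with a total population column.
d <- data.frame(pop        = c(100, 250, 80, 120),
                unemployed = c(10, 30, 4, 18),
                renters    = c(40, 100, 20, 66))

# Step 1: convert raw counts to proportions of each area's population,
# so big and small areas are comparable.
props <- d[, c("unemployed", "renters")] / d$pop

# Step 2: rescale each variable to mean 0 and sd 1 (a z-score),
# so all variables occupy a similar numerical range.
z <- scale(props)
colMeans(z)    # effectively zero for every column
```

In the tidyverse idiom the assignment suggests, the same two steps would be `mutate(across(...))` calls; the base-R version above just makes the arithmetic explicit.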

labs/multivariate-analysis/05-assignment-multivariate-analysis.md

Lines changed: 5 additions & 5 deletions

````diff
@@ -1,14 +1,14 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 # Assignment 4 Geodemographics in Wellington
-I have assembled some demographic data for Wellington from the 2018 census at the Statistical Area 1 level in a file called `welly.gpkg` which you should find in the folder with this week's materials. The data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
+I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).

-Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever, in other words, work with the multivariate, multidimensional aspect of the data).
+Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever; in other words, work with the multivariate, multidimensional aspect of the data).

-There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
+There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!

 When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!

-Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
+Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.

 Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
````
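The variable-reduction step the assignment points at (`dplyr::select`) can be sketched with base R subsetting so the example needs no extra packages; the data frame and column names are invented:

```r
# Invented stand-in for the wide census table.
d <- data.frame(pop = c(100, 250), age_0_14 = c(20, 60),
                age_15_64 = c(65, 160), dwellings = c(40, 90))

# Keep only the population column and the age-structure columns.
# (With the tidyverse loaded, the equivalent would be
#  d |> dplyr::select(pop, starts_with("age_")).)
d_small <- d[, c("pop", grep("^age_", names(d), value = TRUE))]
names(d_small)
```

Deciding the selection pattern up front, as the assignment advises, makes this a one-line step rather than an afterthought.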