
Commit 67aa719

Author: David O'Sullivan
Message: updated multivariate analysis material for 2023
1 parent cd08058

13 files changed: +52 additions, −31 deletions

labs/multivariate-analysis/01-multivariate-analysis-the-problem.Rmd

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 # Multivariate data
 In this document, we look at the general problem of dealing with highly multivariate data, which in later documents we will tackle using tools from the [*R* `tidyverse`](02-the-r-tidyverse.md), and techniques broadly categorised as [dimensional reduction](03-dimensional-reduction.md), [classification](04-classification-and-clustering.md), and (next week) statistical modelling.
````

labs/multivariate-analysis/01-multivariate-analysis-the-problem.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 # Multivariate data
 In this document, we look at the general problem of dealing with highly multivariate data, which in later documents we will tackle using tools from the [*R* `tidyverse`](02-the-r-tidyverse.md), and techniques broadly categorised as [dimensional reduction](03-dimensional-reduction.md), [classification](04-classification-and-clustering.md), and (next week) statistical modelling.
````

labs/multivariate-analysis/02-the-tidyverse.Rmd

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/02-the-tidyverse.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/03-dimensional-reduction.Rmd

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/03-dimensional-reduction.md

Lines changed: 1 addition & 1 deletion

````diff
@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
 ```{r message = FALSE}
 library(sf)
````

labs/multivariate-analysis/04-classification-and-clustering.Rmd

Lines changed: 6 additions & 2 deletions

````diff
@@ -1,5 +1,6 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
+
 ```{r message = FALSE}
 library(sf)
 library(tmap)
@@ -10,6 +11,7 @@ sfd <- st_read('sf_demo.geojson')
 sfd <- drop_na(sfd)
 sfd.d <- st_drop_geometry(sfd)
 ```
+
 ## Clustering
 Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.
@@ -31,6 +33,7 @@ Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/)
 It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.

 So here is how we accomplish this in R.
+
 ```{r}
 km <- kmeans(sfd.d, 7)
 sfd$km7 <- as.factor(km$cluster)
@@ -60,6 +63,7 @@ The algorithm in this case looks something like
 This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction, repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data, where cluster detection is known as *community detection* (more on that next week).

 In *R*, the necessary functions are provided by the `hclust` function
+
 ```{R}
 hc <- hclust(dist(sfd.d))
 plot(hc)
@@ -104,4 +108,4 @@ Although geodemographics is a very visible example of cluster-based classification

 Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine-learning.

-OK... on to [the assignment](05-assignment-multivariate-analysis.md).
+OK... on to [statistical modelling](05-statistical-models.md).
````

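The `kmeans` step in the diff above depends on the course's `sf_demo.geojson` data, which is not part of this commit. As a self-contained sketch of the same technique on synthetic data (all names here are illustrative, not from the course materials), `set.seed` and `nstart` address the non-determinism the text warns about:

```r
# Synthetic stand-in for a multivariate dataset: 100 observations, 4 variables.
set.seed(422)                 # fix the random choice of initial cluster centres
d <- data.frame(matrix(rnorm(400), ncol = 4))

# Standardising first stops any one variable dominating the distance calculation;
# nstart = 25 reruns the algorithm from 25 random starts and keeps the best result.
km <- kmeans(scale(d), centers = 7, nstart = 25)

# Cluster IDs are arbitrary labels, not quantities, so store them as a factor.
clusters <- as.factor(km$cluster)
table(clusters)               # how many observations landed in each cluster
```

With a fixed seed the assignment is reproducible; rerun without `set.seed` and the cluster labels (and possibly the partition itself) will change.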
labs/multivariate-analysis/04-classification-and-clustering.md

Lines changed: 12 additions & 10 deletions

````diff
@@ -1,5 +1,6 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 First just make sure we have all the data and libraries we need set up.
+
 ```{r message = FALSE}
 library(sf)
 library(tmap)
@@ -10,6 +11,7 @@ sfd <- st_read('sf_demo.geojson')
 sfd <- drop_na(sfd)
 sfd.d <- st_drop_geometry(sfd)
 ```
+
 ## Clustering
 Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.
@@ -31,14 +33,15 @@ Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/)
 It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.

 So here is how we accomplish this in R.
+
 ```{r}
 km <- kmeans(sfd.d, 7)
 sfd$km7 <- as.factor(km$cluster)

 tmap_mode('view')
 tm_shape(sfd) +
   tm_polygons(col = 'km7') +
-  tm_legend(legend.outside = TRUE)
+  tm_legend(legend.outside = T)
 ```

 The `kmeans` function does the work, and requires that we decide in advance how many clusters we want (I picked 7 just because... well... SEVEN). We can retrieve the resulting cluster assignments from the output `km` as `km$cluster`, which we convert to a `factor`. The numerical cluster number is meaningless, so the cluster number is properly speaking a factor, and designating it as such will allow `tmap` and other packages to handle it intelligently. We can then add it to the spatial data and map it like any other variable.
@@ -60,6 +63,7 @@ The algorithm in this case looks something like
 This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction, repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data, where cluster detection is known as *community detection* (more on that next week).

 In *R*, the necessary functions are provided by the `hclust` function
+
 ```{R}
 hc <- hclust(dist(sfd.d))
 plot(hc)
@@ -70,17 +74,17 @@ Blimey! What the heck is that thing? As the title says it is a *cluster dendrogram*
 As you can see, even for this relatively small dataset of only 189 observations, the dendrogram is not easy to read. Again, interactive visualization methods can be used to help with this. However, another option is to 'cut the dendrogram', specifying either the height value to do it at, or the number of clusters desired. In this case, it looks like 6 is not a bad option, so...

 ```{r}
-sfd$hc5 <- as.factor(cutree(hc, k = 5))
+sfd$hc5 <- cutree(hc, k = 5)
 tm_shape(sfd) +
-  tm_polygons(col = 'hc5') +
+  tm_polygons(col = 'hc5', palette = 'Set2', style = "cat") +
   tm_legend(legend.outside = TRUE)
 ```

 It's good to see that there are clear similarities between this output and the k-means one (at least there were the first time I ran the analysis!)

-As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information. Other clustering methods are also available. A recently popular one has been the DBSCAN family of methods ([here is an R package](https://github.com/mhahsler/dbscan)).
+As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information.

-Once clusters have been assigned, we can do further analysis comparing characteristics of different clusters. For example
+Once clusters have been assigned, we can do further analysis comparing characteristics of different clusters. For example

 ```{r}
 boxplot(sfd$Punemployed ~ sfd$hc5, xlab = 'Cluster', ylab = 'Unemployment')
@@ -89,9 +93,7 @@ boxplot(sfd$Punemployed ~ sfd$hc5, xlab = 'Cluster', ylab = 'Unemployment')
 Or we can aggregate the clusters into single areas and assign them values based on the underlying data of all the member units:

 ```{r}
-sfd.c <- sfd %>%
-  group_by(hc5) %>% # group_by is how you do a 'dissolve' with sf data
-  summarise_if(is.numeric, mean) # this is how you apply a function to combine results
+sfd.c <- aggregate(sfd, by = list(sfd$hc5), mean)
 plot(sfd.c, pal = RColorBrewer::brewer.pal(7, "Reds"))
 ```
@@ -106,4 +108,4 @@ Although geodemographics is a very visible example of cluster-based classification

 Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine-learning.

-OK... on to [the assignment](05-assignment-multivariate-analysis.md).
+OK... on to [statistical modelling](05-statistical-models.md).
````
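The hierarchical clustering and cluster-comparison steps shown in this diff can likewise be sketched on synthetic data using only base R; `dist`, `hclust`, `cutree`, and `tapply` here stand in for the course data and the `tmap` mapping, and the variable names are invented:

```r
set.seed(422)
d <- data.frame(v1 = rnorm(50), v2 = rnorm(50), v3 = rnorm(50))

hc <- hclust(dist(d))      # complete linkage on Euclidean distances, the defaults
# plot(hc)                 # would draw the cluster dendrogram

# 'Cut the dendrogram' into 5 groups, as cutree(hc, k = 5) does in the lab text;
# group numbers are labels, so store them as a factor.
grp <- as.factor(cutree(hc, k = 5))
table(grp)                 # group sizes

# Compare a variable across clusters, analogous to the lab's boxplot step.
tapply(d$v1, grp, mean)
```

Swapping the linkage, e.g. `hclust(dist(d), method = "ward.D2")`, or the distance metric in `dist` can change the groups substantially, which is the point the text makes about the `method` option.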
Lines changed: 17 additions & 0 deletions

````diff
@@ -0,0 +1,17 @@
+#### GISC 422 T2 2023
+# Assignment 4 Geodemographics in Wellington
+I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
+
+Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever; in other words, work with the multivariate, multidimensional aspect of the data).
+
+There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
+
+When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!
+
+Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
+
+Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
+
+Prepare your report in R Markdown and run it to produce a final output PDF or Word document (I prefer it if you can convert Word to PDF format) for submission (this means I will see your code as well as the outputs!) Avoid any outputs that are just long lists of data, as these are not very informative (you can do this by prefixing the code chunk that produces the output with ```{r, results = FALSE}).
+
+Submit your report to the dropbox provided on Blackboard by **24 May**.
````
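The standardisation step the assignment asks for (counts rescaled to a similar numerical range, using the total population column) can be sketched in base R; the data frame and column names below are invented for illustration and will not match the census table:

```r
# Toy stand-in for the census counts, with a total population column.
d <- data.frame(pop        = c(100, 250, 80, 120),
                unemployed = c(10, 30, 4, 18),
                renters    = c(40, 100, 20, 66))

# Step 1: convert raw counts to proportions of each area's population,
# so big and small areas are comparable.
props <- d[, c("unemployed", "renters")] / d$pop

# Step 2: rescale each variable to mean 0 and sd 1 (a z-score),
# so all variables occupy a similar numerical range.
z <- scale(props)
colMeans(z)    # effectively zero for every column
```

In the tidyverse idiom the assignment suggests, the same two steps would be `mutate(across(...))` calls; the base-R version above just makes the arithmetic explicit.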

labs/multivariate-analysis/05-assignment-multivariate-analysis.md

Lines changed: 5 additions & 5 deletions

````diff
@@ -1,14 +1,14 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
 # Assignment 4 Geodemographics in Wellington
-I have assembled some demographic data for Wellington from the 2018 census at the Statistical Area 1 level in a file called `welly.gpkg` which you should find in the folder with this week's materials. The data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
+I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).

-Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever, in other words, work with the multivariate, multidimensional aspect of the data).
+Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever; in other words, work with the multivariate, multidimensional aspect of the data).

-There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
+There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!

 When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!

-Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
+Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.

 Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
````
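The variable-reduction step the assignment points at (`dplyr::select`) can be sketched with base R subsetting so the example needs no extra packages; the data frame and column names are invented:

```r
# Invented stand-in for the wide census table.
d <- data.frame(pop = c(100, 250), age_0_14 = c(20, 60),
                age_15_64 = c(65, 160), dwellings = c(40, 90))

# Keep only the population column and the age-structure columns.
# (With the tidyverse loaded, the equivalent would be
#  d |> dplyr::select(pop, starts_with("age_")).)
d_small <- d[, c("pop", grep("^age_", names(d), value = TRUE))]
names(d_small)
```

Deciding the selection pattern up front, as the assignment advises, makes this a one-line step rather than an afterthought.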