labs/multivariate-analysis/01-multivariate-analysis-the-problem.Rmd (+1 -1)

@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
# Multivariate data
In this document, we look at the general problem of dealing with highly multivariate data, which in later documents we will tackle using tools from the [*R* `tidyverse`](02-the-r-tidyverse.md), and techniques broadly categorised as [dimensional reduction](03-dimensional-reduction.md), [classification](04-classification-and-clustering.md), and (next week) statistical modelling.
labs/multivariate-analysis/01-multivariate-analysis-the-problem.md (+1 -1)

@@ -1,4 +1,4 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
# Multivariate data
In this document, we look at the general problem of dealing with highly multivariate data, which in later documents we will tackle using tools from the [*R* `tidyverse`](02-the-r-tidyverse.md), and techniques broadly categorised as [dimensional reduction](03-dimensional-reduction.md), [classification](04-classification-and-clustering.md), and (next week) statistical modelling.
labs/multivariate-analysis/04-classification-and-clustering.Rmd

Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.

@@ -31,6 +33,7 @@ Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/)
It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.

So here is how we accomplish this in R.
+
```{r}
km <- kmeans(sfd.d, 7)
sfd$km7 <- as.factor(km$cluster)
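Given the note above that k-means is non-deterministic, a re-run of this chunk can produce a different clustering. A minimal sketch of one way to make the result repeatable, assuming `sfd.d` is the same data frame used in the chunk above (`set.seed` and the `nstart` argument are standard base R, not part of the lab code):

```r
# Fix the RNG seed and keep the best of 25 random starts so that the
# cluster assignment is the same on every run
set.seed(42)                                   # any fixed seed will do
km <- kmeans(sfd.d, centers = 7, nstart = 25)  # best of 25 initialisations
sfd$km7 <- as.factor(km$cluster)               # attach labels as a factor
```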
@@ -60,6 +63,7 @@ The algorithm in this case looks something like
This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction, repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data, where cluster detection is known as *community detection* (more on that next week).

In *R*, the necessary functions are provided by the `hclust` function
+
```{R}
hc <- hclust(dist(sfd.d))
plot(hc)
@@ -104,4 +108,4 @@ Although geodemographics is a very visible example of cluster-based classification
Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine learning.

-OK... on to [the assignment](05-assignment-multivariate-analysis.md).
+OK... on to [statistical modelling](05-statistical-models.md).
labs/multivariate-analysis/04-classification-and-clustering.md

Whereas dimensional reduction methods focus on the variables in a dataset, clustering methods focus on the observations and the differences and similarities between them. The idea of clustering analysis is to break the dataset into clusters or groups of observations that are similar to one another and different from others in the data.

@@ -31,14 +33,15 @@ Here's an [illustration of this working](https://kkevsterrr.github.io/K-Means/)
It's important to realise that k-means clustering is non-deterministic, as the choice of initial cluster centres is often random, and can affect the final assignment arrived at.

So here is how we accomplish this in R.
+
```{r}
km <- kmeans(sfd.d, 7)
sfd$km7 <- as.factor(km$cluster)

tmap_mode('view')
tm_shape(sfd) +
  tm_polygons(col = 'km7') +
-  tm_legend(legend.outside = TRUE)
+  tm_legend(legend.outside = T)
```

The `kmeans` function does the work, and requires that we decide in advance how many clusters we want (I picked 7 just because... well... SEVEN). We can retrieve the resulting cluster assignments from the output `km` as `km$cluster`, which we convert to a `factor`. The numerical cluster number is meaningless, so the cluster label is properly speaking a factor, and designating it as such will allow `tmap` and other packages to handle it intelligently. We can then add it to the spatial data and map it like any other variable.
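Nothing in the lab fixes 7 as the right number of clusters. A common rough check, not part of the lab code, is an 'elbow' plot: run `kmeans` over a range of k and plot the total within-cluster sum of squares, looking for the k where the curve stops dropping steeply. A minimal sketch, again assuming `sfd.d` is the data frame used above:

```r
# Total within-cluster sum of squares for k = 1..12
set.seed(42)
wss <- sapply(1:12, function(k) kmeans(sfd.d, centers = k, nstart = 25)$tot.withinss)
plot(1:12, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```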
@@ -60,6 +63,7 @@ The algorithm in this case looks something like
This approach is 'agglomerative' because we start with individual observations. It is possible to proceed in the other direction, repeatedly subdividing the dataset into subsets until we get to individual cases, or perhaps until some measure of the cluster quality tells us we can't improve the solution any further. This method is very often used with network data, where cluster detection is known as *community detection* (more on that next week).

In *R*, the necessary functions are provided by the `hclust` function
+
```{R}
hc <- hclust(dist(sfd.d))
plot(hc)

@@ -70,17 +74,17 @@ Blimey! What the heck is that thing? As the title says it is a *cluster dendrogram*
As you can see, even for this relatively small dataset of only 189 observations, the dendrogram is not easy to read. Again, interactive visualisation methods can be used to help with this. However, another option is to 'cut the dendrogram', specifying either the height at which to cut it, or the number of clusters desired. In this case, it looks like 6 is not a bad option, so...
It's good to see that there are clear similarities between this output and the k-means one (at least there were the first time I ran the analysis!)
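The cutting step itself is elided by the diff at this point. In base R it is done with `cutree`, which takes either a number of clusters `k` or a height `h`. A minimal sketch using the 6 clusters suggested above (the column name `hc6` is mine; note the lab's later code uses one called `hc5`):

```r
# Cut the dendrogram into 6 clusters and attach the memberships as a factor
hc <- hclust(dist(sfd.d))
sfd$hc6 <- as.factor(cutree(hc, k = 6))
# cutree(hc, h = 10) would instead cut at height 10 on the dendrogram axis
```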
-As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information. Other clustering methods are also available. A recently popular one has been the DBSCAN family of methods ([here is an R package](https://github.com/mhahsler/dbscan)).
+As with k-means, there are more details around all of this. Different approaches to calculating distances can be chosen (see `?dist`) and various options for the exact algorithm for merging clusters are available by setting the `method` option in the `hclust` function. The function help is the place to look for more information.
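For instance, here is a sketch of an alternative set-up, using option values documented in `?dist` and `?hclust` rather than choices made anywhere in the lab:

```r
# Manhattan (city-block) distances merged with Ward's criterion
hc2 <- hclust(dist(sfd.d, method = "manhattan"), method = "ward.D2")
plot(hc2)   # compare with the dendrogram from the default settings
```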

Once clusters have been assigned, we can do further analysis comparing characteristics of different clusters. For example
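The diff cuts the lab's own example out at this point. One plausible sketch of the kind of comparison meant, assuming cluster memberships are stored in a column like the `hc5` used below and that `sfd` is an `sf` object:

```r
library(dplyr)
library(sf)

# Mean of every numeric attribute within each cluster
sfd %>%
  st_drop_geometry() %>%   # set the geometry aside so only attributes remain
  group_by(hc5) %>%
  summarise(across(where(is.numeric), mean))
```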
Or we can aggregate the clusters into single areas and assign them values based on the underlying data of all the member units:

```{r}
-sfd.c <- sfd %>%
-  group_by(hc5) %>% # group_by is how you do a 'dissolve' with sf data
-  summarise_if(is.numeric, mean) # this is how you apply a function to combine results
+sfd.c <- aggregate(sfd, by = list(sfd$hc5), mean)
plot(sfd.c, pal = RColorBrewer::brewer.pal(7, "Reds"))
```
@@ -106,4 +108,4 @@ Although geodemographics is a very visible example of cluster-based classification
Classification and clustering is an enormous topic area with numerous different methods available, many of them now falling under the rubric of machine learning.

-OK... on to [the assignment](05-assignment-multivariate-analysis.md).
+OK... on to [statistical modelling](05-statistical-models.md).
labs/multivariate-analysis/05-assignment-multivariate-analysis.md

+I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
+
+Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever; in other words, work with the multivariate, multidimensional aspect of the data).
+
+There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
+
+When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!
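As a sketch of that preparation step (the column names `pop_total` and `age_*` are invented for illustration; substitute the census variables you actually keep):

```r
library(dplyr)
library(sf)

welly <- st_read("welly.gpkg")
welly.sel <- welly %>%
  select(pop_total, starts_with("age_")) %>%             # hypothetical names
  mutate(across(starts_with("age_"), ~ .x / pop_total))  # counts to proportions
```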
+
+Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
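A minimal sketch of the principal components step, assuming `welly.sel` is the prepared `sf` object from the previous sketch and all its remaining attribute columns are numeric:

```r
# PCA on the attribute columns (scaled), then map the first component
pca <- prcomp(sf::st_drop_geometry(welly.sel), scale. = TRUE)
welly.sel$PC1 <- pca$x[, 1]
plot(welly.sel["PC1"])   # quick choropleth of the PC1 scores
```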
+
+Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
+
+Prepare your report in R Markdown and run it to produce a final output PDF or Word document (I prefer it if you can convert Word to PDF format) for submission (this means I will see your code as well as the outputs!). Avoid any outputs that are just long lists of data, as these are not very informative (you can do this by prefixing the code chunk that produces the output with `` ```{r, results = FALSE} ``).
+
+Submit your report to the dropbox provided on Blackboard by **24 May**.
labs/multivariate-analysis/05-assignment-multivariate-analysis.md (+5 -5)

@@ -1,14 +1,14 @@
-#### GISC 422 T1 2021
+#### GISC 422 T2 2023
# Assignment 4 Geodemographics in Wellington
-I have assembled some demographic data for Wellington from the 2018 census at the Statistical Area 1 level in a file called `welly.gpkg`, which you should find in the folder with this week's materials. The data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).
+I have assembled some [demographic data for Wellington](welly.gpkg) from the 2018 census at the Statistical Area 1 level (the data were obtained [here](https://datafinder.stats.govt.nz/layer/104612-2018-census-individual-part-1-total-new-zealand-by-statistical-area-1/)). Descriptions of the variables are in [this table](sa1-2018-census-individual-part-1-total-nz-lookup-table.csv).

-Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever; in other words, work with the multivariate, multidimensional aspect of the data).
+Use these data to conduct multivariate data exploration using any of the approaches discussed here. Think of it as producing a descriptive report of the social geography of Wellington focusing on some aspect of interest (but **not** a single dimension like ethnicity, race, family structure, or whatever; in other words, work with the multivariate, multidimensional aspect of the data).

-There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!
+There are lots more variables than you need, so you should reduce the data down to a selection (use `dplyr::select` for this). Think about which variables you want to retain before you start!

When you have a reduced set of variables to work with (but not too reduced... the idea is to demonstrate handling high-dimensional data), then you should also standardise the variables in some way so that they are all scaled to a similar numerical range. You can do this with `mutate` functions. You will probably need to keep the total population column for the standardisation!

-Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.
+Once that data preparation is completed, then run principal components analysis and clustering analysis (either method) to produce maps that shed some light on the demographic structure of the city as portrayed in the 2018 Census.

Include these maps in a report that shows clear evidence of having explored the data using any tools we have seen this week (feel free to include others from earlier weeks if they help!)
0 commit comments