# Manipulating data {#data}
This section is an introduction to manipulating datasets using the `dplyr` package.
As outlined in the previous section, `dplyr` and `ggplot2` are part of the `tidyverse`, which aims to provide a user-friendly framework for data science [@grolemund_data_2016].
Experience of teaching R over the past few years suggests that many people find it easier to get going with data driven research if they learn the 'tidy' workflow presented in this section.
However, if you do not like this style of R code or you are simply curious, we encourage you to try alternative approaches for achieving similar results using base R [@rcoreteam_language_2020]^[
Run the command `help.start()` to see resources introducing base R, and in particular [Chapter 6 on lists and data frames](https://cran.r-project.org/doc/manuals/r-release/R-intro.html#Lists-and-data-frames) in [An Introduction to R](https://cran.r-project.org/doc/manuals/r-release/R-intro.pdf) for an introduction to data manipulation with base R.
]
, the `data.table` R package [@R-data.table] or other languages such as [Python](https://www.python.org/) or [Julia](https://julialang.org/).
If you just want to get going with processing data, the `tidyverse` is a solid and popular starting point.
<!-- Todo: add new part here? -->
Before diving into the `tidyverse`, it is worth recapping where we have got to so far, as we have covered a lot of ground.
Section \@ref(basics) introduced R's basic syntax; Section \@ref(rstudio) showed how to use the Source Editor and other features of RStudio to support data science; and Section \@ref(pkgs) introduced the concept and practicalities of R packages, with reference to `stats19`, `ggplot2` and `dplyr`.
In this section, we will start with a blank slate.
In Section \@ref(basics) we learned that in R having a 'clear desk' means an *empty global environment*.
This can be achieved by running the following command, which removes the `list()` of all objects returned by the function `ls()`:
```{r}
rm(list = ls())
```
## tibbles
Although the data processing techniques in R are capable of handling large datasets, such as the `crashes_2019` object that we created in the previous section, representing 100k+ casualties, it makes sense to start small.
Let's start by re-creating the `crashes` dataset from Section \@ref(basics), but this time using the `tidyverse` `tibble()` function. This is the `tidyverse` equivalent of base R's `data.frame`.
`tibble` objects can be created, after loading the `tidyverse`, as follows:
```{r, message=FALSE}
library(tidyverse)
crashes = tibble(
casualty_type = c("pedestrian", "cyclist", "cat"),
casualty_age = seq(from = 20, to = 60, by = 20),
vehicle_type = c("car", "bus", "tank"),
dark = c(TRUE, FALSE, TRUE)
)
```
In the previous code chunk, we passed four vector objects as *named arguments* to the `tibble` function, resulting in columns such as `casualty_type`.
A `tibble` is just a fancy way of representing `data.frame` objects, preferred by `tidyverse` users and optimised for data science.
It has a few sensible defaults and advantages compared with the `data.frame`, one of which can be seen by printing a `tibble`:
```{r}
class(crashes)
crashes
```
Note the `<chr>`, `<dbl>` or `<lgl>` text below each column, providing a quick indication of the class of each variable - this is not provided when using `data.frame`.
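If you already have a base R `data.frame`, it can be converted to a `tibble` with the `as_tibble()` function (a minimal sketch, assuming the `tidyverse` has been loaded as above; `crashes_df` is an illustrative name, not an object created elsewhere in this book):

```r
# A base R data.frame with two of the columns used above
crashes_df = data.frame(
  casualty_type = c("pedestrian", "cyclist", "cat"),
  casualty_age = c(20, 40, 60)
)
crashes_tbl = as_tibble(crashes_df) # convert to a tibble
class(crashes_tbl) # includes "tbl_df" as well as "data.frame"
```

Nothing is lost in the conversion: a `tibble` still *is* a `data.frame`, so functions expecting a `data.frame` will continue to work.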
## filter() and select() rows and columns
In the previous section, we briefly introduced the package `dplyr`, which provides an alternative to base R for manipulating objects. `dplyr` provides different, and some would argue simpler, approaches for subsetting rows and columns than base R.
`dplyr` operations for subsetting rows (with the function `filter()`) and columns (with the function `select()`) are demonstrated below. Here we can also see the pipe operator `%>%`, which takes the dataset on its left and passes it to the function on its right.
```{r}
crashes %>% filter(casualty_age > 50) # filters rows
crashes %>% select(casualty_type) # select just one column
```
It should be clear what happened: `filter()` returns only rows that match the criteria in the function call, only observations with a `casualty_age` greater than 50 in this case.
Likewise, `select()` returns data objects that include only columns named inside the function call, `casualty_type` in this case.
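For comparison, the same results can be obtained with base R's square-bracket subsetting (a sketch assuming the `crashes` object created above; note the comma, which separates the row selection from the column selection):

```r
crashes[crashes$casualty_age > 50, ] # base R equivalent of filter()
crashes["casualty_type"]             # base R equivalent of select()
```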
To gain a greater understanding of the functions, type and run the following commands, which also illustrate how the `%>%` can be used more than once to manipulate data (more on this soon):
```{r}
crashes_darkness = crashes %>% filter(dark)
crashes_a = crashes %>% select(contains("a"))
crashes_darkness_a = crashes %>%
filter(dark) %>%
select(contains("a"))
```
Can you guess what the dimensions of the resulting objects will be?
Write down your guesses for the number of rows and number of columns that the new objects, `crashes_darkness` to `crashes_darkness_a`, have before running the following commands to find out. This also demonstrates the handy function `dim()`, short for dimension (results not shown):^[
Note that the number of rows is reported before the number of columns.
This is a feature of R: rows are also specified first when subsetting with square brackets, in commands such as `crashes[1, 2:3]`.
]
```{r, eval=FALSE}
dim(crashes)
dim(crashes_darkness)
?contains # get help on contains() to help guess the output of the next line
dim(crashes_a)
dim(crashes_darkness_a)
```
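As the footnote above notes, rows come before columns in square-bracket subsetting. A quick illustration, assuming the `crashes` object created earlier in this section:

```r
crashes[1, 2:3]      # first row, columns 2 to 3 (casualty_age and vehicle_type)
dim(crashes[1, 2:3]) # one row, two columns
```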
Look at the help pages associated with `filter()`, `select()` and the related function `slice()` as follows, and try running the examples at the bottom of each help page to gain a greater understanding (note that you can also use the `package::function` notation to get help on functions):
```{r, eval=FALSE}
?dplyr::filter
?dplyr::select
?dplyr::slice
```
## Ordering and selecting the 'top n'
Other useful pipe-friendly functions are `arrange()` and `top_n()`. `arrange()` sorts data; optional arguments within the `arrange()` call define the sort order. `top_n()` simply selects the top 'n' rows of your data frame.
We can use these functions to arrange datasets and take the top 'n' values, as follows:
```{r}
crashes %>%
arrange(vehicle_type)
crashes %>%
top_n(n = 1, wt = casualty_age)
```
<!-- ## Long and wide data -->
## Summarise
A powerful two-function combination is `group_by()` and `summarise()`.
Used together, they can provide *grouped summaries* of datasets.
In the example below, we find the mean age of casualties in dark and light conditions.
```{r}
crashes %>%
group_by(dark) %>%
summarise(mean_age = mean(casualty_age))
```
The example above shows a powerful feature of these pipelines. Many operations can be 'chained' together, whilst keeping readability with subsequent commands stacked below earlier operations. The combination of `group_by()` and `summarise()` can be very useful in preparing data for visualisation with a `ggplot2` function.
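To illustrate, the grouped summary above can be piped straight into a plot (a sketch assuming the `tidyverse` is loaded as above; the aesthetic choices are illustrative):

```r
crashes %>%
  group_by(dark) %>%
  summarise(mean_age = mean(casualty_age)) %>%
  ggplot(aes(x = dark, y = mean_age)) +
  geom_col() # bar height shows the mean casualty age in each condition
```

Note the switch from `%>%` to `+` once the pipeline enters `ggplot2`.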
Another useful feature of the `tidyverse` from a user perspective is the autocompletion of column names mid pipe.
If you have not noticed this already, you can test it by typing the following, putting your cursor just before the `)` and pressing `Tab`:
```{r, eval=FALSE}
crashes %>% select(ca) # press Tab when your cursor is just after the a
```
You should see `casualty_age` and `casualty_type` pop up as options that can be selected by pressing `Up` and `Down`.
This may not seem like much, but when analysing large datasets with dozens of variables, it can be a godsend.
Rather than providing a comprehensive introduction to the `tidyverse` suite of packages, this section should have offered enough to get started with using it for road safety data analysis.
For further information, check out up-to-date online courses from respected organisations like [Data Carpentry](https://datacarpentry.org/R-ecology-lesson/index.html) and free online [books](https://bookdown.org/) such as [R for Data Science](https://r4ds.had.co.nz/) [@grolemund_data_2016].
## Tidyverse exercises
1. Use `dplyr` to filter rows in which `casualty_age` is less than 18, and then 28.
2. Use the `arrange` function to sort the `crashes` object in descending order of age (**Hint:** see the `?arrange` help page).
3. Read the help page of `dplyr::mutate()`. What does the function do?
4. Use the mutate function to create a new variable, `birth_year`, in the `crashes` data.frame which is defined as the current year minus their age.
5. **Bonus:** Use the ` %>% ` operator to filter the output from the previous exercise so that only observations with `birth_year` after 1969 are returned.
```{r dplyr, eval=FALSE, echo=FALSE}
# answers
crashes %>%
arrange(desc(casualty_age))
crashes %>% filter(casualty_age < 18)
crashes %>% filter(casualty_age < 28)
crashes %>%
mutate(birth_year = 2019 - casualty_age) %>%
filter(birth_year > 1969)
```