-
Notifications
You must be signed in to change notification settings - Fork 7
/
04-pkgs.Rmd
220 lines (169 loc) · 14 KB
/
04-pkgs.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
# R packages {#pkgs}
## What are packages?
R has over 15,000 packages published on the official 'CRAN' site and many more published on code sharing sites such as GitHub.
Packages are effectively plugins for R that extend it in many ways.
Packages are useful because they enhance the range of things you can do with R, providing additional functions, data and documentation that build on the core (known as 'base') R packages.
They range from general-purpose packages, such as `tidyverse` and `sf`, to domain-specific packages, such as `stats19`.
This chapter demonstrates the package lifecycle with reference `stats19` and provides a taster of R's visualisation capabilities for general purpose packages `ggplot2` and `dplyr`.
The `stats19` package is particularly relevant for reproducible road safety research: its purpose is to download and clean road traffic collision data from the UK's Department for Transport.
Domain-specific packages, such as `stats19`, are often written by subject-matter experts, providing tried and tested solutions within a particular specialism. Packages are reviewed by code experts prior to being made available via CRAN.
Regardless of whichever packages you install and use, you will take the following steps:^[
If for whatever reason you want to uninstall a package you can uninstall it in a fifth stage with commands such as `remove.packages("stats19")`.
]
1. installing the package;
2. loading the package;
3. using the package; and
4. updating the package.
Of these, the third stage takes by far the most amount of time.
Stages 1, 2 and 4 are equally important, however; you cannot use a package unless it has been properly installed, loaded and, to get the best performance out of the latest version, updated when new versions are released.
We will learn each of these stages of the package lifecycle with the `stats19` package.
## The stats19 R package
Like many packages, `stats19` was developed to meet a real world need.
STATS19 data is provided as a free and open resource by the Department for Transport, encouraging evidence-based and accountable road safety research and policy interventions.
However, researchers at the University of Leeds found that repeatedly downloading and formatting open STATS19 data was time-consuming, taking valuable resources away from more valuable (and fun) aspects of the research process.
Significantly, manually recoding the data was error prone.
By packaging code, we found that we could solve the problem in a free, open and reproducible way for everyone [@lovelace_stats19_2019].
By abstracting the process to its fundamental steps (download, read, format), the `stats19` package makes it easy to get the data into appropriate formats (of classes `tbl`, `data.frame` and `sf`), ready for further processing and analysis.
The package built upon previous work [@lovelace_who_2016], with several important improvements, including the conversion of crash data into geographic data in a `sf` data frame for geographic research [e.g. @austin_use_1997].
It enables creation of geographic representations of crash data, geo-referenced to the correct coordinate reference system, in a single function called `format_sf()`.
Part-funded by the RAC Foundation, the package should be of use to academic researchers and professional road safety data analysts working at local authority and national levels in the UK.
The following sections demonstrate how to install, load and use packages with reference to `stats19`. This information can be applied in relation to any package.
## Installing packages
The `stats19` package is available on CRAN.
This means that it has a web page on the CRAN website at [cran.r-project.org](https://cran.r-project.org) with useful information, including who developed the package, what the latest version is, and when it was last updated (see [cran.r-project.org/package=stats19](https://cran.r-project.org/package=stats19)).
More importantly, being 'on CRAN' (which technically means 'available on the [Comprehensive R Archive Network](https://cran.r-project.org/)') means that it can be installed with the command `install.packages()` as follows:^[
To install the development version, which may have new features or bug fixes that are not yet on CRAN, you can use the function`remotes::install_github("org/pkg")`.
The `stats19` package is hosted on the rOpenSci organisation at [github.com/ropensci/stats19](https://github.com/ropensci/stats19), so you can install the development version with `remotes::install_github("ropensci/stats19")` (you must have the `remotes` package installed before that will work).
Note: it's usually safest to stick with the latest version on CRAN unless you know what you're doing.
]
```{r, eval=FALSE}
install.packages("stats19")
```
You might think that now that the package has been installed we can start using it, but that is not true. This is illustrated in the code below, which tries and fails to run the `find_file_name()` function from the `stats19` package to find the file containing STATS19 casualties data for the year 2019. Check that this function exists by running the following command `?find_file_name`:
```{r, error=TRUE}
find_file_name(years = 2019, type = "casualties")
```
## Loading packages
After you have installed a package the next step is to 'load' it.^[
Technically, this means that the *package namespace has been attached to the search path*, making their functions available from the global environment [@wickham_advanced_2014].
]
Load the `stats19` package, that was installed in the previous section, using the following code:
```{r}
library(stats19)
```
What happened?
Other than the message telling us about the package's datasets (most packages load silently, so do not worry if nothing happens when you load a package), the command above made the functions and datasets in the package available to us.
Now we can use functions from the package without an error message, as follows:
```{r}
find_file_name(years = 2019, type = "casualties")
```
This raises the question: how do you know which functions are available in a particular package?
You can find out using the autocompletion, i.e. by pressing `Tab` after typing the package's name, followed by two colons.
Try typing `stats19::` and then hitting `Tab`, for example.
You should see a load of function names appear, which you view by pressing `Up` and `Down` on your keyboard.
The final thing to say about packages is that they can be used without being loaded by typing `package::function()`. We used this before in Section 2.8, where we imported csv data using the `readr` package via `readr::read_csv()`.
So `stats19::find_file_name(years = 2019, type = "casualties")` works even if the package isn't loaded.
You can test this by running the `sf_extSoftVersion()` command from the `sf` package. This command reports the versions of key geographic libraries installed on your system. In the first attempt below, the command fails and reports an error. In the second and third attempts, utilising `::` and `library`, you can see that the command succeeds:
```{r, error=TRUE}
# try running a function without loading the sf package first
sf_extSoftVersion()
```
```{r}
# run a function from a package's namespace without loading it but using ::
sf::sf_extSoftVersion()
```
```{r}
# fun a function call after loading the package (the most common way)
library(sf)
sf_extSoftVersion()
```
As a bonus, try running the command `sf::sf_extSoftVersion` without the brackets `()`.
What does that tell you about the package?
## Using packages
After loading a package, as described in the previous section, you can start using its functions.
In the `stats19` package that means the following command `get_stats19()` will now work:
```{r, message=FALSE}
crashes_2019 = get_stats19(year = 2019, type = "accidents")
nrow(crashes_2019)
```
This command demonstrates the value of packages.
It would have been possible to get the same dataset by manually downloading and cleaning the file from the [STATS19 website on data.gov.uk](https://data.gov.uk/dataset/cb7ae6f0-4be6-4935-9277-47e5ce24a11f/road-safety-data).
However, by using the package, the process has been achieved much faster and with fewer lines of code than would have been possible using general-purpose base R functions.
The result of the `nrow()` function call shows that we have downloaded a decent amount of data representing over 100k road traffic casualty incidents across Great Britain in 2019.
We will use other functions from the package in subsequent sections of this guide.
If you would like to learn more about `stats19` and how it can be used for road safety research, check out its vignettes.
The `stats19` vignette, for example, should appear in the **Help** panel in the bottom right panel in RStudio after running the following command:
```{r, eval=FALSE}
vignette("stats19")
```
## Updating packages
Packages can be updated with the command `update.package()` or in 'Tools > Check for Package Updates' in RStudio.
You only need to install a package once but packages can be updated many times. It is important to update packages regularly because updates will offer bug-fixes and other improvements.
To update just one package, you can give the function a package name, e.g.:
```{r, eval=FALSE}
update.packages(oldPkgs = "stats19")
```
Completing the following short exercises will ensure you've got a good understanding of packages and package versions.
1. Take a look in the 'Packages' tab in the 'Files' pane in RStudio (bottom right by default).
2. What version of the `stats19` package is installed on your computer?
3. What happens the second time you run `update.packages()`. Why?
## ggplot2
`ggplot2` is a generic plotting package that is part of the ['tidyverse'](https://www.tidyverse.org/) meta-package. The `tidyverse` is an 'Opinionated collection of R packages designed for data science'.
<!-- All packages in the tidyverse "share an underlying design philosophy, grammar, and data structures". -->
`ggplot2` is flexible, popular and has dozens of add-on packages which build on it, such as `gganimate`.
To plot non-spatial data, it works as follows (the command should generate the image shown in Figure \@ref(fig:crashes2019time), showing a bar chart of the number of crashes over time):
```{r crashes2019time, message=FALSE, fig.cap="A simple ggplot2 graph."}
library(ggplot2)
ggplot(crashes_2019) + geom_bar(aes(date), width = 1)
```
A key feature of the `ggplot2` package is the function `ggplot2()`. This function initiates the creation of a plot by taking a data object as its main argument followed by one or more ‘geoms’ that represent layers (in this case a bar chart represented by the function `geom_bar()`).
Another distinctive feature of `ggplot2()` is the use of `+` operator to add layers.
The package is excellent for generating publication quality figures.
Starting from a basic idea, you can make incremental tweaks to a plot to get the output you want.
Building on the figure above, we could make the bin width (width of the bars) wider, add colour depending on the crash severity and use count (Figure 4.2) or proportion (Figure 4.3) as our y axis, for example, as follows:
```{r ggpropcrashes, out.width="49%", fig.show='hold', fig.cap="Demonstration of fill and position arguments in ggplot2."}
ggplot(crashes_2019) + geom_bar(aes(date, fill = accident_severity), width = 1)
ggplot(crashes_2019) +
geom_bar(aes(date, fill = accident_severity), width = 1, position = "fill") +
ylab("Proportion of crashes")
```
The package is huge and powerful, with support for a very wide range of plot types and themes, so it is worth taking time to read the documentation associated with the package, starting with the online [reference manual](https://ggplot2.tidyverse.org/reference/index.html) and heading towards the online version of the package's official [book](https://ggplot2-book.org/) [@wickham_ggplot2_2016].
As a final taught bit of `ggplot2` code in this section, create a facetted plot showing how the number of crashes per hour varies across the days of the week by typing the following into the Source Editor and running the chunk line-by-line (the meaning of the commands should become clear by the end of the next section):
```{r ggfacetr, fig.cap="A plot showing a facetted time series plot made with ggplot2.", warning=FALSE, message=FALSE}
library(tidyverse)
crashes_2019 %>%
mutate(hour = lubridate::hour(datetime)) %>%
mutate(day = lubridate::wday(date)) %>%
filter(!is.na(hour)) %>%
ggplot(aes(hour, fill = accident_severity)) +
geom_bar(width = 1.01) +
facet_wrap(~day)
```
**Exercises:**
1. Install a package that build on `ggplot2` that begins with with `gg`. Hint: enter `install.packages(gg)` and hit `Tab` when your cursor is between the `g` and the `)`.
2. Open a help page in the newly installed package with the `?package_name::function()` syntax.
3. Load the package.
4. **Bonus:** try using functionality from the new 'gg' package building on the example above to create plots like those shown below (**Hint:** the right plot below uses the economist theme from the `ggthemes` package; try other themes).
```{r gg-extend, echo=FALSE, message=FALSE, eval=FALSE}
library(ggplot2)
# install.packages("ggthemes")
g1 = ggplot(crashes_2019) + geom_point(aes(x = casualty_type, y = casualty_age))
g2 = ggplot(crashes) + geom_point(aes(x = casualty_type, y = casualty_age)) +
ggthemes::theme_economist()
g3 = cowplot::plot_grid(g1, g2)
ggsave(filename = "figures/ggtheme-plot.png", width = 8, height = 2, dpi = 80)
```
```{r gg2, echo=FALSE, out.width="80%", fig.align="center"}
library(ggplot2)
knitr::include_graphics("figures/ggtheme-plot.png")
```
## dplyr
Another useful package in the tidyverse is `dplyr`, which stands for 'data pliers', which provides a handy syntax for data manipulation.
`dplyr` has many functions for manipulating data frames and using the pipe operator ` %>% `.
The pipe operator puts the output of one command into the first argument of the next, as shown below (**Note:** the results are the same):
```{r}
library(dplyr)
class(crashes)
crashes %>% class()
```
We will learn more about this package and its other functions in Section \@ref(data).