# Write your own R functions, part 2 {#functions-part2}
```{r include = FALSE}
source("common.R")
```
<!--Original content: https://stat545.com/block011_write-your-own-function-02.html-->
## Where were we? Where are we going?
In [part 1](#functions-part1) we wrote our first R function to compute the difference between the max and min of a numeric vector. We checked the validity of the function's only argument and, informally, we verified that it worked pretty well.
In this part, we generalize this function, learn more technical details about R functions, and set default values for some arguments.
## Load the Gapminder data
As usual, load gapminder.
```{r start_func2}
library(gapminder)
```
## Restore our max minus min function
Let's keep our previous function around as a baseline.
```{r}
mmm <- function(x) {
  stopifnot(is.numeric(x))
  max(x) - min(x)
}
```
## Generalize our function to other quantiles
The max and the min are special cases of a __quantile__. Here are other special cases you may have heard of:
* median = 0.5 quantile
* 1st quartile = 0.25 quantile
* 3rd quartile = 0.75 quantile
If you're familiar with [box plots][wiki-boxplot], the rectangle typically runs from the 1st quartile to the 3rd quartile, with a line at the median.
If $q$ is the $p$-th quantile of a set of $n$ observations, what does that mean? Approximately $pn$ of the observations are less than $q$ and $(1 - p)n$ are greater than $q$. Yeah, you need to worry about rounding to an integer and less/greater than or equal to, but these details aren't critical here.
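
To make that definition concrete, here is a minimal sketch on a made-up vector (`toy_x` and `p` are illustrative names, not part of the Gapminder workflow). It computes a quantile and then counts the share of observations on either side of it, which should land close to $p$ and $1 - p$.

```{r}
# made-up data, purely for illustration
toy_x <- 1:100
p <- 0.25

# the p-th quantile, using quantile()'s default algorithm
q <- quantile(toy_x, probs = p)
q

# share of observations strictly below and strictly above q
mean(toy_x < q)
mean(toy_x > q)
```

With other data or other values of `p`, expect the two shares to be only approximately $p$ and $1 - p$, for the rounding reasons mentioned above.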
Let's generalize our function to take the difference between any two quantiles. We can still consider the max and min, if we like, but we're not limited to that.
## Get something that works, again
The eventual inputs to our new function will be the data `x` and two probabilities.
First, play around with the `quantile()` function. Convince yourself you know how to use it, for example, by cross-checking your results with other built-in functions.
```{r}
quantile(gapminder$lifeExp)
quantile(gapminder$lifeExp, probs = 0.5)
median(gapminder$lifeExp)
quantile(gapminder$lifeExp, probs = c(0.25, 0.75))
boxplot(gapminder$lifeExp, plot = FALSE)$stats
```
Now write a code snippet that takes the difference between two quantiles.
```{r}
the_probs <- c(0.25, 0.75)
the_quantiles <- quantile(gapminder$lifeExp, probs = the_probs)
max(the_quantiles) - min(the_quantiles)
```
## Turn the working interactive code into a function, again
I'll use `qdiff` as the base of our function's name. I copy the overall structure from our previous "max minus min" work but replace the guts of the function with the more general code we just developed.
```{r}
qdiff1 <- function(x, probs) {
  stopifnot(is.numeric(x))
  the_quantiles <- quantile(x = x, probs = probs)
  max(the_quantiles) - min(the_quantiles)
}
qdiff1(gapminder$lifeExp, probs = c(0.25, 0.75))
IQR(gapminder$lifeExp) # hey, we've reinvented IQR
qdiff1(gapminder$lifeExp, probs = c(0, 1))
mmm(gapminder$lifeExp)
```
Again we do some informal tests against familiar results and external implementations.
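
If you would rather not compare the printed numbers by eye, one option is to let R do the comparison with `all.equal()`. The sketch below is only an illustration: `toy` is a made-up vector and `qdiff_demo` is a hypothetical copy of the function above, repeated so the chunk stands on its own.

```{r}
# made-up numeric vector, only for illustration
toy <- c(2.5, 7, 1, 9.5, 4, 6, 3)

# same body as qdiff1, under a hypothetical name
qdiff_demo <- function(x, probs) {
  stopifnot(is.numeric(x))
  the_quantiles <- quantile(x = x, probs = probs)
  max(the_quantiles) - min(the_quantiles)
}

# compare against the built-in IQR() and against max minus min
all.equal(qdiff_demo(toy, probs = c(0.25, 0.75)), IQR(toy))
all.equal(qdiff_demo(toy, probs = c(0, 1)), max(toy) - min(toy))
```

This is still informal testing; it just replaces eyeballing with a check that returns `TRUE` when the two results agree.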
## Argument names: freedom and conventions
I want you to understand the importance of argument names.
I can name my arguments almost anything I like. Proof:
```{r}
qdiff2 <- function(zeus, hera) {
  stopifnot(is.numeric(zeus))
  the_quantiles <- quantile(x = zeus, probs = hera)
  max(the_quantiles) - min(the_quantiles)
}
qdiff2(zeus = gapminder$lifeExp, hera = 0:1)
```
While I can name my arguments after Greek gods, it's usually a bad idea. Take all opportunities to make things more self-explanatory via meaningful names.
If you are going to pass the arguments of your function as arguments of a built-in function, consider copying the argument names. Unless you have a good reason to do your own thing (some argument names are bad!), be consistent with the existing function. Again, the reason is to reduce your cognitive load. This is what I've been doing all along and now you know why:
```{r}
qdiff1
```
We took this detour so you could see there is no *structural* relationship between my arguments (`x` and `probs`) and those of `quantile()` (also `x` and `probs`). The similarity or equivalence of the names __accomplishes nothing__ as far as R is concerned; it is solely for the benefit of humans reading, writing, and using the code. Which is very important!
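
Argument names do matter to whoever *calls* the function, since that is how a named call lines values up with arguments. The following sketch, which uses a made-up vector `toy` and a hypothetical copy of our function called `qdiff_named`, shows that a positional call, a named call, and a named call with the arguments reordered all give the same answer.

```{r}
# hypothetical copy of our function, repeated so this chunk stands alone
qdiff_named <- function(x, probs) {
  stopifnot(is.numeric(x))
  the_quantiles <- quantile(x = x, probs = probs)
  max(the_quantiles) - min(the_quantiles)
}

# made-up numeric vector, only for illustration
toy <- c(2.5, 7, 1, 9.5, 4, 6, 3)

res_positional <- qdiff_named(toy, c(0.25, 0.75))
res_named      <- qdiff_named(x = toy, probs = c(0.25, 0.75))
res_reordered  <- qdiff_named(probs = c(0.25, 0.75), x = toy)

# all three calls agree
identical(res_positional, res_named)
identical(res_named, res_reordered)
```

Meaningful names make the named forms readable, which is one more reason to choose them with care.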
## What a function returns
By this point, I expect someone will have asked about the last line in my function's body. Look above for a reminder of the function's definition.
By default, a function returns the value of the last expression evaluated in its body. I am just letting that happen with the line `max(the_quantiles) - min(the_quantiles)`. However, there is an explicit function for this: `return()`. I could just as easily make this the last line of my function's body:
```{r eval = FALSE}
return(max(the_quantiles) - min(the_quantiles))
```
You absolutely must use `return()` if you want to return early based on some condition, i.e. before execution gets to the last line of the body. Otherwise, you can decide your own conventions about when you use `return()` and when you don't.
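
Here is a minimal sketch of an early return. The function name `qdiff_early` and the policy of handing back `NA` for an empty input are assumptions made up for this illustration, not something we have decided for our real function.

```{r}
qdiff_early <- function(x, probs) {
  stopifnot(is.numeric(x))
  # hypothetical policy: nothing to summarize, so leave right away
  if (length(x) == 0) {
    return(NA_real_)
  }
  # only reached when x has at least one element
  the_quantiles <- quantile(x = x, probs = probs)
  max(the_quantiles) - min(the_quantiles)
}

qdiff_early(numeric(0), probs = c(0, 1))
qdiff_early(c(1, 5, 10), probs = c(0, 1))
```

The first call exits at `return(NA_real_)` and never reaches `quantile()`; the second falls through to the last line, whose value is returned implicitly.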
## Default values: freedom to NOT specify the arguments
What happens if we call our function but neglect to specify the probabilities?
```{r error = TRUE}
qdiff1(gapminder$lifeExp)
```
Oops! At the moment, this causes a fatal error. It can be nice to provide some reasonable default values for certain arguments. In our case, it would be crazy to specify a default value for the primary input `x`, but very kind to specify a default for `probs`.
We started by focusing on the max and the min, so I think those make reasonable defaults. Here's how to specify that in a function definition.
```{r}
qdiff3 <- function(x, probs = c(0, 1)) {
  stopifnot(is.numeric(x))
  the_quantiles <- quantile(x, probs)
  max(the_quantiles) - min(the_quantiles)
}
```
Again we check how the function works, in old examples and new, specifying the `probs` argument and not.
```{r}
qdiff3(gapminder$lifeExp)
mmm(gapminder$lifeExp)
qdiff3(gapminder$lifeExp, c(0.1, 0.9))
```
## Check the validity of arguments, again
__Exercise:__ upgrade our argument validity checks in light of the new argument `probs`.
```{r}
## problems identified during class
## we're not checking that probs is numeric
## we're not checking that probs is length 2
## we're not checking that probs are in [0,1]
```
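
One possible way to address those three problems is sketched below. Treat it as a starting point rather than the official solution: the name `qdiff_checked` is hypothetical, and it leans on `stopifnot()` in the same way our earlier check on `x` does.

```{r}
qdiff_checked <- function(x, probs = c(0, 1)) {
  stopifnot(is.numeric(x))
  stopifnot(is.numeric(probs))             # probs must be numeric
  stopifnot(length(probs) == 2)            # exactly two probabilities
  stopifnot(all(probs >= 0 & probs <= 1))  # each one inside [0, 1]
  the_quantiles <- quantile(x, probs)
  max(the_quantiles) - min(the_quantiles)
}

# valid input still works
qdiff_checked(c(1, 5, 10))

# invalid probs now fail our own checks; capture the message instead of halting
tryCatch(
  qdiff_checked(c(1, 5, 10), probs = c(0, 2)),
  error = function(e) conditionMessage(e)
)
```

The error messages that `stopifnot()` produces are terse, so you may prefer `if` plus `stop()` when you want friendlier wording.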
## Wrap-up and what's next?
Here's the function we've written so far:
```{r end_func2}
qdiff3
```
What we've accomplished:
* We've generalized our first function to take a difference between arbitrary quantiles.
* We've specified default values for the probabilities that set the quantiles.
Where to next? In [part 3](#functions-part3) we tackle `NA`s, the special `...` argument, and formal unit testing.
## Resources
Hadley Wickham's book [Advanced R][adv-r] [-@wickham2015a]:
* Section on [function arguments][adv-r-fxn-args]
* Section on [return values][adv-r-return-values]
```{r links, child="links.md"}
```