forked from tjmahr/tjmisc
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
252 lines (188 loc) · 6.55 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-"
)
```
# tjmisc
[](https://travis-ci.org/tjmahr/tjmisc)
The goal of tjmisc is to gather miscellaneous helper functions, mostly for use
in [my dissertation](https://github.com/tjmahr/dissertation).
Apologies in advance. I think "misc" packages are kind of bad because packages
should be focused on specific problems: for example, my helper packages for
[working on polynomials](https://github.com/tjmahr/polypoly),
[printing numbers](https://github.com/tjmahr/printy) or
[tidying MCMC samples](https://github.com/tjmahr/tristan). Having modular code
snapping together like Lego blocks is better than a grab-bag of functions, it's
true, but using `library(helpers)` is much, much better than using
`source("helpers.R")`. So here we are... in the grab-bag.
## Installation
You can install the tjmisc from github with:
```{r, eval = FALSE}
# install.packages("devtools")
devtools::install_github("tjmahr/tjmisc")
```
## Examples
### Sample groups of data
`sample_n_of()` is like dplyr's `sample_n()` but it samples groups.
```{r example}
library(dplyr, warn.conflicts = FALSE)
library(tjmisc)
set.seed(11022017)
data <- tibble::tibble(
day = 1:10 %>% rep(10) %>% sort(),
id = 1:10 %>% rep(10),
block = letters[1:5] %>% rep(10) %>% sort() %>% rep(2),
value = rnorm(100) %>% round(2))
# data from 3 days
sample_n_of(data, 3, day)
# data from 1 id
sample_n_of(data, 1, id)
# data from 2 block-id pairs
sample_n_of(data, 2, block, id)
```
### Tidy quantiles
`tidy_quantile()` returns a dataframe with quantiles for a given variable. I
like to use it to select values for plotting model predictions.
```{r}
iris %>%
tidy_quantile(Petal.Length)
iris %>%
group_by(Species) %>%
tidy_quantile(Petal.Length)
```
### Tidy correlations
`tidy_correlation()` calculates correlations between pairs of selected dataframe
columns. It accepts `dplyr::select()` selection semantics, and it respects
grouped dataframes.
```{r}
tidy_correlation(iris, -Species)
iris %>%
dplyr::group_by(Species) %>%
tidy_correlation(dplyr::starts_with("Petal"))
```
### Pairwise comparisons
`compare_pairs()` compares all pairs of values among levels of a categorical
variable. Hmmm, that sounds confusing. Here's an example. We compute the
difference in average score between each pair of workers.
```{r}
to_compare <- nlme::Machines %>%
group_by(Worker) %>%
summarise(avg_score = mean(score)) %>%
print()
to_compare %>%
compare_pairs(Worker, avg_score) %>%
rename(difference = value) %>%
mutate_if(is.numeric, round, 1)
```
### Et cetera
`ggpreview()` is like ggplot2's `ggsave()` but it saves an image to a temporary
file and then opens it in the system viewer. If you've ever found yourself in
a loop of saving a plot, leaving RStudio to doubleclick the file, sighing, going
back to RStudio, tweaking the height or width or plot theme, ever so slowly
spiraling in on your desired plot, then `ggpreview()` is for you.
`seq_along_rows()` saves a few keystrokes in for-loops that iterate over
dataframe rows.
```{r}
cars %>% head(5) %>% seq_along_rows()
cars %>% head(0) %>% seq_along_rows()
```
`is_same_as_last` and `replace_if_same_as_last()` are helpers for formatting
tables. I use them to replace repeating values in a text column with blanks.
```{r}
mtcars %>%
tibble::rownames_to_column("name") %>%
slice(1:10) %>%
select(cyl, name, mpg) %>%
arrange(cyl, mpg) %>%
mutate_at(c("cyl"), replace_if_same_as_last, "") %>%
knitr::kable()
```
`fct_add_counts()` adds counts to a factor's labels.
```{r}
# Create a factor with some random counts
set.seed(20190124)
random_iris <- iris %>%
dplyr::sample_n(250, replace = TRUE)
table(random_iris$Species)
# Updated factors
random_iris$Species %>% levels()
random_iris$Species %>% fct_add_counts() %>% levels()
```
You can tweak the format for the first label. I like to use this for plotting by
stating the unit next to the first count.
```{r}
random_iris$Species %>%
fct_add_counts(first_fmt = "{levels} ({counts} flowers)") %>%
levels()
```
## More involved demos
These are things that I would have used in the demo above but cut and moved
down here to keep that overview succinct.
### Comparing pairs of values over a posterior distribution
I wrote `compare_pairs()` to compute posterior differences in Bayesian models.
For the sake of example, let's fit a Bayesian model of average sepal length for
each species in `iris`. We could get these estimates more directly using the default dummy-coding of factors, but let's ignore that for now.
```{r, results = "hide"}
library(rstanarm)
m <- stan_glm(
Sepal.Length ~ Species - 1,
iris,
family = gaussian)
```
Now, we have a posterior distribution of species means.
```{r}
newdata <- data.frame(Species = unique(iris$Species))
p_means <- posterior_linpred(m, newdata = newdata) %>%
as.data.frame() %>%
tibble::as_tibble() %>%
setNames(newdata$Species) %>%
tibble::rowid_to_column("draw") %>%
tidyr::gather(species, mean, -draw) %>%
print()
```
For each posterior sample, we can compute pairwise differences of means with
`compare_means()`.
```{r pairs, fig.width = 4, fig.height = 2.5}
pair_diffs <- compare_pairs(p_means, species, mean) %>%
print()
library(ggplot2)
ggplot(pair_diffs) +
aes(x = pair, y = value) +
stat_summary(fun.data = median_hilow, geom = "linerange") +
stat_summary(fun.data = median_hilow, fun.args = list(conf.int = .8),
size = 2, geom = "linerange") +
stat_summary(fun.y = median, size = 5, shape = 3, geom = "point") +
labs(x = NULL, y = "Difference in posterior means") +
coord_flip()
```
...which should look like the effect ranges in the dummy-coded models.
```{r, results = "hide"}
m2 <- update(m, Sepal.Length ~ Species)
m3 <- update(m, Sepal.Length ~ Species,
data = iris %>% mutate(Species = forcats::fct_rev(Species)))
```
Give or take a few decimals of precision and give or take changes in signs
because of changes in who was subtracted from whom.
```{r}
# setosa verus others
m2 %>%
posterior_interval(regex_pars = "Species") %>%
round(2)
# virginica versus others
m3 %>%
rstanarm::posterior_interval(regex_pars = "Species") %>%
round(2)
# differences from compare_pairs()
pair_diffs %>%
tidyr::spread(pair, value) %>%
select(-draw) %>%
as.matrix() %>%
posterior_interval() %>%
round(2)
```