---
title: "Fitting The Cauchy Distribution"
author: "Michael Betancourt"
date: "January 2018"
output:
  html_document:
    fig_caption: yes
    theme: spacelab #sandstone #spacelab #flatly
    highlight: pygments
    toc: TRUE
    toc_depth: 2
    number_sections: TRUE
    toc_float:
      smooth_scroll: FALSE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment=NA)
```
The Cauchy density function enjoys a notorious reputation in mathematics,
serving as a common counterexample to identify incomplete proofs in probability
theory. At the same time it has also been a consistent source of frustration
for statistical computation. In this case study I review various ways of
implementing the Cauchy distribution, from the nominal implementation to
alternative implementations aimed at ameliorating these difficulties, and
demonstrate their relative performance.
# The Cauchy Distribution
The Cauchy distribution seems innocent enough, with a probability density
function given by the rational function,
$$
\pi(x) = \frac{1}{\pi \, s}
\frac{ s^{2} }{ (x - m)^{2} + s^{2}}.
$$
Here the location, $m$, quantifies where the corresponding probability
distribution concentrates while the scale, $s$, quantifies the breadth of that
distribution.
```{r}
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
x <- seq(-10, 10, 0.001)
plot(x, dcauchy(x, location = 0, scale = 1), type="l", col=c_dark_highlight, lwd=2,
main="", xlab="x", ylab="Probability Density", yaxt='n')
```
This probability density function, however, defines a probability distribution
with extremely long tails that places significant probability in neighborhoods
far away from $x = m$. We can see just how heavy these tails are by comparing
the quantile function of the standard Cauchy density to that of a standard
Gaussian density,
```{r}
x <- seq(0, 1, 0.001)
plot(x, qcauchy(x, location = 0, scale = 1), type="l", col=c_dark_highlight, lwd=2,
main="", xlab="Probability", ylab="Quantile")
lines(x, qnorm(x, 0, 1), type="l", col=c_light_highlight, lwd=2)
text(x=0.9, y=250, labels="Cauchy", col=c_dark_highlight)
text(x=0.9, y=-50, labels="Normal", col=c_light_highlight)
```
$95\%$ of the probability of the standard Gaussian distribution lies below
$x = 1.6$, but we would have to go out to $x = 6.3$ to contain the same
probability for the Cauchy distribution. To contain $99\%$ of the probability
we would need to go only to $x = 2.3$ for the Gaussian distribution, but all
the way to $x = 31.8$ for the Cauchy distribution!
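These values can be checked directly with the built-in quantile functions; a
quick sanity check that is not part of the original analysis:
```{r}
# 95% and 99% quantiles of the standard Gaussian and standard Cauchy
# distributions, matching the figures quoted above
round(qnorm(c(0.95, 0.99), mean = 0, sd = 1), 1)
round(qcauchy(c(0.95, 0.99), location = 0, scale = 1), 1)
```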
These extremely heavy tails yield all kinds of surprising behavior for
distributions defined with Cauchy density functions. All of the even moments of
the distribution are infinite whereas all of the odd moments are not even well
defined. In order to avoid pathological behavior we have to restrict ourselves
to characterizing these Cauchy distributions with quantiles.
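To see just how badly expectations misbehave, here is a quick simulation
sketch that is not part of the original analysis. The sample mean of standard
Cauchy draws is itself distributed as a standard Cauchy no matter how many
draws we average, so it never concentrates, whereas the sample median
concentrates nicely around zero.
```{r}
# Compare sample means and sample medians across independent replications.
# The means remain Cauchy(0, 1) distributed regardless of the number of
# draws averaged, while the medians concentrate around the true median, 0.
set.seed(58583389)
n_draws <- 10000
means   <- replicate(500, mean(rcauchy(n_draws, location = 0, scale = 1)))
medians <- replicate(500, median(rcauchy(n_draws, location = 0, scale = 1)))
quantile(means,   probs = c(0.05, 0.5, 0.95))
quantile(medians, probs = c(0.05, 0.5, 0.95))
```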
The Cauchy density was once recommended as the default for weakly informative
prior densities, but the behavior of these heavy tails, especially the
weak containment of probability around the location, $x = m$, has proven to be
too ungainly in practice. Consequently we have since moved towards recommending
Gaussian densities for weakly informative prior densities. Still, because the
Cauchy density arises as a component in more sophisticated prior densities, such
as the horseshoe and its generalizations, understanding how best to fit it
remains important.
# The Nominal Implementation of the Cauchy Density Function
The heavy tails of the Cauchy density function make it notoriously difficult
to estimate its expectation values. In particular Random Walk Metropolis, the
Metropolis-Adjusted Langevin Algorithm, and even static Hamiltonian Monte Carlo
all fail to provide accurate estimates for these expectations. The problem
is that for accurate estimation any algorithm will have to explore the massive
extent of the heavy tails, but once out in those tails most algorithms have
difficulty returning to the bulk of the distribution around $x = m$.
How does the dynamic Hamiltonian Monte Carlo method used in Stan fare?
It's easy enough to check, here with a product of fifty Cauchy density
functions and the indicator function for the interval $-1 \le x \le 1$,
```{r}
writeLines(readLines("cauchy_nom.stan"))
```
Pushing the program through Stan,
```{r}
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
util <- new.env()
source('stan_utility.R', local=util)
source('plot_utility.R', local=util)
fit_nom <- stan(file='cauchy_nom.stan', seed=4938483,
warmup=1000, iter=11000, control=list(max_treedepth=20))
util$check_all_diagnostics(fit_nom, max_depth=20)
```
After increasing the maximum treedepth to a pretty high value we see no
indications of pathological fitting behavior. We do, however, see $\hat{k}$
warnings which indicate that the $x$ parameters probably don't have finite
means and variances and hence well-defined effective sample sizes!
Consequently we shouldn't try to estimate the means or the effective sample
sizes of these parameters and should instead focus on quantiles of each $x$
or on the mean of the included indicator function, $I$.
Indeed Stan is able to recover the $5\%$, $50\%$, and $95\%$ quantiles of
each parameter quite accurately,
```{r}
util$plot_estimated_quantiles(fit_nom, "Nominal Parameterization")
```
The variable integration times in dynamic Hamiltonian Monte Carlo allow
extremely long trajectories once we're out in the tails of the Cauchy
density function. These trajectories very slowly, but surely, carry us back
into the bulk of the distribution no matter how far into the tails we have
sojourned. The dynamic nature of the trajectories is important here -- once we
are deep enough into the tails any static but finite integration time will not
be long enough to ensure that we return!
Considering that most algorithms fail to accurately fit distributions defined
with Cauchy density functions, this is a pretty remarkable achievement for Stan.
That said, the long trajectories required for this accuracy come at a
significant computational cost. Consequently let's consider alternative
implementations of the Cauchy distribution that can be fit with fewer
computational resources.
# First Alternative Implementation
An interesting property of the Cauchy density function is that it arises as a
scale mixture of Gaussian density functions,
$$
\text{Cauchy}(x \mid m, s)
= \int \mathrm{d} \tau \,
\text{Normal} \, \left(x \mid m, \tau^{-\frac{1}{2}} \right)
\, \text{Gamma} \, \left(\tau \mid \frac{1}{2}, \frac{s^{2}}{2} \right).
$$
In other words, if
$$
x_{a} \sim \text{Normal} \, \left(0, 1 \right)
$$
and
$$
x_{b} \sim \text{Gamma} \, \left(\frac{1}{2}, \frac{s^{2}}{2} \right)
$$
then
$$
x = m + \frac{ x_{a} }{ \sqrt{x_{b}} }
$$
follows a distribution specified by a $\text{Cauchy}(m, s)$ density function.
Note that I am using the Stan conventions for the normal and gamma density
function parameterizations.
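We can verify this representation with a quick simulation; this is only a
sketch, not part of the original analysis, using the illustrative values
$m = 0$ and $s = 1$.
```{r}
# Simulation check of the scale mixture representation.  R's rgamma uses
# the same shape/rate convention as Stan's gamma density.
set.seed(4938483)
m <- 0
s <- 1
x_a <- rnorm(100000, 0, 1)
x_b <- rgamma(100000, shape = 0.5, rate = s^2 / 2)
x <- m + x_a / sqrt(x_b)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy(probs, location = m, scale = s))
```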
This property suggests a parameter expansion approach to implementing the Cauchy
distribution where we fit $x_{a}$ and $x_{b}$ and then derive $x$
deterministically. Although this requires twice the number of parameters, the
resulting joint density function is much more concentrated than the Cauchy
density function and hence significantly easier to fit.
```{r}
x <- seq(-3, 3, 0.05)
y <- seq(-9, 1, 0.05)
n_x <- length(x)
n_y <- length(y)
z <- matrix(nrow=n_x, ncol=n_y)
for (i in 1:n_x) for (j in 1:n_y)
z[i, j] <- dnorm(x[i], 0, 1) * dgamma(exp(y[j]), 0.5, 1 / 0.5) * exp(y[j])
contour(x, y, z, levels=seq(0.05, 1, 0.05) * max(z), drawlabels=FALSE,
main="First Alternative", xlab="x_a", ylab="log(x_b)",
col=c_dark_highlight, lwd=2)
```
This alternative implementation is straightforward to write as a Stan program,
```{r}
writeLines(readLines("cauchy_alt_1.stan"))
```
and readily fit,
```{r}
fit_1 <- stan(file='cauchy_alt_1.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_1)
```
There are no indications of pathological behavior with Stan's default
settings, although we are again warned to be careful about interpreting
expectations of the reconstructed Cauchy parameters. The quantiles of each of
those recovered Cauchy parameters, however, once again prove to be no problem,
```{r}
util$plot_estimated_quantiles(fit_1, "First Alternative")
```
# Second Alternative Implementation
This first alternative implementation immediately suggests a second. We can
avoid the division in the recovery of $x$ by giving $x_{b}$ an _inverse_ gamma
density function. Here we let
$$
x_{a} \sim \text{Normal} \, \left(0, 1 \right)
$$
and
$$
x_{b} \sim \text{Inv-Gamma} \, \left(\frac{1}{2}, \frac{s^{2}}{2} \right)
$$
from which
$$
x = m + x_{a} \cdot \sqrt{x_{b}}
$$
will follow a distribution specified by a $\text{Cauchy}(m, s)$ density function.
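R has no built-in inverse gamma sampler, but an $\text{Inv-Gamma}(\alpha, \beta)$
draw is just the reciprocal of a $\text{Gamma}(\alpha, \beta)$ draw, so the same
kind of simulation check goes through; again a sketch that is not part of the
original analysis, with $m = 0$ and $s = 1$.
```{r}
# Simulation check of the inverse gamma version.  An Inv-Gamma(alpha, beta)
# draw is the reciprocal of a Gamma(alpha, beta) draw, where beta is a rate
# in the gamma density and a scale in the inverse gamma density.
set.seed(4938483)
m <- 0
s <- 1
x_a <- rnorm(100000, 0, 1)
x_b <- 1 / rgamma(100000, shape = 0.5, rate = s^2 / 2)
x <- m + x_a * sqrt(x_b)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy(probs, location = m, scale = s))
```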
Although this may seem like a small change, the division operator and its
derivatives are significantly slower to evaluate than the multiplication
operator and its derivatives. Hence the change can yield nontrivial performance
improvements.
This small change yields a joint density function that mirrors the joint density
function of the first alternative implementation and its pleasant geometry.
```{r}
x <- seq(-3, 3, 0.05)
y <- seq(-1, 9, 0.05)
n_x <- length(x)
n_y <- length(y)
z <- matrix(nrow=n_x, ncol=n_y)
for (i in 1:n_x) for (j in 1:n_y)
z[i, j] <- dnorm(x[i], 0, 1) * dgamma(exp(-y[j]), 0.5, 1 / 0.5) * exp(-y[j])
contour(x, y, z, levels=seq(0.05, 1, 0.05) * max(z), drawlabels=FALSE,
main="Second Alternative", xlab="x_a", ylab="log(x_b)",
col=c_dark_highlight, lwd=2)
```
The corresponding Stan program is given by a small tweak
```{r}
writeLines(readLines("cauchy_alt_2.stan"))
```
and the fit
```{r}
fit_2 <- stan(file='cauchy_alt_2.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_2)
```
proceeds with no issues save for the warnings about the recovered Cauchy
parameters. As before, we accurately recover the quantiles of each Cauchy
distributed component in our model,
```{r}
util$plot_estimated_quantiles(fit_2, "Second Alternative")
```
# Third Alternative Implementation
The final alternative implementation that we will consider utilizes the
inverse cumulative distribution function of the Cauchy density. In particular,
if
$$
\tilde{x} \sim \text{Uniform} \, \left(0, 1 \right)
$$
then
$$
x = m + s \cdot \tan \, \left(\pi \left(\tilde{x} - \frac{1}{2} \right) \right)
$$
follows a distribution specified by a $\text{Cauchy}(m, s)$ density function.
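This transformation is also easy to check by direct simulation; a sketch that
is not part of the original analysis, with $m = 0$ and $s = 1$.
```{r}
# Inverse CDF sampling check: push uniform draws through the Cauchy
# quantile function and compare against the exact quantiles.
set.seed(4938483)
m <- 0
s <- 1
x_tilde <- runif(100000, 0, 1)
x <- m + s * tan(pi * (x_tilde - 0.5))
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy(probs, location = m, scale = s))
```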
The latent density of $\text{logit}(\tilde{x})$, which is what Stan
ultimately utilizes, exhibits much more reasonable tails than the Cauchy
distribution,
```{r}
x <- seq(-10, 10, 0.001)
plot(x, exp(-x) / (1 + exp(-x))**2, type="l", col=c_dark_highlight, lwd=2,
main="Third Alternative", xlab="logit(x_tilde)",
ylab="Probability Density", yaxt='n')
```
This final implementation requires only a sparse Stan program,
```{r}
writeLines(readLines("cauchy_alt_3.stan"))
```
and the fit proceeds without any indications of problems,
```{r}
fit_3 <- stan(file='cauchy_alt_3.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_3)
```
Once again by modifying the geometry of the target distribution we
accurately recover the quantiles of the Cauchy distribution without
the excessive cost of its nominal implementation,
```{r}
util$plot_estimated_quantiles(fit_3, "Third Alternative")
```
# Comparing Performance of the Various Implementations
To quantify how much better these alternative implementations perform let's
consider the only well-defined performance metric for Markov chain Monte Carlo:
the number of effective samples per computational cost. Here we use run time as
a proxy for computational cost and consider the effective sample size per time
for the indicator function constructed in each of our models, which, unlike the
independent Cauchy components, has a finite mean and variance and hence a
well-defined effective sample size.
```{r}
r_nom <- (summary(fit_nom, probs=NA)$summary)[51,5] /
sum(get_elapsed_time(fit_nom)[,2])
r_1 <- (summary(fit_1, probs=NA)$summary)[151,5] /
sum(get_elapsed_time(fit_1)[,2])
r_2 <- (summary(fit_2, probs=NA)$summary)[151,5] /
sum(get_elapsed_time(fit_2)[,2])
r_3 <- (summary(fit_3, probs=NA)$summary)[101,5] /
sum(get_elapsed_time(fit_3)[,2])
plot(1:4, c(r_nom, r_1, r_2, r_3), type="l", lwd=2, col=c_dark,
main="", xaxt = "n", xlab="", ylab="ESS / Time (s)",
xlim=c(0, 5), ylim=c(0, 4000))
axis(1, at=1:4, labels=c("Nom", "Alt 1", "Alt 2", "Alt 3"))
```
Immediately we see that the alternative implementations are drastically better
than the nominal implementation. Moreover, the second alternative
implementation is almost twice as fast as the first. Perhaps surprisingly,
using an inverse Gamma distribution to avoid a division ends up being a
significant improvement.
We can also compare the geometry of these implementations by considering only
the number of gradient evaluations per iteration and ignoring the actual cost of
each evaluation.
```{r}
probs <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
sampler_params <- get_sampler_params(fit_nom, inc_warmup=FALSE)
n_grad_nom <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
sampler_params <- get_sampler_params(fit_1, inc_warmup=FALSE)
n_grad_1 <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
sampler_params <- get_sampler_params(fit_2, inc_warmup=FALSE)
n_grad_2 <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
sampler_params <- get_sampler_params(fit_3, inc_warmup=FALSE)
n_grad_3 <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
idx <- c(1, 1, 2, 2, 3, 3, 4, 4)
x <- c(0.5, 1.5, 1.5, 2.5, 2.5, 3.5, 3.5, 4.5)
cred <- data.frame(n_grad_nom, n_grad_1, n_grad_2, n_grad_3)
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n]))
plot(1, type="n", main="", xaxt = "n", xlab="",
ylab="Gradient Evaluations Per Iteration",
xlim=c(0.5, 4.5), ylim=c(0, 8000))
axis(1, at=1:4, labels=c("Nom", "Alt 1", "Alt 2", "Alt 3"))
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
```
As expected the alternative implementations require far fewer gradient
evaluations per iteration because they don't exhibit the heavy tails that
demand increasingly long trajectories.
Zooming in to the alternative implementations,
```{r}
probs <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
idx <- c(1, 1, 2, 2, 3, 3)
x <- c(0.5, 1.5, 1.5, 2.5, 2.5, 3.5)
cred <- data.frame(n_grad_1, n_grad_2, n_grad_3)
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n]))
plot(1, type="n", main="", xaxt = "n", xlab="",
ylab="Gradient Evaluations Per Iteration",
xlim=c(0.5, 3.5), ylim=c(0, 50))
axis(1, at=1:3, labels=c("Alt 1", "Alt 2", "Alt 3"))
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
```
we see that the third implementation requires the fewest gradient evaluations
and hence features the nicest geometry. That said, the tangent function is
more expensive to evaluate than the gamma and inverse gamma density
functions, so in terms of overall efficiency the second implementation proves
superior.
# The Half Cauchy Density Function
These alternative implementations can also be modified to work for the half
Cauchy density function that is defined over only positive values. In
particular, for the first and second alternative implementations we just have to constrain $x_{a}$ to be positive. In the third we just have to set
$$
x = \tan \, \left(\frac{\pi}{2} \tilde{x} \right)
$$
to yield a $\text{Half-Cauchy}(0, 1)$ distribution.
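As a quick sanity check that is not part of the original analysis, the
$\text{Half-Cauchy}(0, 1)$ quantile at probability $p$ coincides with the
$\text{Cauchy}(0, 1)$ quantile at $(1 + p) / 2$, which the transformed uniform
draws reproduce.
```{r}
# Half Cauchy check: transformed uniform draws against the exact
# Half-Cauchy(0, 1) quantiles, computed from the full Cauchy quantiles.
set.seed(4938483)
x_tilde <- runif(100000, 0, 1)
x <- tan(0.5 * pi * x_tilde)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy((1 + probs) / 2, location = 0, scale = 1))
```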
To demonstrate let's compare a nominal half Cauchy implementation,
```{r}
writeLines(readLines("half_cauchy_nom.stan"))
fit_half_nom <- stan(file='half_cauchy_nom.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_half_nom)
```
to the second alternative implementation,
```{r}
writeLines(readLines("half_cauchy_alt.stan"))
fit_half_reparam <- stan(file='half_cauchy_alt.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_half_reparam)
```
Comparing the 10th identical component of our model we see that the two
implementations yield equivalent results,
```{r}
x <- extract(fit_half_nom)$x[,10]
p1 <- hist(x[x < 25], breaks=seq(0, 25, 0.25), plot=FALSE)
p1$counts <- p1$counts / sum(p1$counts)
x <- extract(fit_half_reparam)$x[,10]
p2 <- hist(x[x < 25], breaks=seq(0, 25, 0.25), plot=FALSE)
p2$counts <- p2$counts / sum(p2$counts)
c_light_trans <- c("#DCBCBC80")
c_light_highlight_trans <- c("#C7999980")
c_dark_trans <- c("#8F272780")
c_dark_highlight_trans <- c("#7C000080")
plot(p1, col=c_dark_trans, border=c_dark_highlight_trans,
main="", xlab="x[10]", yaxt='n', ylab="")
plot(p2, col=c_light_trans, border=c_light_highlight_trans, add=T)
```
# Conclusion
Although the nominal implementation of the Cauchy density function frustrates
computational algorithms with either bias or slow execution, there exist
multiple alternative implementations that yield an equivalent distribution
without the excessive cost.
The practical consequences of these implementations, however, may not ultimately
be all that significant. Having a Cauchy distribution in the generative model
does not necessarily imply that the heavy tails will persist into the posterior
distribution. In many cases even a little bit of data can tame the heavy tails,
resulting in a pleasant posterior geometry regardless of which implementation of
the Cauchy we use. Nevertheless it is up to the user to remain vigilant and
monitor for indications of heavy tails that would motivate these alternative
implementations.
# Acknowledgements
I thank Aki Vehtari and Junpeng Lao for helpful comments.
A very special thanks to everyone supporting me on Patreon: Aki Vehtari,
Alan O'Donnell, Andre Zapico, Andrew Rouillard, Austin Rochford, Avraham Adler,
Bo Schwartz Madsen, Bryan Yu, Cat Shark, Charles Naylor, Christopher Howlin,
Colin Carroll, Daniel Simpson, David Pascall, David Roher, David Stanard,
Ed Cashin, Eddie Landesberg, Elad Berkman, Eric Jonas, Ethan Goan,
Finn Lindgren, Granville Matheson, Hernan Bruno, J Michael Burgess,
Joel Kronander, Jonas Beltoft Gehrlein, Joshua Mayer, Justin Bois,
Lars Barquist, Luiz Carvalho, Marek Kwiatkowski, Matthew Kay,
Maurits van der Meer, Maxim Kesin, Michael Dillon, Michael Redman,
Noah Silverman, Ole Rogeberg, Oscar Olvera, Paul Oreto, Peter Heinrich,
Putra Manggala, Ravin Kumar, Riccardo Fusaroli, Richard Torkar, Robert Frost,
Robin Taylor, Sam Petulla, Sam Zorowitz, Seo-young Kim, Seth Axen,
Sharan Banagiri, Simon Duane, Stephen Oates, Stijn, Vladislavs Dovgalecs,
and yolha.
# Original Computing Environment
```{r, comment=NA}
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
```
```{r, comment=NA}
devtools::session_info("rstan")
```