---
title: "Fitting The Cauchy Distribution"
author: "Michael Betancourt"
date: "January 2018"
output:
  html_document:
    fig_caption: yes
    theme: spacelab #sandstone #spacelab #flatly
    highlight: pygments
    toc: TRUE
    toc_depth: 2
    number_sections: TRUE
    toc_float:
      smooth_scroll: FALSE
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(comment=NA)
```
The Cauchy density function enjoys a notorious reputation in mathematics,
serving as a common counterexample to identify incomplete proofs in probability
theory. At the same time it has also been a consistent source of frustration
for statistical computation. In this case study I review various ways of
implementing the Cauchy distribution, from the nominal implementation to
alternative implementations aimed at ameliorating these difficulties, and
demonstrate their relative performance.
# The Cauchy Distribution
The Cauchy distribution seems innocent enough, with a probability density
function given by the rational function,
$$
\pi(x) = \frac{1}{\pi \, s}
\frac{ s^{2} }{ (x - m)^{2} + s^{2}}.
$$
Here the location, $m$, quantifies where the corresponding probability
distribution concentrates while the scale, $s$, quantifies the breadth of that
distribution.
```{r}
c_light <- c("#DCBCBC")
c_light_highlight <- c("#C79999")
c_mid <- c("#B97C7C")
c_mid_highlight <- c("#A25050")
c_dark <- c("#8F2727")
c_dark_highlight <- c("#7C0000")
x <- seq(-10, 10, 0.001)
plot(x, dcauchy(x, location = 0, scale = 1), type="l", col=c_dark_highlight, lwd=2,
main="", xlab="x", ylab="Probability Density", yaxt='n')
```
This probability density function, however, defines a probability distribution
with extremely long tails that places significant probability in neighborhoods
far away from $x = m$. We can see just how heavy these tails are by comparing
the quantile function of the standard Cauchy density to that of a standard
Gaussian density,
```{r}
x <- seq(0, 1, 0.001)
plot(x, qcauchy(x, location = 0, scale = 1), type="l", col=c_dark_highlight, lwd=2,
main="", xlab="Probability", ylab="Quantile")
lines(x, qnorm(x, 0, 1), type="l", col=c_light_highlight, lwd=2)
text(x=0.9, y=250, labels="Cauchy", col=c_dark_highlight)
text(x=0.9, y=-50, labels="Normal", col=c_light_highlight)
```
$95\%$ of the probability of the standard Gaussian distribution lies below
$x = 1.6$, but we would have to go out to $x = 6.3$ to contain the same
probability for the Cauchy distribution. To contain $99\%$ of the probability
we would need to go only to $x = 2.3$ for the Gaussian distribution, but all
the way to $x = 31.8$ for the Cauchy distribution!
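These values can be checked directly with the built-in quantile functions; a
quick sanity check that is not part of the original analysis:
```{r}
# 95% and 99% quantiles of the standard Gaussian and standard Cauchy
# distributions, matching the figures quoted above
round(qnorm(c(0.95, 0.99), mean = 0, sd = 1), 1)
round(qcauchy(c(0.95, 0.99), location = 0, scale = 1), 1)
```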
These extremely heavy tails yield all kinds of surprising behavior for
distributions defined with Cauchy density functions. All of the even moments of
the distribution are infinite whereas all of the odd moments are not even well
defined. In order to avoid pathological behavior we have to restrict ourselves
to characterizing these Cauchy distributions with quantiles.
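To see just how badly expectations misbehave, here is a quick simulation
sketch that is not part of the original analysis. The sample mean of standard
Cauchy draws is itself distributed as a standard Cauchy no matter how many
draws we average, so it never concentrates, whereas the sample median
concentrates nicely around zero.
```{r}
# Compare sample means and sample medians across independent replications.
# The means remain Cauchy(0, 1) distributed regardless of the number of
# draws averaged, while the medians concentrate around the true median, 0.
set.seed(58583389)
n_draws <- 10000
means   <- replicate(500, mean(rcauchy(n_draws, location = 0, scale = 1)))
medians <- replicate(500, median(rcauchy(n_draws, location = 0, scale = 1)))
quantile(means,   probs = c(0.05, 0.5, 0.95))
quantile(medians, probs = c(0.05, 0.5, 0.95))
```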
The Cauchy density was once recommended as the default for weakly informative
prior densities, but the behavior of these heavy tails, especially the
weak containment of probability around the location, $x = m$, has proven to be
too ungainly in practice. Consequently we have since moved towards recommending
Gaussian densities for weakly informative prior densities. Still, because the
Cauchy density arises as a component in more sophisticated prior densities, such
as the horseshoe and its generalizations, understanding how best to fit it
remains important.
# The Nominal Implementation of the Cauchy Density Function
The heavy tails of the Cauchy density function make it notoriously difficult
to estimate its expectation values. In particular Random Walk Metropolis, the
Metropolis-Adjusted Langevin Algorithm, and even static Hamiltonian Monte Carlo
all fail to provide accurate estimates for these expectations. The problem
is that for accurate estimation any algorithm will have to explore the massive
extent of the heavy tails, but once out in those tails most algorithms have
difficulty returning to the bulk of the distribution around $x = m$.
How does the dynamic Hamiltonian Monte Carlo method used in Stan fare?
It's easy enough to check, here with a product of fifty Cauchy density
functions and the indicator function for the interval $-1 \le x \le 1$,
```{r}
writeLines(readLines("cauchy_nom.stan"))
```
Pushing the program through Stan,
```{r}
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())
util <- new.env()
source('stan_utility.R', local=util)
source('plot_utility.R', local=util)
fit_nom <- stan(file='cauchy_nom.stan', seed=4938483,
warmup=1000, iter=11000, control=list(max_treedepth=20))
util$check_all_diagnostics(fit_nom, max_depth=20)
```
After increasing the maximum treedepth to a pretty high value we see no
indications of pathological fitting behavior. We do, however, see $\hat{k}$
warnings which indicate that the $x$ parameters probably don't have finite
means and variances and hence well-defined effective sample sizes!
Consequently we shouldn't try to estimate the means or the effective sample
sizes of these parameters and should instead focus on quantiles of each $x$
or on the mean of the included indicator function, $I$.
Indeed Stan is able to recover the $5\%$, $50\%$, and $95\%$ quantiles of
each parameter quite accurately,
```{r}
util$plot_estimated_quantiles(fit_nom, "Nominal Parameterization")
```
The variable integration times in dynamic Hamiltonian Monte Carlo allow
extremely long trajectories once we're out in the tails of the Cauchy
density function. These trajectories very slowly, but surely, carry us back
into the bulk of the distribution no matter how far into the tails we have
sojourned. The dynamic nature of the trajectories is important here -- once we
are deep enough into the tails any static but finite integration time will not
be long enough to ensure that we return!
Considering that most algorithms fail to accurately fit distributions defined
with Cauchy density functions, this is a pretty remarkable achievement for Stan.
That said, the long trajectories required for this accuracy come at a
significant computational cost. Consequently let's consider alternative
implementations of the Cauchy distribution that can be fit with fewer
computational resources.
# First Alternative Implementation
An interesting property of the Cauchy density function is that it arises as a
scale mixture of Gaussian density functions,
$$
\text{Cauchy}(x \mid m, s)
= \int \mathrm{d} \tau \,
\text{Normal} \, \left(x \mid m, \tau^{-\frac{1}{2}} \right)
\, \text{Gamma} \, \left(\tau \mid \frac{1}{2}, \frac{s^{2}}{2} \right).
$$
In other words, if
$$
x_{a} \sim \text{Normal} \, \left(0, 1 \right)
$$
and
$$
x_{b} \sim \text{Gamma} \, \left(\frac{1}{2}, \frac{s^{2}}{2} \right)
$$
then
$$
x = m + \frac{ x_{a} }{ \sqrt{x_{b}} }
$$
follows a distribution specified by a $\text{Cauchy}(m, s)$ density function.
Note that I am using the Stan conventions for the normal and gamma density
function parameterizations.
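We can verify this representation with a quick simulation; this is only a
sketch, not part of the original analysis, using the illustrative values
$m = 0$ and $s = 1$.
```{r}
# Simulation check of the scale mixture representation.  R's rgamma uses
# the same shape/rate convention as Stan's gamma density.
set.seed(4938483)
m <- 0
s <- 1
x_a <- rnorm(100000, 0, 1)
x_b <- rgamma(100000, shape = 0.5, rate = s^2 / 2)
x <- m + x_a / sqrt(x_b)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy(probs, location = m, scale = s))
```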
This property suggests a parameter expansion approach to implementing the Cauchy
distribution where we fit $x_{a}$ and $x_{b}$ and then derive $x$
deterministically. Although this requires twice the number of parameters, the
resulting joint density function is much more concentrated than the Cauchy
density function and hence significantly easier to fit.
```{r}
x <- seq(-3, 3, 0.05)
y <- seq(-9, 1, 0.05)
n_x <- length(x)
n_y <- length(y)
z <- matrix(nrow=n_x, ncol=n_y)
for (i in 1:n_x) for (j in 1:n_y)
z[i, j] <- dnorm(x[i], 0, 1) * dgamma(exp(y[j]), 0.5, 1 / 0.5) * exp(y[j])
contour(x, y, z, levels=seq(0.05, 1, 0.05) * max(z), drawlabels=FALSE,
main="First Alternative", xlab="x_a", ylab="log(x_b)",
col=c_dark_highlight, lwd=2)
```
This alternative implementation is straightforward to write as a Stan program,
```{r}
writeLines(readLines("cauchy_alt_1.stan"))
```
and readily fit,
```{r}
fit_1 <- stan(file='cauchy_alt_1.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_1)
```
There are no indications of pathological behavior with Stan's default
settings, although we are again warned to be careful about interpreting
expectations of the reconstructed Cauchy parameters. The quantiles of each of
those recovered Cauchy parameters, however, once again prove to be no problem,
```{r}
util$plot_estimated_quantiles(fit_1, "First Alternative")
```
# Second Alternative Implementation
This first alternative implementation immediately suggests a second. We can
avoid the division in the recovery of $x$ by giving $x_{b}$ an _inverse_ gamma
density function. Here we let
$$
x_{a} \sim \text{Normal} \, \left(0, 1 \right)
$$
and
$$
x_{b} \sim \text{Inv-Gamma} \, \left(\frac{1}{2}, \frac{s^{2}}{2} \right)
$$
from which
$$
x = m + x_{a} \cdot \sqrt{x_{b}}
$$
will follow a distribution specified by a $\text{Cauchy}(m, s)$ density function.
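R has no built-in inverse gamma sampler, but an $\text{Inv-Gamma}(\alpha, \beta)$
draw is just the reciprocal of a $\text{Gamma}(\alpha, \beta)$ draw, so the same
kind of simulation check goes through; again a sketch that is not part of the
original analysis, with $m = 0$ and $s = 1$.
```{r}
# Simulation check of the inverse gamma version.  An Inv-Gamma(alpha, beta)
# draw is the reciprocal of a Gamma(alpha, beta) draw, where beta is a rate
# in the gamma density and a scale in the inverse gamma density.
set.seed(4938483)
m <- 0
s <- 1
x_a <- rnorm(100000, 0, 1)
x_b <- 1 / rgamma(100000, shape = 0.5, rate = s^2 / 2)
x <- m + x_a * sqrt(x_b)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy(probs, location = m, scale = s))
```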
Although this may seem like a small change, the division operator and its
derivatives are significantly slower to evaluate than the multiplication
operator and its derivatives. Hence the change can yield nontrivial performance
improvements.
This small change yields a joint density function that mirrors the joint density
function of the first alternative implementation and its pleasant geometry.
```{r}
x <- seq(-3, 3, 0.05)
y <- seq(-1, 9, 0.05)
n_x <- length(x)
n_y <- length(y)
z <- matrix(nrow=n_x, ncol=n_y)
for (i in 1:n_x) for (j in 1:n_y)
z[i, j] <- dnorm(x[i], 0, 1) * dgamma(exp(-y[j]), 0.5, 1 / 0.5) * exp(-y[j])
contour(x, y, z, levels=seq(0.05, 1, 0.05) * max(z), drawlabels=FALSE,
main="Second Alternative", xlab="x_a", ylab="log(x_b)",
col=c_dark_highlight, lwd=2)
```
The corresponding Stan program is given by a small tweak
```{r}
writeLines(readLines("cauchy_alt_2.stan"))
```
and the fit
```{r}
fit_2 <- stan(file='cauchy_alt_2.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_2)
```
proceeds with no issues save for the warnings about the recovered Cauchy
parameters. As before, we accurately recover the quantiles of each Cauchy
distributed component in our model,
```{r}
util$plot_estimated_quantiles(fit_2, "Second Alternative")
```
# Third Alternative Implementation
The final alternative implementation that we will consider utilizes the
inverse cumulative distribution function of the Cauchy density. In particular,
if
$$
\tilde{x} \sim \text{Uniform} \, \left(0, 1 \right)
$$
then
$$
x = m + s \cdot \tan \, \left(\pi \left(\tilde{x} - \frac{1}{2} \right) \right)
$$
follows a distribution specified by a $\text{Cauchy}(m, s)$ density function.
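This transformation is also easy to check by direct simulation; a sketch that
is not part of the original analysis, with $m = 0$ and $s = 1$.
```{r}
# Inverse CDF sampling check: push uniform draws through the Cauchy
# quantile function and compare against the exact quantiles.
set.seed(4938483)
m <- 0
s <- 1
x_tilde <- runif(100000, 0, 1)
x <- m + s * tan(pi * (x_tilde - 0.5))
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy(probs, location = m, scale = s))
```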
The latent density of $\text{logit}(\tilde{x})$, which is what Stan
ultimately utilizes, exhibits much more reasonable tails than the Cauchy
distribution,
```{r}
x <- seq(-10, 10, 0.001)
plot(x, exp(-x) / (1 + exp(-x))**2, type="l", col=c_dark_highlight, lwd=2,
main="Third Alternative", xlab="logit(x_tilde)",
ylab="Probability Density", yaxt='n')
```
This final implementation requires only a sparse Stan program,
```{r}
writeLines(readLines("cauchy_alt_3.stan"))
```
and the fit proceeds without any indications of problems,
```{r}
fit_3 <- stan(file='cauchy_alt_3.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_3)
```
Once again by modifying the geometry of the target distribution we
accurately recover the quantiles of the Cauchy distribution without
the excessive cost of its nominal implementation,
```{r}
util$plot_estimated_quantiles(fit_3, "Third Alternative")
```
# Comparing Performance of the Various Implementations
To quantify how much better these alternative implementations perform let's
consider the only well-defined performance metric for Markov chain Monte Carlo:
the number of effective samples per computational cost. Here we use run time as
a proxy for computational cost and consider the effective sample size per time
for the indicator function constructed in each of our models, which, unlike the
independent Cauchy components, has a finite mean and variance and hence a
well-defined effective sample size.
```{r}
r_nom <- (summary(fit_nom, probs=NA)$summary)[51,5] /
sum(get_elapsed_time(fit_nom)[,2])
r_1 <- (summary(fit_1, probs=NA)$summary)[151,5] /
sum(get_elapsed_time(fit_1)[,2])
r_2 <- (summary(fit_2, probs=NA)$summary)[151,5] /
sum(get_elapsed_time(fit_2)[,2])
r_3 <- (summary(fit_3, probs=NA)$summary)[101,5] /
sum(get_elapsed_time(fit_3)[,2])
plot(1:4, c(r_nom, r_1, r_2, r_3), type="l", lwd=2, col=c_dark,
main="", xaxt = "n", xlab="", ylab="ESS / Time (s)",
xlim=c(0, 5), ylim=c(0, 4000))
axis(1, at=1:4, labels=c("Nom", "Alt 1", "Alt 2", "Alt 3"))
```
Immediately we see that the alternative implementations are drastically better
than the nominal implementation. Moreover, the second alternative
implementation is almost twice as fast as the first. Perhaps surprisingly,
using an inverse Gamma distribution to avoid a division ends up being a
significant improvement.
We can also compare the geometry of these implementations by considering only
the number of gradient evaluations per iteration and ignoring the actual cost of
each evaluation.
```{r}
probs <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
sampler_params <- get_sampler_params(fit_nom, inc_warmup=FALSE)
n_grad_nom <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
sampler_params <- get_sampler_params(fit_1, inc_warmup=FALSE)
n_grad_1 <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
sampler_params <- get_sampler_params(fit_2, inc_warmup=FALSE)
n_grad_2 <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
sampler_params <- get_sampler_params(fit_3, inc_warmup=FALSE)
n_grad_3 <- quantile(do.call(rbind, sampler_params)[,'n_leapfrog__'],
probs = probs)
idx <- c(1, 1, 2, 2, 3, 3, 4, 4)
x <- c(0.5, 1.5, 1.5, 2.5, 2.5, 3.5, 3.5, 4.5)
cred <- data.frame(n_grad_nom, n_grad_1, n_grad_2, n_grad_3)
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n]))
plot(1, type="n", main="", xaxt = "n", xlab="",
ylab="Gradient Evaluations Per Iteration",
xlim=c(0.5, 4.5), ylim=c(0, 8000))
axis(1, at=1:4, labels=c("Nom", "Alt 1", "Alt 2", "Alt 3"))
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
```
As expected the alternative implementations require far fewer gradient
evaluations per iteration because they don't exhibit the heavy tails that
demand increasingly long trajectories.
Zooming in to the alternative implementations,
```{r}
probs <- c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9)
idx <- c(1, 1, 2, 2, 3, 3)
x <- c(0.5, 1.5, 1.5, 2.5, 2.5, 3.5)
cred <- data.frame(n_grad_1, n_grad_2, n_grad_3)
pad_cred <- do.call(cbind, lapply(idx, function(n) cred[1:9,n]))
plot(1, type="n", main="", xaxt = "n", xlab="",
ylab="Gradient Evaluations Per Iteration",
xlim=c(0.5, 3.5), ylim=c(0, 50))
axis(1, at=1:3, labels=c("Alt 1", "Alt 2", "Alt 3"))
polygon(c(x, rev(x)), c(pad_cred[1,], rev(pad_cred[9,])),
col = c_light, border = NA)
polygon(c(x, rev(x)), c(pad_cred[2,], rev(pad_cred[8,])),
col = c_light_highlight, border = NA)
polygon(c(x, rev(x)), c(pad_cred[3,], rev(pad_cred[7,])),
col = c_mid, border = NA)
polygon(c(x, rev(x)), c(pad_cred[4,], rev(pad_cred[6,])),
col = c_mid_highlight, border = NA)
lines(x, pad_cred[5,], col=c_dark, lwd=2)
```
we see that the third implementation requires the fewest gradient evaluations
and hence features the nicest geometry. That said, the tangent function is
more expensive to evaluate than the gamma and inverse gamma density
functions, so in terms of overall efficiency the second implementation proves
superior.
# The Half Cauchy Density Function
These alternative implementations can also be modified to work for the half
Cauchy density function that is defined over only positive values. In
particular, for the first and second alternative implementations we just have to constrain $x_{a}$ to be positive. In the third we just have to set
$$
x = \tan \, \left(\frac{\pi}{2} \tilde{x} \right)
$$
to yield a $\text{Half-Cauchy}(0, 1)$ distribution.
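As a quick sanity check that is not part of the original analysis, the
$\text{Half-Cauchy}(0, 1)$ quantile at probability $p$ coincides with the
$\text{Cauchy}(0, 1)$ quantile at $(1 + p) / 2$, which the transformed uniform
draws reproduce.
```{r}
# Half Cauchy check: transformed uniform draws against the exact
# Half-Cauchy(0, 1) quantiles, computed from the full Cauchy quantiles.
set.seed(4938483)
x_tilde <- runif(100000, 0, 1)
x <- tan(0.5 * pi * x_tilde)
probs <- c(0.05, 0.25, 0.5, 0.75, 0.95)
rbind(empirical = quantile(x, probs = probs),
      exact = qcauchy((1 + probs) / 2, location = 0, scale = 1))
```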
To demonstrate let's compare a nominal half Cauchy implementation,
```{r}
writeLines(readLines("half_cauchy_nom.stan"))
fit_half_nom <- stan(file='half_cauchy_nom.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_half_nom)
```
to the second alternative implementation,
```{r}
writeLines(readLines("half_cauchy_alt.stan"))
fit_half_reparam <- stan(file='half_cauchy_alt.stan', seed=4938483,
warmup=1000, iter=11000)
util$check_all_diagnostics(fit_half_reparam)
```
Comparing the 10th identical component of our model we see that the two
implementations yield equivalent results,
```{r}
x <- extract(fit_half_nom)$x[,10]
p1 <- hist(x[x < 25], breaks=seq(0, 25, 0.25), plot=FALSE)
p1$counts <- p1$counts / sum(p1$counts)
x <- extract(fit_half_reparam)$x[,10]
p2 <- hist(x[x < 25], breaks=seq(0, 25, 0.25), plot=FALSE)
p2$counts <- p2$counts / sum(p2$counts)
c_light_trans <- c("#DCBCBC80")
c_light_highlight_trans <- c("#C7999980")
c_dark_trans <- c("#8F272780")
c_dark_highlight_trans <- c("#7C000080")
plot(p1, col=c_dark_trans, border=c_dark_highlight_trans,
main="", xlab="x[10]", yaxt='n', ylab="")
plot(p2, col=c_light_trans, border=c_light_highlight_trans, add=T)
```
# Conclusion
Although the nominal implementation of the Cauchy density function frustrates
computational algorithms with either bias or slow execution, there exist
multiple alternative implementations that yield an equivalent distribution
without the excessive cost.
The practical consequences of these implementations, however, may not ultimately
be all that significant. Having a Cauchy distribution in the generative model
does not necessarily imply that the heavy tails will persist into the posterior
distribution. In many cases even a little bit of data can tame the heavy tails,
resulting in a pleasant posterior geometry regardless of which implementation of
the Cauchy we use. Nevertheless it is up to the user to remain vigilant and
monitor for indications of heavy tails that would motivate these alternative
implementations.
# Acknowledgements
I thank Aki Vehtari and Junpeng Lao for helpful comments.
A very special thanks to everyone supporting me on Patreon: Aki Vehtari,
Alan O'Donnell, Andre Zapico, Andrew Rouillard, Austin Rochford, Avraham Adler,
Bo Schwartz Madsen, Bryan Yu, Cat Shark, Charles Naylor, Christopher Howlin,
Colin Carroll, Daniel Simpson, David Pascall, David Roher, David Stanard,
Ed Cashin, Eddie Landesberg, Elad Berkman, Eric Jonas, Ethan Goan,
Finn Lindgren, Granville Matheson, Hernan Bruno, J Michael Burgess,
Joel Kronander, Jonas Beltoft Gehrlein, Joshua Mayer, Justin Bois,
Lars Barquist, Luiz Carvalho, Marek Kwiatkowski, Matthew Kay,
Maurits van der Meer, Maxim Kesin, Michael Dillon, Michael Redman,
Noah Silverman, Ole Rogeberg, Oscar Olvera, Paul Oreto, Peter Heinrich,
Putra Manggala, Ravin Kumar, Riccardo Fusaroli, Richard Torkar, Robert Frost,
Robin Taylor, Sam Petulla, Sam Zorowitz, Seo-young Kim, Seth Axen,
Sharan Banagiri, Simon Duane, Stephen Oates, Stijn, Vladislavs Dovgalecs,
and yolha.
# Original Computing Environment
```{r, comment=NA}
writeLines(readLines(file.path(Sys.getenv("HOME"), ".R/Makevars")))
```
```{r, comment=NA}
devtools::session_info("rstan")
```