---
layout: distill
title: Generalizing Bayesian Inference
date: 05-08-2021
description: Updating a 250-year-old theorem for the 21st century
bibliography: 2021-08-05-generalizedBayes.bib
---
If you are reading this, you probably know what Bayes' theorem is. Here, we are concerned with the use of Bayes' theorem to perform inference on the parameters of a statistical model (a.k.a. Bayesian inference).
Recently, several generalizations of Bayesian inference have been proposed. The aim of this post is to survey some of these generalizations and give an overview of the pros and cons of each with respect to the original Bayes' theorem. Hopefully, I'll be able to put some order in this rapidly expanding literature and provide some intuition on why we need to move beyond the original theorem.
TL;DR: extensions of Bayesian inference work better in some cases which the original Bayes' theorem does not contemplate (primarily, model misspecification), and come with stronger justifications.
Let's first have a look at Bayes' original theorem This is actually the form that was given to it by Laplace; click here for a nice history of the theorem. :

$$ \pi(\theta \vert x) = \frac{\pi(\theta)\, p(x \vert \theta)}{p(x)}. $$
Here, I am using $$ \theta $$ to denote the parameter on which we want to perform inference, while $$ x $$ denotes the observed data; $$ p(x\vert\theta) $$ is the likelihood of the data under the model.
As the standard textbook description of Bayesian inference says, Bayes' theorem provides a posterior belief $$ \pi(\theta\vert x) $$ by updating the prior belief $$ \pi(\theta) $$ with the information on $$ \theta $$ contained in the observation $$ x $$.
Actually, the denominator in the definition of the posterior is independent of $$ \theta $$, so that Bayes' theorem is often written as:

$$ \pi(\theta \vert x) \propto \pi(\theta)\, p(x \vert \theta), $$

where the proportionality sign refers to both sides of the equation being considered as functions of $$\theta$$ When some posterior expectations need to be computed, the function on the right-hand side needs to be normalized (ie divided by the normalizing constant $p(x)$). However, computing the normalizing constant is not needed when sampling from the posterior with MCMC methods, for instance..
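To make this concrete, here is a minimal numerical sketch (a hypothetical beta-Bernoulli example of my own, not taken from any of the cited works): the unnormalized product $$\pi(\theta)\, p(x\vert\theta)$$ is computed on a grid and normalized only at the very end.

```python
import numpy as np

# Beta(2, 2) prior on the success probability theta of a Bernoulli model.
a, b = 2.0, 2.0
x = np.array([1, 0, 1, 1, 1, 0, 1, 1])  # observed coin flips (6 successes)

theta = np.linspace(1e-6, 1 - 1e-6, 2000)          # grid over the parameter
prior = theta ** (a - 1) * (1 - theta) ** (b - 1)   # unnormalized Beta prior
loglik = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)

unnorm = prior * np.exp(loglik)    # pi(theta) * p(x | theta), up to a constant
post = unnorm / unnorm.sum()       # normalize only at the very end

# The conjugate closed form is Beta(a + #successes, b + #failures),
# so the posterior mean should match (a + 6) / (a + b + 8) = 8 / 12.
post_mean = (theta * post).sum()
print(post_mean)  # ~0.6667
```

The normalizing constant never needs to be written down explicitly; dividing by the sum over the grid plays its role here.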
Additionally, Bayes' theorem is very modular, allowing us to sequentially incorporate new information into our belief: say that we previously had data $$ \mathbf x $$ and computed the posterior $$ \pi(\theta \vert \mathbf x) $$.
So, if we get a new observation $$ y $$, we can use the old posterior as the new prior and update it with the new information:

$$ \pi(\theta \vert \mathbf x, y) \propto \pi(\theta \vert \mathbf x)\, p(y \vert \theta). $$
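This modularity is easy to check numerically; the following sketch (a toy Bernoulli model with a flat prior, my own illustrative example) processes the observations one at a time in a random order and recovers the same posterior as a single batch update:

```python
import numpy as np

theta = np.linspace(1e-6, 1 - 1e-6, 2000)
prior = np.ones_like(theta)              # flat prior on [0, 1]
x = np.array([1, 0, 1, 1, 0, 1])

def lik(xi, theta):                      # Bernoulli likelihood of one flip
    return theta if xi == 1 else 1 - theta

# Batch update: multiply all likelihood terms at once.
batch = prior * np.prod([lik(xi, theta) for xi in x], axis=0)
batch /= batch.sum()

# Sequential update: yesterday's posterior is today's prior.
seq = prior.copy()
for xi in np.random.permutation(x):      # any processing order works
    seq = seq * lik(xi, theta)
seq /= seq.sum()

print(np.max(np.abs(batch - seq)))       # ~0 regardless of the permutation
```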
The posterior obtained from Bayes' theorem satisfies some nice properties; first, there are properties that are intrinsic to Bayes' update rule:
- coherence: in whatever order you process the observations $$x_i$$, you end up with the same posterior belief This property may be given different names in other works, as for instance Bayesian additivity..
- likelihood principle: as the observations $$\mathbf x$$ only appear through the likelihood in the definition of the posterior, Bayes' theorem satisfies the likelihood principle, which says that (from Wikipedia): "given a statistical model, all the evidence in a sample relevant to model parameters is contained in the likelihood function."
If you further assume that the observations $$ \mathbf x $$ are generated from the model itself for some true parameter value $$ \theta_0 $$ (ie $$ x_i \sim p(\cdot \vert \theta_0) $$, so that the model is well specified), additional properties hold:
- according to information theory arguments in Zellner , Bayes' theorem is the optimal way to process information, in the sense that its use does not discard any information present in the data about the parameters.
- an analogue of the Central Limit Theorem in frequentist statistics applies to Bayes' posterior, called the Bernstein-von Mises theorem. It goes like this: as the number of observations $$n$$ goes to infinity, the posterior converges to a normal distribution which is centered at $$\theta_0$$ (and whose variance decreases as $$1/n$$ and is asymptotically equivalent to the sampling variance of the maximum likelihood estimator of $$\theta$$) The theorem of course holds under some regularity conditions, such as $\pi(\theta_0)>0$; click here for an introduction and some further references.. That means that Bayesian inference is, asymptotically, equivalent to maximum likelihood estimation, as it recovers the exact parameter value $$\theta_0$$.
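The Bernstein-von Mises behaviour can be glimpsed in the conjugate normal-normal model, where the posterior is exactly normal and its closed form makes the $$1/n$$ variance decay explicit (a toy sketch of my own; all numerical values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
theta0, sigma = 1.5, 1.0                 # true parameter and known noise scale
mu0, tau0 = 0.0, 10.0                    # vague N(mu0, tau0^2) prior

def posterior(x):
    # Conjugate normal-normal update: the posterior is normal in closed form.
    n = len(x)
    var = 1.0 / (1.0 / tau0**2 + n / sigma**2)
    mean = var * (mu0 / tau0**2 + x.sum() / sigma**2)
    return mean, var

for n in [10, 100, 10000]:
    x = rng.normal(theta0, sigma, size=n)
    mean, var = posterior(x)
    print(n, mean, var)   # mean -> theta0, var ~ sigma^2 / n
```

As $$n$$ grows, the prior's contribution is swamped and the posterior concentrates around $$\theta_0$$ at the same $$1/n$$ rate as the sampling variance of the MLE.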
To recap what we said above, the main motivation behind Bayes' theorem is that, if the model is well specified, the posterior asymptotically concentrates on the exact parameter value $$ \theta_0 $$, and no information in the data is wasted along the way.
However, these arguments vacillate when the model is not an accurate representation of the distribution the data are actually generated from, say $$ g(\cdot) $$ (the data generating process, DGP). In fact:
- Zellner's argument simply does not hold anymore .
- a Bernstein-von Mises result still holds, but now the asymptotic normal distribution will be centered at the parameter value $$\theta^\star = \arg \min_{\theta} D_{\text{KL}}(g(\cdot)\,\Vert\, p(\cdot\vert\theta)),$$ where $$D_{\text{KL}}$$ is the Kullback-Leibler (KL) divergence Notice that $\theta^\star$ is also the parameter value to which the frequentist maximum likelihood estimate converges..
More generally, we need to ask what the aim of Bayesian inference is in such a misspecified setting. In fact, with Bayesian inference we do not learn about the true parameter value anymore, as such a thing does not exist. Rather, the standard Bayes' posterior learns about the parameter value for which the misspecified model is as close as possible to the DGP in the specific sense of the KL divergence.
That may still be what you want to do in some cases, but I argue below that the KL divergence may behave poorly in some misspecified settings; instead, some generalized Bayesian approaches allow the user to choose the way in which the probabilistic model approximates the DGP (for instance replacing the KL with other divergences ). Others instead dispense with a probabilistic model altogether.
Learning $$ \theta^\star $$ can be undesirable even under mild misspecification. Consider for instance a DGP of the form $$ g(\cdot) = 0.9\, p(\cdot \vert \theta_0) + 0.1\, q(\cdot) $$:
this means that 90% of observations are generated from the model for a given parameter value $$ \theta_0 $$, while the remaining 10% come from a contaminating distribution $$ q $$ (ie they are outliers). Even in this mild case, $$ \theta^\star $$ can be far from $$ \theta_0 $$, as the KL divergence is very sensitive to the regions where the contamination puts mass.
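As a toy illustration of this contamination setting (assuming, purely for illustration, a $$N(\theta, 1)$$ model and gross outliers centered at 20): for a normal location model, $$\theta^\star$$ is the mean of the DGP, which the contamination drags away from $$\theta_0$$.

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 0.0
n = 100_000

# epsilon-contamination: 90% of draws come from the model N(theta0, 1),
# 10% come from a gross outlier distribution N(20, 1).
is_outlier = rng.random(n) < 0.1
x = np.where(is_outlier, rng.normal(20.0, 1.0, n), rng.normal(theta0, 1.0, n))

# For a N(theta, 1) model, theta* (the KL minimizer, and the limit of both
# the MLE and the Bayes posterior) is the mean of the DGP: 0.9*0 + 0.1*20 = 2.
print(x.mean())   # ~2.0, far from theta0 = 0
```

Even though 90% of the data perfectly follow the model at $$\theta_0 = 0$$, the posterior will concentrate around 2.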
With more general misspecifications, things may go wrong in different ways.
Finally, Bayesian inference implicitly assumes that the prior is a good representation of previous knowledge, and that enough computational power is available to sample from the posterior (which, in some cases, is very expensive to do) .
Some of the generalized Bayesian approaches explicitly include the fact that these assumptions may be broken in the definition of the inference strategy, as we will see below.
In the following, I review extensions of Bayes' theorem which are better justified and may perform better in a misspecified setting (or can even be used without a probabilistic model); some will also directly tackle the issues regarding the prior and computational power.
Across these works, a recurrent underlying question is: what is the actual aim of inference when we cannot specify the model correctly?
Disclaimer: this overview is non-exhaustive and strongly biased due to papers I've read and my personal research activity.
A first idea to tackle (mild) model misspecification is to reduce the importance of the likelihood term in the definition of Bayes' posterior. This can be done by raising the likelihood function to a power $$ w \in (0, 1) $$:

$$ \pi_w(\theta \vert x) \propto \pi(\theta)\, p(x \vert \theta)^w. $$

This idea has been discussed in detail in (and in previous papers referred to there). The authors of also propose a way to automatically tune $$ w $$.
With respect to the standard Bayes' posterior, this strategy does not change the parameter value which is learned about, but only the speed of learning (ie the rate of concentration of the posterior distribution as the number of observations increases). Also, it still satisfies the likelihood principle and coherence.
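Here is a small grid-based sketch of the power posterior (a toy normal model with flat prior, of my own making; the values of $$w$$ are illustrative), showing that tempering leaves the posterior centered in the same place while slowing its concentration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(1.0, 1.0, size=200)       # N(theta, 1) model, well specified

theta = np.linspace(-1, 3, 4000)
prior = np.ones_like(theta)              # flat prior
loglik = np.array([-0.5 * ((x - t) ** 2).sum() for t in theta])

def tempered_posterior(w):
    # pi_w(theta | x) ∝ pi(theta) * p(x | theta)^w
    unnorm = prior * np.exp(w * (loglik - loglik.max()))
    return unnorm / unnorm.sum()

for w in [1.0, 0.5, 0.1]:
    post = tempered_posterior(w)
    mean = (theta * post).sum()
    sd = np.sqrt(((theta - mean) ** 2 * post).sum())
    print(w, mean, sd)   # same center; sd grows as w shrinks (~1/sqrt(w*n))
```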
A larger leap is taken in . There, the authors replace the likelihood term with the exponential of a loss function:
\begin{equation}\label{Eq:bissiri} \pi_{\ell,w}(\theta\vert x) \propto \pi(\theta) \exp\{ - w \cdot \ell (\theta,x)\}, \end{equation}

where $$ \ell(\theta, x) $$ is a loss function connecting a parameter value $$ \theta $$ with the observations (for $$ n $$ independent observations, it is usually taken to be additive: $$ \ell(\theta, \mathbf x) = \sum_{i=1}^n \ell(\theta, x_i) $$), and
where $$ w > 0 $$ is a learning rate controlling the relative weight of the loss with respect to the prior.
In , they show how the update rule above (Eq. \eqref{Eq:bissiri}) can be derived axiomatically from the task of learning about the parameter value minimizing the expected loss, by assuming the observed data to be independent of the prior and inference to be coherent in the sense defined above (ie invariant to the order in which the observations were obtained).
An advantage of this approach is that you do not need to specify a probabilistic model (ie a likelihood); inference can be performed on "parameter" values that are solely defined through the loss function.
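For instance, the median of the DGP is defined through the absolute loss $$ \ell(\theta, x) = \vert x - \theta \vert $$, with no likelihood anywhere in sight. Below is a grid sketch of the corresponding loss-based (Gibbs) posterior on hypothetical contaminated data (prior, learning rate and data are illustrative choices of mine):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, size=500)
x[:50] += 15.0                     # contaminate 10% of the sample

theta = np.linspace(-2, 4, 4000)
prior = np.exp(-0.5 * (theta / 5.0) ** 2)     # vague N(0, 25) prior
w = 1.0

# Gibbs posterior with the absolute loss: the target "parameter" is the
# median of the DGP, with no likelihood specified anywhere.
loss = np.array([np.abs(x - t).sum() for t in theta])
unnorm = prior * np.exp(-w * (loss - loss.min()))
post = unnorm / unnorm.sum()

post_mean = (theta * post).sum()
print(post_mean)   # close to the sample median, not the outlier-pulled mean
```

Since the median is robust, the resulting belief distribution barely notices the outliers that would drag a Gaussian-likelihood posterior towards 1.5.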
Of course, the power likelihood posterior is obtained as a special case by setting $$ \ell(\theta, \mathbf x) = -\sum_{i=1}^n \log p(x_i \vert \theta) $$.
However, with a generic loss, setting the value of $$ w $$ is less straightforward, as the loss does not have a natural scale with respect to the prior; several strategies have been proposed for this task.
Finally, notice that a Bernstein-von Mises result still holds for this more general posterior (of course under some regularity conditions). Some very general and applicable formulations are given in .
So, the loss-based approach in Eq. \eqref{Eq:bissiri} works without a probabilistic model. But what if you have a misspecified model which you believe carries some meaning about the process you are studying?
As mentioned above, using the original Bayes' posterior may not be a wise choice. In order to perform inference in a sound way, an idea is to use the loss-based approach and express the loss through a scoring rule $$ S $$:

$$ \ell(\theta, x) = S(p(\cdot \vert \theta), x), $$

where $$ S(p(\cdot \vert \theta), x) $$ evaluates the match between the probabilistic model $$ p(\cdot \vert \theta) $$ and the observation $$ x $$.
This approach has been investigated in some recent works, amongst which , , and . In this way, you learn about the parameter value minimizing the expected scoring rule over the data generating process $$ g $$.
For some scoring rules $$ S $$, the resulting posterior is more robust to outliers than the standard Bayes' posterior; moreover, some scoring rules can be estimated using simulations from the model alone, which makes this posterior usable even when the likelihood cannot be evaluated.
Clearly, this still gives a coherent update, but the likelihood principle is not satisfied anymore (as the posterior depends on the whole model $$ p(\cdot\vert\theta) $$, not just on the likelihood evaluated at the observations); however, the likelihood principle itself does not seem very reasonable if the model is misspecified in the first place.
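The following is a rough grid sketch of a scoring-rule posterior, assuming the kernel score with a Gaussian kernel (the bandwidth, learning rate, data and grid are arbitrary illustrative choices of mine, and a real implementation would use MCMC rather than a grid). The score is estimated purely from model simulations, and the bounded kernel keeps gross outliers from dominating:

```python
import numpy as np

rng = np.random.default_rng(4)
# 90% inliers from N(0, 1), 10% gross outliers near 30.
x = np.concatenate([rng.normal(0.0, 1.0, 450), rng.normal(30.0, 1.0, 50)])

theta = np.linspace(-2, 4, 121)       # grid; flat prior implied
w, m = 1.0, 500                       # learning rate and simulations per theta

def k(a, b, g=1.0):                   # Gaussian kernel, shape (len(a), len(b))
    return np.exp(-((a[:, None] - b[None, :]) ** 2) / (2 * g**2))

# Kernel-score posterior: S(p(.|theta), y) = 0.5 E[k(X,X')] - E[k(X,y)],
# estimated purely from simulations of the model N(theta, 1).
logpost = np.empty_like(theta)
for j, t in enumerate(theta):
    sims = rng.normal(t, 1.0, m)
    s = 0.5 * k(sims, sims).mean() - k(sims, x).mean(axis=0)  # score per obs
    logpost[j] = -w * s.sum()
post = np.exp(logpost - logpost.max())
post /= post.sum()

print((theta * post).sum())  # near 0: the bounded kernel tames the outliers
```

Note that the likelihood of the model was never evaluated, only sampled from; the same 10% contamination that pulled the KL minimizer far from $$\theta_0$$ barely moves this posterior.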
One more step towards generality and we find the approach presented in .
The idea is to start from the variational formulation of Bayes' posterior, which is attributed to Donsker and Varadhan ; say we have $$ n $$ observations $$ \mathbf x = (x_1, \ldots, x_n) $$; then:
$$ \pi(\cdot\vert\mathbf x) = \underset{q \in \mathcal P (\Theta)}{\operatorname{argmin}}\left\{\mathbb{E}_{q(\theta)}\left[-\sum_{i=1}^{n} \log p\left(x_{i} \mid \theta\right)\right]+D_{\text{KL}}(q\,\Vert\, \pi)\right\}, $$

where $$ \mathcal P(\Theta) $$ denotes the set of all probability distributions over the parameter space $$ \Theta $$.
This formulation leads to an optimization-centric view of Bayesian inference, as the authors of put it; additionally, the loss-based posterior in Eq. \eqref{Eq:bissiri} can be obtained in a similar fashion by just replacing the negative log-likelihood with the generic loss function $$ \ell $$:
$$ \pi_{\ell,w}(\cdot\vert\mathbf x) = \underset{q \in \mathcal P (\Theta)}{\operatorname{argmin}}\left\{\mathbb{E}_{q(\theta)}\left[w \sum_{i=1}^{n} \ell(\theta, x_i) \right]+D_{\text{KL}}(q\,\Vert\, \pi)\right\}. $$
This formulation opens the door to additional extensions: by changing the KL divergence into a generic divergence $$ D $$ and restricting the optimization to a family of distributions $$ \Pi \subseteq \mathcal P(\Theta) $$, you obtain the Rule of Three (RoT):
$$ \pi_{\ell, D, \Pi}(\cdot\vert\mathbf x) = \underset{q \in \Pi }{\operatorname{argmin}}\left\{\mathbb{E}_{q(\theta)}\left[w \sum_{i=1}^{n} \ell(\theta, x_i) \right]+D(q\,\Vert\, \pi)\right\} \stackrel{\text{def}}{=} P(\ell, D, \Pi). $$
However, it is in general impossible to obtain a closed form solution for the above problem. Additionally, you do not have a coherent update anymore.
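Still, the objective can be evaluated directly in simple cases. The sketch below (a conjugate normal model with known variance; all numbers are illustrative choices of mine) minimizes the variational objective over a Gaussian family by brute-force grid search and recovers the conjugate posterior, as expected when the family contains the exact posterior:

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(1.0, 1.0, size=50)        # N(theta, 1) model, 50 observations
mu0, tau0 = 0.0, 2.0                     # N(mu0, tau0^2) prior

def objective(m, s):
    # E_q[-sum_i log p(x_i | theta)] + KL(q || pi) for q = N(m, s^2);
    # both terms have closed forms for this conjugate model (up to a constant).
    nll = 0.5 * ((x - m) ** 2).sum() + len(x) * s**2 / 2
    kl = np.log(tau0 / s) + (s**2 + (m - mu0) ** 2) / (2 * tau0**2) - 0.5
    return nll + kl

# Brute-force search over the variational family's parameters.
ms = np.linspace(0.0, 2.0, 401)
ss = np.linspace(0.05, 0.5, 181)
vals = np.array([[objective(m, s) for s in ss] for m in ms])
i, j = np.unravel_index(vals.argmin(), vals.shape)

# Conjugate closed form: the optimizer over all of P(Theta) is the posterior.
var_exact = 1 / (1 / tau0**2 + len(x))
mean_exact = var_exact * (mu0 / tau0**2 + x.sum())
print(ms[i], ss[j], "vs", mean_exact, np.sqrt(var_exact))
```

Restricting $$\Pi$$ to a parametric family like this is exactly what makes the optimization tractable when no closed form exists.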
In the paper , they show how the above problem generalizes standard Variational Inference (VI, in which you select the distribution in a given class which minimizes the KL divergence from the exact posterior), hence the name of the approach (Generalized Variational Inference). The RoT thus unifies under the same hat variational inference, loss-based posteriors and the standard Bayes' posterior. It is important to notice how, instead, VI approaches which employ divergences different from the KL (which they call divergence VI) are not included This means that standard VI (with the KL) is theoretically optimal with respect to divergence VI, as the former directly approximates the exact posterior. That however is restricted to the case in which the likelihood and prior are well specified, or to the case in which the variational family $\Pi$ is rich enough to contain the exact posterior..
The rest of the work discusses in great detail (more than 100 pages!) the properties of the resulting posterior. The key idea is that with this approach you can not only perform inference with a generic loss (as was already possible with the loss-based approach in Eq. \eqref{Eq:bissiri}), but also choose which properties of the prior to consider by picking the right divergence $$ D $$, and control the computational budget through the choice of the variational family $$ \Pi $$.
They also provide an axiomatic derivation of the RoT; interestingly, among the required axioms is a generalized likelihood principle, which states that all the information on $$ \theta $$ provided by the data is contained in the loss function $$ \ell(\theta, \mathbf x) $$.
I have gone up the ladder of generality, from the standard Bayes' posterior to the Generalized Variational Inference formulation; at each step, some features of the standard Bayes' posterior are lost and some others are retained, but a larger set of tools becomes available. Those may work better for instance with misspecified models, or in case you do not even want to specify a full probabilistic model, or finally if a full Bayesian analysis is too costly and you want to resort to a variational approach instead.
The following Venn diagram represents the relation between the different techniques presented here:
I have followed here one possible generalization route, but that is by no means the only possibility, and my overview does not include all methods which are a superset of the standard Bayes' posterior. I have for instance excluded the approach taken in , which puts the focus on the predictive distribution rather than the likelihood, or the one in , which uses PAC-Bayesian bounds to define distributions which do not concentrate on a single parameter value in the limit of infinite data if the model is misspecified, thus obtaining better predictive performance.
Still, I think the approaches reviewed here are thought-provoking and allow us to see Bayesian inference in a different light. I also feel they bring the original axiomatic theory closer to a pragmatic toolbox.
Of course, I do not think that standard Bayesian inference has to be thrown in the garbage; there is no need to list the practical results that have been made possible by it. Still, it is good to know there are ways around (or beyond) it in case a standard Bayesian analysis doesn't work. Additionally, some of the techniques developed to work with standard posteriors (for instance MCMC and VI) can be ported to generalized posteriors.
I am sure there will be other extensions proposed in the future, so keep an eye out if you are interested!