\chapter{Fit of histograms}
\label{ch:fit}
This chapter illustrates how the fits in \autoref{ch:anal} were done. They will
turn out to be almost exactly standard least squares fits of histograms, but
we will give a Bayesian interpretation to all the procedures. If you do not
favor Bayesian methods, just do not think about it. To actually do the fits, we
used the Python package \texttt{lsqfit} \cite{lepage2021}, which we warmly
recommend.
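
For reference, this is a minimal sketch of what a fit with \texttt{lsqfit}
looks like; the model, data and priors below are made up for illustration and
are not those of the analysis:
\begin{verbatim}
import numpy as np
import gvar as gv
import lsqfit

# toy data: y ~ a * exp(-b * x), with Gaussian errors
x = np.array([1., 2., 3., 4., 5.])
y = gv.gvar(['0.61(5)', '0.37(5)', '0.23(5)', '0.14(5)', '0.08(5)'])

def fcn(x, p):
    """Model function; the parameters enter through the dictionary p."""
    return p['a'] * np.exp(-p['b'] * x)

# Gaussian priors on the parameters
prior = gv.gvar(dict(a='1.0(5)', b='0.5(5)'))

fit = lsqfit.nonlinear_fit(data=(x, y), fcn=fcn, prior=prior)
print(fit)         # chi2/dof, parameter estimates, correlations
print(fit.p['b'])  # Gaussian approximation of the posterior of b
\end{verbatim}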
\section{Bayesian least squares}
Bayesian inference is based on applying Bayes' theorem to get the probability
distribution of the parameters conditional on the observed data. Call $\theta$
the parameters and $y$ the data, then it reads
%
\begin{equation}
p(\theta|y) = \frac {p(y|\theta) p(\theta)} {p(y)}.
\end{equation}
The term $p(y|\theta)$ is the likelihood, which describes how the data depend
on the parameters. $p(y)$ is the unconditional probability of the data;
whatever that means, we can compute it as
$\int\mathrm d\theta\,p(y|\theta)p(\theta)$.
$p(\theta)$ is the unconditional probability distribution of the parameters. It
is called \emph{prior} because it encodes what one knows about the parameters
before seeing the data. By the same logic, $p(\theta|y)$ is called the
\emph{posterior}.

Least squares is a general procedure to produce an estimator, but it can also
be interpreted as a maximum likelihood fit when the likelihood is Gaussian. We
can do the same in the Bayesian case if the prior is Gaussian too. We switch to
logarithms and drop addends which do not depend on the parameters:
%
\begin{align}
\log p(\mathbf y|\boldsymbol\theta) &=
-\sum_{i=1}^n \log\sigma_i
-\frac 12 \sum_{i=1}^n
\left( \frac {y_i - \mu_i(\boldsymbol\theta)} {\sigma_i} \right)^2, \\
%
\log p(\boldsymbol\theta) &=
-\sum_{j=1}^k \log\sigma^\theta_j
-\frac 12 \sum_{j=1}^k
\left( \frac {\mu^\theta_j - \theta_j} {\sigma^\theta_j} \right)^2.
\end{align}
Now these two terms have to be summed to obtain the log posterior. If we extend
the $\mathbf y$, $\boldsymbol\mu$ and~$\boldsymbol\sigma$ vectors, we can see
the two sums together as a single sum of squares:
%
\begin{align}
y_{n+j} &\equiv \mu^\theta_j, \\
%
\mu_{n+j}(\boldsymbol\theta) &\equiv \theta_j, \\
%
\sigma_{n+j} &\equiv \sigma^\theta_j, \qquad j = 1, \ldots, k, \\
%
\log p(\boldsymbol\theta|\mathbf y) &=
-\sum_{i=1}^{n+k} \log\sigma_i
- \frac 12 \sum_{i=1}^{n+k}
\left( \frac {y_i - \mu_i(\boldsymbol\theta)} {\sigma_i} \right)^2 \equiv \\
%
&\equiv -\sum_{i=1}^{n+k} \log\sigma_i - \frac12 \chi^2. \label{eq:chi2}
\end{align}
So formally the prior is equivalent to additional datapoints, one for each
parameter. We have not dropped the sum with the $\log\sigma$ although it does
not depend on $\boldsymbol\theta$; it will be needed later. Note that the
degrees of freedom of the $\chi^2$ now are $(n+k)-k = n$, instead of the usual
$n-k$ without prior.
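
As a concrete illustration of the ``prior as additional datapoints'' view, the
following sketch builds the augmented residual vector by hand and minimizes the
resulting $\chi^2$ with plain \texttt{scipy}; data, model and priors are made
up:
\begin{verbatim}
import numpy as np
from scipy.optimize import least_squares

# toy data and model
x = np.array([1., 2., 3., 4., 5.])
y = np.array([0.62, 0.35, 0.25, 0.13, 0.09])
sigma = np.full_like(y, 0.05)

def mu(theta):
    a, b = theta
    return a * np.exp(-b * x)

# Gaussian priors: one extra "datapoint" per parameter
prior_mu = np.array([1.0, 0.5])
prior_sigma = np.array([0.5, 0.5])

def residuals(theta):
    r_data = (y - mu(theta)) / sigma
    r_prior = (prior_mu - theta) / prior_sigma
    return np.concatenate([r_data, r_prior])    # n + k residuals

res = least_squares(residuals, x0=prior_mu)
chi2 = 2 * res.cost                             # res.cost is chi2 / 2
print(res.x, chi2, 'dof =', len(y))             # dof = (n + k) - k = n
\end{verbatim}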

In the analysis we said that we put uniform priors on the parameters defined in
the interval $(0,1)$. This is implemented by fitting a transformed parameter
with a standard Gaussian prior, with zero mean and unit variance, and applying
the Gaussian cumulative distribution function (cdf) as the inverse
transformation. For example, if we want to put a uniform prior on $\theta_j$,
we fit the transformed parameter $\theta_j'$, with
%
\begin{align}
\mu^{\theta'}_j &= 0, \\
%
\sigma^{\theta'}_j &= 1, \\
%
\theta_j(\theta'_j) &= \Phi(\theta'_j)
\equiv \int_{-\infty}^{\theta'_j} \mathrm du\,
\frac {e^{-u^2/2}} {\sqrt{2\pi}}.
\end{align}
Bayesianly, this is equivalent to omitting the squared term for the prior and
putting bounds on the parameter value, without transforming; however, the
transformed version is more convenient for the fitting routine. By the same
logic, when we use the logarithm to map a positive parameter to the real line,
the prior on the untransformed parameter is a log-Gaussian distribution.
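
In code, this amounts to mapping the fitted parameters through the Gaussian cdf
inside the residual function. A sketch in the style of the previous one, for a
hypothetical model \texttt{mu01} whose parameters all live in $(0,1)$
(\texttt{ndtr} is the Gaussian cdf provided by SciPy):
\begin{verbatim}
from scipy.special import ndtr   # Gaussian cdf, i.e. Phi

def residuals(theta_prime):
    theta = ndtr(theta_prime)              # each theta_j uniform in (0, 1) ...
    r_data = (y - mu01(theta)) / sigma
    r_prior = (theta_prime - 0.0) / 1.0    # ... when theta'_j has a N(0, 1) prior
    return np.concatenate([r_data, r_prior])
\end{verbatim}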

The least squares estimator would now be obtained by minimizing the $\chi^2$,
while a Bayesian cares about the posterior as a distribution. If we approximate
the posterior as a Gaussian, we can formally do the same calculations. To
compute the Gaussian approximation, we expand the logarithm of the distribution
to second order around the maximum:
%
\begin{align}
\log p(\boldsymbol\theta|\mathbf y) &\approx
-\frac 12 (\boldsymbol\theta-\boldsymbol\theta_0)^\top
V^{-1} (\boldsymbol\theta-\boldsymbol\theta_0), \\
%
\boldsymbol\theta_0
&= \operatorname*{arg\,min}_{\boldsymbol\theta} \chi^2, \\
%
V^{-1}_{jk} &= \frac12 \left.
\frac {\partial^2 \chi^2} {\partial\theta_j \partial\theta_k}
\right|_{\boldsymbol\theta_0}. \label{eq:lsqpm}
\end{align}
Let us now carry the calculation of the precision matrix further, to discuss
an additional issue:
%
\begin{align}
\frac{\partial\chi^2}{\partial\theta_j} &=
-2 \sum_i \frac {y_i - \mu_i(\boldsymbol\theta)} {\sigma_i^2}
\frac {\partial\mu_i} {\partial\theta_j}, \\
%
\frac {\partial^2 \chi^2} {\partial\theta_j \partial\theta_k} &=
2 \sum_i \frac1{\sigma_i^2} \left(
\frac {\partial\mu_i} {\partial\theta_j}
\frac {\partial\mu_i} {\partial\theta_k}
- (y_i - \mu_i(\boldsymbol\theta))
\frac {\partial^2 \mu_i} {\partial\theta_j \partial\theta_k}
\right). \label{eq:lsqhess}
\end{align}
In standard least squares, the precision matrix of the estimator is estimated
with the above expression, but with the second-derivative term dropped. This is
done partly for convenience, but also because the expectation of the residuals
$y_i - \mu_i(\boldsymbol\theta_0(\mathbf y))$ is approximately zero. A
frequentist would like to compute this with the true value of the parameters,
but has to content herself with the least squares estimate, getting something
which on average is the true precision matrix. So any term which is zero on
average can be removed.

For a Bayesian, instead, there is no ideal error corresponding to an ideal
parameter value; there is only the covariance of the posterior obtained with
the particular values at hand. Different data, different posterior. However, in
our fits the residuals term is dropped anyway, as we said, simply because this
is convenient for the implementation. In practice, the difference should be
small.
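
In the toy sketch above, this covariance, with the second-derivative term
dropped, can be read off the Jacobian returned by
\texttt{scipy.optimize.least\_squares}:
\begin{verbatim}
# J is the Jacobian of the residual vector r_i = (y_i - mu_i) / sigma_i,
# so J.T @ J is one half of the chi2 Hessian without the residuals term
J = res.jac
covariance = np.linalg.inv(J.T @ J)    # Gaussian approximation of the posterior
errors = np.sqrt(np.diag(covariance))
\end{verbatim}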
\section{Least squares fit of a histogram}
The standard model for a histogram is independent Poisson distributions for
the bin counts. Let $c_i$ be the count in bin~$i$, $f_i(\boldsymbol\theta)$ the
normalized model distribution over the bins, and $M$ a parameter for the mean
of the total (Poisson) number of samples. Then the likelihood is
%
\begin{equation}
P(\mathbf c|M,\boldsymbol\theta) =
\prod_i P(c_i|M,\boldsymbol\theta) =
\prod_i \operatorname{poisson}(c_i;Mf_i(\boldsymbol\theta)),
\end{equation}
%
where $\operatorname{poisson}(\cdot;\mu)$ is a Poisson probability mass
function (pmf) with mean~$\mu$.
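
As a sketch, this likelihood in code, with a hypothetical function \texttt{f}
returning the normalized bin probabilities:
\begin{verbatim}
import numpy as np
from scipy.stats import poisson

def log_likelihood(M, theta, counts, f):
    """Sum over bins of the Poisson log pmf with mean M * f_i(theta)."""
    return np.sum(poisson.logpmf(counts, M * f(theta)))
\end{verbatim}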

We do not care about $M$, only about $\boldsymbol\theta$, so we marginalize
the posterior over $M$:
%
\begin{align}
p(\boldsymbol\theta|\mathbf c)
&= \int \mathrm dM\, p(M,\boldsymbol\theta|\mathbf c) = \\
%
&= \int \mathrm dM\,
\frac {P(\mathbf c|M,\boldsymbol\theta) p(M,\boldsymbol\theta)}
{P(\mathbf c)}.
\end{align}
Now we assume that a priori $M$ is independent of $\boldsymbol\theta$, i.e.,
$p(M,\boldsymbol\theta) = p(M)p(\boldsymbol\theta)$. Then, continuing the
calculation:
%
\begin{align}
p(\boldsymbol\theta|\mathbf c)
&= \frac {p(\boldsymbol\theta)} {P(\mathbf c)}
\int \mathrm dM\, p(M)
\prod_i \operatorname{poisson}(c_i;Mf_i(\boldsymbol\theta)) = \\
%
&= \frac {p(\boldsymbol\theta)} {P(\mathbf c)}
\int \mathrm dM\, p(M)
\prod_i
e^{-Mf_i(\boldsymbol\theta)}
\frac {(Mf_i(\boldsymbol\theta))^{c_i}} {c_i!} = \\
%
&= \frac {p(\boldsymbol\theta)} {P(\mathbf c)}
\int \mathrm dM\, p(M)
e^{-M \sum_i f_i(\boldsymbol\theta)}
M^{\sum_i c_i}
\prod_i
\frac {f_i(\boldsymbol\theta)^{c_i}} {c_i!}
\end{align}
%
Since $f_i$ is normalized, the sum over it yields~1, and we call $N$ the total
count, so
%
\begin{align}
p(\boldsymbol\theta|\mathbf c)
&= \frac {p(\boldsymbol\theta)} {P(\mathbf c)}
\prod_i
\frac {f_i(\boldsymbol\theta)^{c_i}} {c_i!}
\int \mathrm dM\, p(M) e^{-M} M^N.
\end{align}
%
The integral does not depend on $\boldsymbol\theta$, so we call it $I(N)$ and
do not care about its value. Now we take the logarithm:
%
\begin{align}
\log p(\boldsymbol\theta|\mathbf c) &=
-\log P(\mathbf c) + \log I(N) - \sum_i \log c_i!
+ \log p(\boldsymbol\theta) + \sum_i c_i \log f_i(\boldsymbol\theta).
\end{align}
This is the exact expression for the log posterior, which we want to compare
with least squares; the latter yields
%
\begin{align}
Q(\boldsymbol\theta) &=
\log p(\boldsymbol\theta)
-\frac 12
\sum_i \frac {(c_i - N f_i(\boldsymbol\theta))^2} {c_i}.
\end{align}
Note the particular flavor of $\chi^2$ we have picked: there is no parameter
for the normalization, which is fixed to the actual total count $N$, and the
denominator is the observed count $c_i$.
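
The same flavor of $\chi^2$ in code, to be compared with the Poisson log
likelihood sketched earlier (again with a hypothetical normalized model
\texttt{f}):
\begin{verbatim}
import numpy as np

def hist_chi2(theta, counts, f):
    """chi2 with the normalization fixed to the observed total count
    and the observed counts in the denominator."""
    N = counts.sum()
    return np.sum((counts - N * f(theta)) ** 2 / counts)
\end{verbatim}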

To do the comparison, we expand both expressions to second order, ignoring the
shared prior term, which is already matched:
%
\begin{align}
L(\boldsymbol\theta) &\equiv \log p(\boldsymbol\theta|\mathbf c), \\
%
\frac {\partial L} {\partial\theta_j} &=
\sum_i \frac {c_i} {f_i(\boldsymbol\theta)}
\frac {\partial f_i} {\partial \theta_j}, \\
%
\frac {\partial^2 L} {\partial\theta_j \partial\theta_k} &=
\sum_i c_i \left[
-\frac 1 {f_i(\boldsymbol\theta)^2}
\frac {\partial f_i} {\partial \theta_j}
\frac {\partial f_i} {\partial \theta_k}
+ \frac 1 {f_i(\boldsymbol\theta)}
\frac {\partial^2 f_i} {\partial\theta_j \partial\theta_k}
\right].
\end{align}
For $Q$ we start from \eqref{eq:lsqhess}, recalling that $Q$ contains
$-\chi^2/2$, and obtain
%
\begin{align}
\frac {\partial^2 Q} {\partial\theta_j \partial\theta_k} &=
-N^2 \sum_i
\frac 1 {c_i} \left[
\frac {\partial f_i} {\partial \theta_j}
\frac {\partial f_i} {\partial \theta_k}
- \left( \frac{c_i} N - f_i(\boldsymbol\theta) \right)
\frac {\partial^2 f_i} {\partial\theta_j \partial\theta_k}
\right]. \label{eq:Qjk}
\end{align}
We do not discuss the different results for the maximum, because it is well
known that with at least five counts per bin the least squares approximation is
good enough. The covariance, instead, needs a discussion. We now manipulate
$\partial_{jk} L$ to show that it is similar to $\partial_{jk} Q$. First, we
approximate the observed counts with the model, $c_i \approx N
f_i(\boldsymbol\theta)$:
%
\begin{align}
\frac {\partial^2 L} {\partial\theta_j \partial\theta_k} &=
-N^2 \sum_i \frac {c_i} {N^2 f_i(\boldsymbol\theta)^2} \left[
\frac {\partial f_i} {\partial \theta_j}
\frac {\partial f_i} {\partial \theta_k}
- f_i(\boldsymbol\theta)
\frac {\partial^2 f_i} {\partial\theta_j \partial\theta_k}
\right] \approx \\
%
&\approx
-N^2 \sum_i \frac 1 {c_i} \left[
\frac {\partial f_i} {\partial \theta_j}
\frac {\partial f_i} {\partial \theta_k}
- f_i(\boldsymbol\theta)
\frac {\partial^2 f_i} {\partial\theta_j \partial\theta_k}
\right].
\end{align}
Next, we observe that since $f$ is normalized, $\sum_i \partial_{jk} f_i = 0$,
so we can add the missing term to obtain the residuals:
%
\begin{align}
\frac {\partial^2 L} {\partial\theta_j \partial\theta_k} &\approx
-N^2 \sum_i \frac 1 {c_i} \left[
\frac {\partial f_i} {\partial \theta_j}
\frac {\partial f_i} {\partial \theta_k}
+ \left( \frac {c_i} N - f_i(\boldsymbol\theta) \right)
\frac {\partial^2 f_i} {\partial\theta_j \partial\theta_k}
\right]. \label{eq:Ljk}
\end{align}
Comparing \eqref{eq:Qjk} with \eqref{eq:Ljk}, we see that the difference is the
sign of the residuals term. As we explained above, in the least squares fit
that term is dropped altogether, so notwithstanding the Poisson model, we end
up doing the same kind of approximation we did starting from Gaussians.
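
A quick self-contained numerical check, not part of the analysis, that the
Poisson likelihood and this $\chi^2$ give nearly the same estimate on a toy
histogram with ample counts per bin:
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize

# toy one-parameter model: a mixture of two fixed normalized shapes
shape_a = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
shape_b = np.array([0.30, 0.30, 0.20, 0.10, 0.10])
f = lambda theta: theta * shape_a + (1.0 - theta) * shape_b

counts = np.random.default_rng(0).poisson(1000 * f(0.7))
N = counts.sum()

neg_log_L = lambda t: -np.sum(counts * np.log(f(t[0])))          # Poisson, exact
chi2 = lambda t: np.sum((counts - N * f(t[0])) ** 2 / counts)    # least squares

for objective in (neg_log_L, chi2):
    print(minimize(objective, x0=[0.5], bounds=[(0.0, 1.0)]).x)  # nearly equal
\end{verbatim}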
\section{The $\chi^2/\mathrm{dof}$ correction}
Starting from the Gaussian model posterior \eqref{eq:chi2}, we add a parameter
$s$ that rescales all the errors uniformly:
%
\begin{align}
\log p(\boldsymbol\theta|\mathbf y)
&= -\sum_{i=1}^{n+k} \log\sigma_i - \frac12 \chi^2 \mapsto \\
%
\mapsto \log p(\boldsymbol\theta,s|\mathbf y)
&= \log p(s) -\sum_{i=1}^{n+k} \log(s \sigma_i)
- \frac 12 \sum_{i=1}^{n+k}
\left( \frac {y_i - \mu_i(\boldsymbol\theta)} {s \sigma_i} \right)^2 = \\
%
&= \log p(s) - (n+k) \log s - \sum_{i=1}^{n+k} \log\sigma_i
- \frac 12 \frac{\chi^2}{s^2}.
\end{align}
Since $s$ is a scale parameter, it makes sense to use a uniform prior on
$\log s$, i.e., $p(s) = 1/s$, because it is scale-invariant. Now that we have
let the errors vary, we marginalize away $s$, freely dropping normalizations as
usual:
%
\begin{align}
p(\boldsymbol\theta|\mathbf y)
&= \int \mathrm ds\,
p(\boldsymbol\theta,s|\mathbf y) = \\
%
&= \int \frac {\mathrm ds} {s^{n+k+1}}
\exp\left(-\frac12 \frac {\chi^2} {s^2}\right) = \\
%
&= \frac 1 {(\chi^2)^{(n+k)/2}}.
\end{align}
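
The claimed scaling can be checked numerically; a small sketch, with made-up
values of $n+k$ and $\chi^2$:
\begin{verbatim}
import numpy as np
from scipy.integrate import quad

m = 7   # n + k
for chi2 in (3.0, 10.0, 30.0):
    integral, _ = quad(lambda s: s ** -(m + 1) * np.exp(-0.5 * chi2 / s ** 2),
                       0.0, np.inf)
    print(integral * chi2 ** (m / 2))   # constant: the integral ~ (chi2)**(-m/2)
\end{verbatim}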
Now again we take the logarithm and expand to second order:
%
\begin{align}
P(\boldsymbol\theta) &\equiv
\log p(\boldsymbol\theta|\mathbf y)
= -\frac{n+k}2 \log \chi^2, \\
%
\frac {\partial P} {\partial \theta_j}
&= -\frac{n+k}2 \frac 1{\chi^2}
\frac {\partial \chi^2} {\partial \theta_j}, \\
%
\frac {\partial^2 P} {\partial\theta_j \partial\theta_k}
&= -\frac{n+k}2 \left[
-\frac 1 {(\chi^2)^2}
\frac {\partial \chi^2} {\partial \theta_j}
\frac {\partial \chi^2} {\partial \theta_k}
+ \frac 1{\chi^2}
\frac {\partial^2 \chi^2} {\partial\theta_j \partial\theta_k}
\right]. \label{eq:Pjk}
\end{align}
At the maximum $\partial_j P = 0$, so the first term in \eqref{eq:Pjk}
vanishes, leaving
%
\begin{align}
\frac {\partial^2 P} {\partial\theta_j \partial\theta_k}
&= -\frac12 \frac 1 {\chi^2/(n+k)}
\frac {\partial^2 \chi^2} {\partial\theta_j \partial\theta_k}.
\label{eq:chi2dof}
\end{align}
%
This is to be compared with the precision matrix in \eqref{eq:lsqpm}. The
only difference is that the matrix has been divided by $\chi^2/(n+k)$. Note
that $\mathrm{dof} = n$, so we obtained a prescription different from the
conventional one. Nevertheless, in the analysis we used $\chi^2/\mathrm{dof}$
just because it is what people expect; compared to \eqref{eq:chi2dof}, this
gives errors up to \SI{25}{\percent} larger on our fits, see \autoref{tab:ct}.
Similarly, we used the p-value; since we always obtained a stark contrast
between good and bad fits, we did not bother giving a precise Bayesian
equivalent.
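
For completeness, applying the correction of \eqref{eq:chi2dof} to the toy fit
sketched in the first section is a one-liner; here \texttt{chi2},
\texttt{covariance}, \texttt{y} and \texttt{prior\_mu} are the quantities
defined in those earlier snippets:
\begin{verbatim}
# multiply the covariance by chi2 / (n + k),
# i.e. divide the precision matrix by it
scale = chi2 / (len(y) + len(prior_mu))
covariance_corrected = covariance * scale
errors_corrected = np.sqrt(np.diag(covariance_corrected))
\end{verbatim}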