\chapter{Validation and experimental planning}
\label{chap:planning}
\glsresetall
% TODO: change the chapter title to "Validation" or something similar
\chapterprecishere{%
All models are wrong, but some are useful.
\par\raggedleft--- \textup{George E. P. Box}, Robustness in Statistics}
Once we have defined what an inductive problem is, we can start to think about how to
solve it in practical terms.
In this chapter, we present the experimental planning that one can use in the data-driven
parts of a data science project. \emph{Experimental planning} in the context of data
science involves designing and organizing experiments to gather performance data
systematically in order to reach specific goals or test hypotheses.
We need to plan experiments because data science is experimental: we usually lack a
theoretical model that can predict the outcome of a given algorithm on a given dataset
--- in other words, the quality of the solution obtained with a certain approach.
Hence, we must run experiments to gather data and make inferences from it.
There is not a single way to plan experiments, but there are some common steps that can
be followed to design a good experimental plan. In this chapter, we present a
framework for experimental planning that can be used in most data science projects.
\begin{mainbox}{Chapter remarks}
\boxsubtitle{Contents}
\startcontents[chapters]
\printcontents[chapters]{}{1}{}
\vspace{1em}
\boxsubtitle{Context}
\begin{itemize}
\item \dots
\end{itemize}
\boxsubtitle{Objectives}
\begin{itemize}
\item \dots
\end{itemize}
\boxsubtitle{Takeaways}
\begin{itemize}
\item \dots
\end{itemize}
\end{mainbox}
{}
\clearpage
\input{evaluation}
\section{Elements of an experimental plan}
There are important elements that should be considered when designing an experimental
plan. These elements are:
\begin{itemize}
\item \textbf{Hypothesis}: The main question that the experiment aims to validate.
In this chapter, we address common questions in data science projects and how to
validate them.
\item \textbf{Data}: The dataset that will be used in the experiment. In
\cref{chap:data}, we address topics about collecting and organizing data.
\item \textbf{Solution search algorithm}\footnote{We use the term ``search'' because
the chosen algorithm aims at optimizing both the parameters of the data-handling
pipeline and those of the model.}: Techniques that find a solution for the task.
For us, a task includes both adjusting the appropriate data-handling
pipeline and learning the model. The theoretical basis for these techniques is in
\cref{chap:slt}, and the practical aspects of data handling are in \cref{chap:handling}.
\item \textbf{Performance measure}: The metric that will be used to evaluate the
performance of the model. Refer to \cref{chap:evaluation} for more information.
\end{itemize}
A general example of a question addressed by an experimental plan is ``What is the
probability that technique $A$ finds a model that reaches performance $X$ in terms of
metric $Y$ in the real world, given dataset $Z$ as the training set (assuming $Z$ is a
representative dataset)?''
Another example is ``Is technique $A$ better than technique $B$ at finding a model that
predicts the output, using $D$ as the training set, in terms of metric $E$?''
We consider these two cases: \emph{estimating expected performance} and \emph{comparing
algorithms}.
\section{Estimating expected performance}
\label{sec:expected-performance}
When dealing with a data-driven solution, the available data is a representation of the
real world. So, we have to make the best use of the data we have to estimate how good our
solution is expected to be in production.
Obviously, the more data we use to search for a solution, the better the solution is
expected to be. Thus, we use the whole dataset to build the solution we deploy. But which
method for preprocessing and learning should we use? And how well is that technique
expected to perform in the real world?
Let us fix a certain technique, which we call $A$. Let $M$ be the solution found
by $A$ using the whole dataset $D$. If we assess $M$ using the whole dataset $D$, the
performance $p$ we get is optimistic. This is because $M$ has been trained and tested
on the same data.
\textcolor{red}{Hold out does not solve the problem.}
To estimate better the expected performance of $M$ in production, we can use the following
experimental plan.
\paragraph{Sampling strategy} As in any statistical evaluation, we need to generate
samples of the performance of the possible solutions that $A$ is able to obtain. To do
so, we use a sampling strategy to generate datasets $D_1, D_2, \ldots$ from $D$.
\textcolor{red}{%
Since we assume the dataset $D$ is a representative sample of the real world, we must
ensure that each sampled dataset has similar characteristics. Thus, handling operators
that change the distribution, like balancing the dataset, should be applied only
afterwards.} Each
dataset is further divided into a training set and a test set, which must be disjoint.
The training set is thus used to find a solution --- $M_1, M_2, \ldots$ for each
training set --- and the test set is used to evaluate the performance --- $p_1, p_2,
\ldots$ for each test set --- of the solution. The test set emulates the real-world
scenario, where the model is used to make predictions on new data.
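For concreteness, the sketch below (in Python, assuming the scikit-learn library; the
data are random placeholders) shows one way to generate such disjoint training/test
splits from $D$.
\begin{verbatim}
import numpy as np
from sklearn.model_selection import ShuffleSplit

# X, y stand for the features and targets of the whole dataset D
# (random placeholders here, just to make the sketch runnable).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + rng.normal(size=500) > 0).astype(int)

# Draw several datasets D_1, D_2, ... as disjoint train/test index pairs.
sampler = ShuffleSplit(n_splits=10, test_size=0.2, random_state=0)
splits = list(sampler.split(X))  # each element is (train_idx, test_idx)
\end{verbatim}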
\paragraph{Solution search algorithm} The solution search algorithm $A$ involves both a
given data handling pipeline and a machine learning method. Both of them generate a
different result for each dataset $D_k$ used as input. In other words, the parameters
$\phi$ of the data handling operators are adjusted --- see \cref{chap:handling} --- and the
parameters $\theta$ of the machine learning model are adjusted --- see \cref{chap:slt}.
These parameters, $\left[\phi_k, \theta_k\right]$, are the solution $M_k$, and must be
calculated exclusively using the training set $D_{k,\text{train}}$.
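Continuing the sketch (still assuming scikit-learn; the particular preprocessing and
model are arbitrary choices), both kinds of parameters are fitted jointly, using the
training split only:
\begin{verbatim}
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# The data handling pipeline (phi: the scaler's mean and scale) and the
# machine learning model (theta: the logistic-regression coefficients)
# are both adjusted using the training split only.
search = Pipeline([
    ("handling", StandardScaler()),
    ("model", LogisticRegression()),
])

train_idx, test_idx = splits[0]                      # one sampled dataset D_k
solution = search.fit(X[train_idx], y[train_idx])    # M_k = [phi_k, theta_k]
\end{verbatim}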
\paragraph{Prediction} Once the parameters $\phi$ and $\theta$ are fixed, we apply them
to the test set $D_{k,\text{test}}$. For each sample $(x_i, y_i) \in D_{k,\text{test}}$,
we calculate the prediction $\hat{y}_i = f_{\phi,\theta}(x_i)$. The target value $y_i$ is
called the ground truth or expected outcome.
\paragraph{Performance assessment} Given a performance metric $R$, for each dataset $D_k$,
we calculate
\begin{equation*}
  p_k = R\!\left(\left[y_i\right]_i, \left[\hat{y}_i\right]_i\right)\text{.}
\end{equation*}
Note that, by definition, $p_k$ is free of \gls{leakage}, as $\left[\phi_k,
\theta_k\right]$ are found without the use of the data in $D_{k,\text{test}}$ and to
calculate $\hat{y}_i$ we use only $x_i$ (with no target $y_i$).
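In the running sketch, prediction and assessment amount to applying the fitted solution
to the held-out split and scoring it with a metric $R$ (accuracy is an arbitrary choice
here):
\begin{verbatim}
from sklearn.metrics import accuracy_score  # the metric R; an arbitrary choice

y_hat = solution.predict(X[test_idx])       # predictions use only x_i, never y_i
p_k = accuracy_score(y[test_idx], y_hat)    # p_k is leakage-free by construction
\end{verbatim}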
\paragraph{Analysing the results} Finally, we study the sampled performance values $p_1,
p_2, \ldots$ like any other statistical data. We can calculate summary statistics, like
the mean, median, variance, and standard deviation, or use visualization techniques, like
box plots or other distribution-based plots. We can also use hypothesis tests to
estimate the probability that the performance of the solution is better than a certain
threshold.
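Putting the previous steps together for all sampled datasets, still under the same
assumptions, the performance samples can then be summarized like any other statistical
data:
\begin{verbatim}
import numpy as np
import matplotlib.pyplot as plt
from sklearn.base import clone
from sklearn.metrics import accuracy_score

performances = []
for train_idx, test_idx in splits:
    model_k = clone(search).fit(X[train_idx], y[train_idx])        # [phi_k, theta_k]
    performances.append(
        accuracy_score(y[test_idx], model_k.predict(X[test_idx]))  # p_k
    )

p = np.array(performances)
print(f"mean={p.mean():.3f}  median={np.median(p):.3f}  std={p.std(ddof=1):.3f}")
plt.boxplot(p)   # distribution of the performance samples
plt.show()
\end{verbatim}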
\begin{hlbox}{Evaluation \emph{vs.} validation}
We call evaluation the process of assessing the performance of a solution using a
test set. Validation, on the other hand, is the process of interpreting or confirming
the meaning of the evaluation results, i.e., of determining the degree to which they
support the intended use of the solution (on unseen data).
\end{hlbox}
\paragraph{Interpretation} The results are not the ``real'' performance of the solution
$M$ in the real world, as that would require new data to be collected. However, we can
safely interpret the performance samples as being sampled from the same distribution as
the performance of the solution $M$.
\begin{figurebox}[label=fig:plan-single]{Experimental plan for estimating expected performance of a solution.}
\centering
\begin{tikzpicture}
\node [darkcircle] (data) at (0, 0) {Data};
\node [block] (sampling) at (0, -2) {Sampling strategy};
\path [line] (data) -- (sampling);
\foreach \i in {1, 2, 4, 5} {
\draw [dashed] (-7 + 2 * \i, -4.5) rectangle (-5.1 + 2 * \i, -3.5);
\path [line] (sampling) -- (-6.1 + 2 * \i, -3.5);
\node [smalldarkblock] (train\i) at (-6.4 + 2 * \i, -4) {Training};
\node [smallblock] (test\i) at (-5.6 + 2 * \i, -4) {Test};
\path [line] (-6.1 + 2 * \i, -4.5) -- (-6.1 + 2 * \i, -5.5);
}
\node [anchor=center] at (0, -4) {\dots};
\draw [dashed] (-5, -5.5) rectangle (4.9, -10.5);
\node [smalldarkblock, font=\small, inner sep=4pt] (train) at (-4, -7) {Training};
\node [smallblock, inner sep=4pt] (test) at (-4, -9) {Test (no target)};
\draw [dashed] (-3, -6) rectangle (3, -8);
\node [anchor=south] at (0, -6.1) {Solution search algorithm};
\node [block] (handling) at (-1.5, -7) {Data handling pipeline};
\node [block] (learning) at (1.5, -7) {Machine learning};
\node (model) at (4, -7) {%
% bracket array with \theta and \phi
$\left[
\begin{array}{c}
\phi \\
\theta \\
\end{array}
\right]$
};
\path [line] (train) -- (handling);
\path [line] (handling) -- (learning);
\path [line, dashed] (3, -7) -- (model);
\node [block] (preprocess) at (-1.5, -9) {Preprocessor};
\node [block] (prediction) at (1.5, -9) {Model};
\path [line, dashed] (handling) -- (preprocess);
\path [line, dashed] (learning) -- (prediction);
\path [line] (test) -- (preprocess);
\path [line] (preprocess) -- (prediction);
\node [smallblock, inner sep=4pt] (predicted) at (4, -9) {predictions};
\node (performance) at (4, -10) {$p$};
\path [line] (prediction) -- (predicted);
\node [smallblock, inner sep=4pt] (labels) at (-4, -10) {Test (target)};
\path [line] (labels) -- (performance);
\path [line] (predicted) -- (performance);
\node (perfs) at (-4.2, -12) {%
$\left[
\begin{array}{c}
p_1 \\
p_2 \\
\vdots \\
\end{array}
\right]$
};
\node [block] (visualization) at (-2, -12) {Visualization};
\node [block] (summary) at (0.5, -12) {Summary statistics};
\node [block] (hypothesis) at (3, -12) {Hypothesis test};
\path [line, dashed] (-4.2, -10.5) -- (perfs);
\path [line] (perfs) -- (-4.2, -13) -- (-2, -13) -- (visualization);
\path [line] (-2, -13) -- (0.5, -13) -- (summary);
\path [line] (0.5, -13) -- (3, -13) -- (hypothesis);
\end{tikzpicture}
\tcblower
% A brief description of the experimental plan.
The experimental plan for estimating the expected performance of a solution involves
sampling the data, training and testing the solution, evaluating the performance, and
validating the results.
\end{figurebox}
A summary of the experimental plan for estimating expected performance is shown in
\cref{fig:plan-single}.
\clearpage
\subsection{Cross-validation}
The most common sampling strategy is \emph{cross-validation}. It assumes that the data
are independent and identically distributed (i.i.d.). The cross-validation technique
randomly divides the dataset into $r$ folds of (approximately) the same size. Each fold
is used as a test set once and as part of the training set $r-1$ times. So, first we use
folds $2, 3, \ldots, r$ as the training set and fold $1$ as the test set. Then, we use
folds $1, 3, \ldots, r$ as the training set and fold $2$ as the test set. And so on.
(See \cref{fig:cross-validation}.)
\begin{figurebox}[label=fig:cross-validation]{Cross-validation}
\centering
\begin{tikzpicture}
\foreach \i in {1, 2, 3, 4} {
\node at (2 * \i, 0) {Fold \i};
\foreach \j in {1, 2, 3, 4} {
% if \i == \j, then it is the test set
\ifnum\i=\j
\node [smallblock, minimum width=18mm] (fold\i\j) at (2 * \i, -\j) {Test};
\else
\node [smalldarkblock, minimum width=18mm] (fold\i\j) at (2 * \i, -\j) {Training};
\fi
}
}
\end{tikzpicture}
\tcblower
Cross-validation is a technique to sample training and test sets. It divides the
dataset into $r$ folds, using $r-1$ folds as a training set and the remaining fold as a
test set.
\end{figurebox}
If possible, one should use repeated
cross-validation, where the whole procedure is run several times with different random
fold assignments. Also, when dealing with classification problems, we should use
stratified cross-validation, where the distribution of the classes is preserved in each
fold.
Moreover, a single cross-validation run enables us to build a prediction vector for the
whole dataset by concatenating the predictions for each fold. (Note, however, that these
predictions are not totally independent, as they share some training data. This
dependency should be taken into account when analyzing the results.) This vector can be
used to perform hypothesis tests --- like McNemar's test, see \cref{sub:comparison} ---
or to plot ROC (Receiver Operating Characteristic) curves or DET (Detection Error
Tradeoff) curves --- see \cref{chap:evaluation}.
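The sketch below, which assumes scikit-learn and the placeholder data and pipeline from
the previous sketches, illustrates repeated stratified cross-validation and the
out-of-fold prediction vector:
\begin{verbatim}
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (RepeatedStratifiedKFold,
                                     StratifiedKFold, cross_val_predict)

# Repeated, stratified cross-validation to sample performance values p_1, p_2, ...
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
performances = [
    accuracy_score(
        y[test_idx],
        clone(search).fit(X[train_idx], y[train_idx]).predict(X[test_idx]),
    )
    for train_idx, test_idx in cv.split(X, y)
]

# A single cross-validation run also yields out-of-fold predictions for the
# whole dataset, useful for McNemar's test or ROC/DET curves.
oof_pred = cross_val_predict(clone(search), X, y, cv=StratifiedKFold(n_splits=10))
\end{verbatim}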
\subsection{Validation methods}
% Talk about summary statistics, visualization (boxplot, roc and det curves), and Bayesian
% analysis.
Validation is the process of interpreting or confirming the meaning of the evaluation
results, that is, of determining the degree to which they support the intended use of
the solution.
Sometimes, it is as simple as calculating the mean and standard deviation of the
performance samples. Other times, we need to use more sophisticated techniques, like
hypothesis tests or Bayesian analysis.
Let us say our goal is to reach a certain performance threshold $p_0$. After an
experiment with $10$ repetitions of $10$-fold cross-validation, we have the average
performance $\bar{p}$ and the standard deviation $\sigma$. If $\bar{p} - \sigma \gg
p_0$, it is very likely that the solution will reach the threshold in production.
Although this is not a formal validation, it is often a good enough indication.
Also, it is common to use visualization techniques to analyze the results. Box plots are
a good way to see the distribution of the performance samples.
A more sophisticated technique is to use Bayesian analysis. In this case, we use the
performance samples to estimate the probability distribution of the performance of the
algorithm. This distribution can be used to calculate the probability of the performance
being better than a certain threshold.
\textcite{Benavoli2017} propose an interesting Bayesian test that accounts for the
overlapping training sets in the cross-validation\footnote{%
This is actually a particular case of the proposal in the paper, where the authors
consider the comparison between two performance vectors --- which is the case described
in \cref{sub:comparison}.}.
Let $z_k = p_k - p^{*}$ be the
difference between the performance of the $k$-th fold and the performance goal $p^{*}$.
A generative model for the data is
\begin{equation*}
\vec{z} = \vec{1}\mu + \vec{v}\text{,}
\end{equation*}
where $\vec{z} = (z_1, z_2, \ldots, z_n)$ is the vector of performance gains, $\vec{1}$ is a
vector of ones, $\mu$ is the parameter of interest (the mean performance gain), and
$\vec{v} \sim \operatorname{MVN}(0, \Sigma)$ is a multivariate normal noise with zero mean
and covariance matrix $\Sigma$. The covariance matrix $\Sigma$ is characterized as
\begin{equation*}
\Sigma_{ii} = \sigma^2\text{,}\quad
\Sigma_{ij} = \sigma^2\rho\text{,}
\end{equation*}
for all $i \neq j \in \{1, 2, \ldots, n\}$, where $\rho$ is the correlation (between folds)
and $\sigma^2$ is the variance. The likelihood model of the data is
\begin{equation*}
\Prob(\vec{z} \mid \mu, \Sigma) =
\frac{1}{(2\pi)^{n/2} \sqrt{\lvert \Sigma \rvert}}
\exp\left(-\frac{1}{2}(\vec{z} - \vec{1}\mu)^T \Sigma^{-1} (\vec{z} - \vec{1}\mu)\right)\text{.}
\end{equation*}
According to them, such a likelihood does not allow estimating the correlation from the
data, as the maximum likelihood estimate of $\rho$ is zero regardless of the
observations. Since $\rho$ is not identifiable, the authors suggest a heuristic where
$\rho$ is the ratio between the number of folds and the total number of performance
samples.
To estimate the probability of the performance of the solution being greater than the
threshold, we first estimate the parameters $\mu$ and $\nu = \sigma^{-2}$ of the
generative model. \textcite{Benavoli2017} consider the prior
\begin{equation*}
\Prob(\mu, \nu \mid \mu_0, \kappa_0, a, b) = \operatorname{NG}(\mu, \nu; \mu_0, \kappa_0, a, b)\text{,}
\end{equation*}
that is a Normal-Gamma distribution with parameters $(\mu_0, \kappa_0, a, b)$. This is a
conjugate prior to the likelihood model. Choosing the prior parameters $\mu_0 = 0$,
$\kappa_0 \to \infty$, $a = -1/2$, and $b = 0$, the posterior distribution of $\mu$ is a
location-scale Student distribution. Mathematically, we have
\begin{equation*}
\Prob(\mu \mid \vec{z}, \mu_0, \kappa_0, a, b) =
\operatorname{St}(\mu; n - 1, \bar{z}, \left(
\frac{1}{n} + \frac{\rho}{1 - \rho}
\right)s^2)\text{,}
\end{equation*}
where
\begin{equation*}
\bar{z} = \frac{1}{n} \sum_{i=1}^n z_i\text{,}
\end{equation*}
and
\begin{equation*}
s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (z_i - \bar{z})^2\text{.}
\end{equation*}
Thus, validating that the solution obtained by the algorithm in production will surpass
the threshold $p^{*}$ consists of checking whether
\begin{equation*}
\Prob(\mu > 0 \mid \vec{z}) > \gamma\text{,}
\end{equation*}
where $\gamma$ is the desired confidence level.
Note that the Bayesian analysis is a more sophisticated technique than null hypothesis
significance testing, as it allows us to estimate the probability that the performance of
the solution is better than a certain threshold.
Also, be aware that the choice of the model and the prior distribution can affect the
results. \textcite{Benavoli2017} suggest using 10 repetitions of 10-fold cross-validation
to estimate the parameters of the generative model. They also show experimental evidence
that their choices are robust to the choice of the prior distribution. However, one
should be aware of the limitations of the model.
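As an illustration, the following sketch (in Python with SciPy, reusing the performance
samples from the cross-validation sketch; the helper name, the threshold $p^{*} = 0.8$,
and $\gamma = 0.95$ are illustrative assumptions) computes the posterior probability
described above:
\begin{verbatim}
import numpy as np
from scipy import stats

def prob_mu_greater_than_zero(z, n_folds):
    """Posterior P(mu > 0 | z) under the correlated Bayesian model above."""
    z = np.asarray(z, dtype=float)
    n = len(z)
    rho = n_folds / n                 # heuristic correlation between folds (see text)
    z_bar = z.mean()
    s2 = z.var(ddof=1)                # s^2 with the 1/(n-1) factor
    scale = np.sqrt((1.0 / n + rho / (1.0 - rho)) * s2)
    # Posterior of mu: location-scale Student-t with n - 1 degrees of freedom.
    return stats.t.sf(0.0, df=n - 1, loc=z_bar, scale=scale)

# Example: does the solution surpass a performance goal p_star = 0.8?
p_star, gamma = 0.8, 0.95                 # illustrative values
z = np.array(performances) - p_star       # performance gains z_k = p_k - p*
print(prob_mu_greater_than_zero(z, n_folds=10) > gamma)
\end{verbatim}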
\section{Comparing strategies}
\label{sub:comparison}
\textcolor{red}{Talk about paired dataset samples.}
When we have two or more strategies to solve a problem, we need to compare them to see
which one is better. This is a common situation in data science projects, as we usually
have many techniques to solve a problem.
One way to look at this problem is to consider that the algorithm\footnote{That includes
both data handling and machine learning.} $A$ has \emph{hyperparameters} $\lambda \in
\Lambda$. A hyperparameter here is a parameter that is not learned by the algorithm, but
is set by the user. For example, the number of neighbors in a k-NN algorithm is a
hyperparameter. For the sake of generality, we can consider that the hyperparameters may
also include different learning algorithms or data handling pipelines.
Let us say we have a baseline algorithm $A(\lambda_0)$ --- for instance, something that is
in production, the result of the last sprint, or a well-known algorithm --- and a new
candidate algorithm $A(\lambda)$.
Suppose $\vec{p}(\lambda_0)$ and $\vec{p}(\lambda)$ are the performance vectors of the
baseline and the candidate algorithms, respectively, that are calculated using the same
strategy described in \cref{sec:expected-performance}. We can validate whether the
candidate is better than the baseline by
\begin{equation*}
\Prob(\mu > 0 \mid \vec{z}) > \gamma\text{,}
\end{equation*}
where $\vec{z}$ is now $\vec{p}(\lambda) - \vec{p}(\lambda_0)$ and $\gamma$ is the confidence
level. Also, the expected performance gain of the candidate algorithm is $\mu$ --- if
negative, the performance loss.
This strategy can be applied iteratively to compare many algorithms. For example, we can
compare $A(\lambda_1)$ with $A(\lambda_0)$, $A(\lambda_2)$ with $A(\lambda_1)$, and so on,
keeping the best algorithm found so far as the baseline. In the cases where the confidence
level is not reached, but the expected performance gain is positive, we can consider
additional characteristics of the algorithms, like the interpretability of the model, the
computational cost, or the ease of implementation, to decide which one is better. However,
one should pay attention to whether the probability
\begin{equation*}
\Prob(\mu < 0 \mid \vec{z})
\end{equation*}
is too high. Always ask yourself whether the risk of a performance loss is worth taking in
the real-world scenario.
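A sketch of the paired comparison follows, reusing the hypothetical posterior-probability
helper from the previous sketch; the baseline and candidate pipelines are arbitrary
choices, and both must be evaluated on exactly the same folds so that the differences
are paired:
\begin{verbatim}
import numpy as np
from sklearn.base import clone
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

baseline = Pipeline([("handling", StandardScaler()), ("model", LogisticRegression())])
candidate = Pipeline([("handling", StandardScaler()), ("model", RandomForestClassifier())])

cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
p_baseline, p_candidate = [], []
for train_idx, test_idx in cv.split(X, y):   # same folds for both algorithms
    for algo, scores in ((baseline, p_baseline), (candidate, p_candidate)):
        fitted = clone(algo).fit(X[train_idx], y[train_idx])
        scores.append(accuracy_score(y[test_idx], fitted.predict(X[test_idx])))

z = np.array(p_candidate) - np.array(p_baseline)   # paired performance gains
print(prob_mu_greater_than_zero(z, n_folds=10))    # P(candidate better than baseline)
\end{verbatim}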
\subsection{About nesting experiments}
Mathematically speaking, there is no difference between assessing the choice of
$\left[\phi, \theta\right]$ and the choice of $\lambda$. Thus, some techniques --- like
grid search --- can be used to find the best hyperparameters using a nested experimental
plan.
The idea is the same: we assess how well the hyperparameter-optimization technique $B$ is
expected to find appropriate hyperparameters. Similarly, the choice of the
hyperparameters and the parameters that go to production is the result of applying $B$ to
the whole dataset. However, never use the hyperparameter choices made within the
experimental plan to make decisions about what goes to production.
(The same is true for the parameters $\left[\phi, \theta\right]$ in the traditional case.)
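A minimal sketch of such a nested plan, assuming scikit-learn and the pipeline from the
earlier sketches (the hyperparameter grid is an arbitrary illustration): the inner
search $B$ is evaluated by the outer cross-validation, while what goes to production is
$B$ applied to the whole dataset.
\begin{verbatim}
from sklearn.model_selection import GridSearchCV, cross_val_score

# B: grid search over hyperparameters lambda, itself cross-validated (inner loop).
param_grid = {"model__C": [0.01, 0.1, 1.0, 10.0]}   # illustrative grid
search_B = GridSearchCV(search, param_grid, cv=5)

# Outer loop: estimate the expected performance of applying B.
outer_scores = cross_val_score(search_B, X, y, cv=10)

# What goes to production: B applied to the whole dataset.
deployed = search_B.fit(X, y)
print(outer_scores.mean(), deployed.best_params_)
\end{verbatim}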
\textcolor{red}{We can unnest the search by interpreting the options as different
algorithms two by two.}
\section{Grouping}
TODO Grouped cross validation.
% vim: spell spelllang=en