%load_ext d2lbook.tab
tab.interact_select(['mxnet', 'pytorch', 'tensorflow', 'jax'])
:label:`sec_prob`
One way or another, machine learning is all about uncertainty. In supervised learning, we want to predict something unknown (the target) given something known (the features). Depending on our objective, we might attempt to predict the most likely value of the target. Or we might predict the value with the smallest expected distance from the target. And sometimes we wish not only to predict a specific value but to quantify our uncertainty. For example, given some features describing a patient, we might want to know how likely they are to suffer a heart attack in the next year. In unsupervised learning, we often care about uncertainty. To determine whether a set of measurements is anomalous, it helps to know how likely one is to observe values in a population of interest. Moreover, in reinforcement learning, we wish to develop agents that act intelligently in various environments. This requires reasoning about how an environment might be expected to change and what rewards one might expect to encounter in response to each of the available actions.
Probability is the mathematical field concerned with reasoning under uncertainty. Given a probabilistic model of some process, we can reason about the likelihood of various events. The use of probabilities to describe the frequencies of repeatable events (like coin tosses) is fairly uncontroversial. In fact, frequentist scholars adhere to an interpretation of probability that applies only to such repeatable events. By contrast, Bayesian scholars use the language of probability more broadly to formalize our reasoning under uncertainty. Bayesian probability is characterized by two unique features: (i) assigning degrees of belief to non-repeatable events, e.g., what is the probability that the moon is made out of cheese?; and (ii) subjectivity---while Bayesian probability provides unambiguous rules for how one should update their beliefs in light of new evidence, it allows for different individuals to start off with different prior beliefs. Statistics helps us to reason backwards, starting off with collection and organization of data and backing out to what inferences we might draw about the process that generated the data. Whenever we analyze a dataset, hunting for patterns that we hope might characterize a broader population, we are employing statistical thinking. Entire courses, majors, theses, careers, departments, companies, and institutions have been devoted to the study of probability and statistics. While this section only scratches the surface, we will provide the foundation that you need to begin building models.
%%tab mxnet
%matplotlib inline
from d2l import mxnet as d2l
from mxnet import np, npx
from mxnet.numpy.random import multinomial
import random
npx.set_np()
%%tab pytorch
%matplotlib inline
from d2l import torch as d2l
import random
import torch
from torch.distributions.multinomial import Multinomial
%%tab tensorflow
%matplotlib inline
from d2l import tensorflow as d2l
import random
import tensorflow as tf
from tensorflow_probability import distributions as tfd
%%tab jax
%matplotlib inline
from d2l import jax as d2l
import random
import jax
from jax import numpy as jnp
import numpy as np
Imagine that we plan to toss a coin
and want to quantify how likely
we are to see heads (vs. tails).
If the coin is fair,
then both outcomes
(heads and tails)
are equally likely.
Moreover, if we plan to toss the coin $n$ times,
then the fraction of heads that we expect to see
should match the expected fraction of tails.
Formally, the quantity $1/2$ is called a *probability*:
it captures the certainty with which
any given toss will come up heads.
Probabilities assign scores between $0$ and $1$
to outcomes of interest, called *events*.
Here the event of interest is $\mathrm{heads}$
and we denote the corresponding probability $P(\mathrm{heads})$.

Suppose that we stumbled upon a real coin
for which we did not know
the true $P(\mathrm{heads})$.
To investigate this quantity with statistical methods,
we need to (i) collect some data;
and (ii) design an estimator.
Data acquisition here is easy;
we can toss the coin many times
and record all of the outcomes.
One natural estimator is the fraction
of observed heads among the total number of tosses.

Now, suppose that the coin was in fact fair,
i.e., $P(\mathrm{heads}) = 0.5$.
To simulate tosses of a fair coin,
we can invoke any random number generator.
For example, Python's `random.random`
yields numbers in the interval $[0, 1]$
uniformly at random,
so we can obtain `0` and `1`
with probability 0.5
each
by testing whether the returned float is greater than 0.5:
%%tab all
num_tosses = 100
heads = sum([random.random() > 0.5 for _ in range(num_tosses)])
tails = num_tosses - heads
print("heads, tails: ", [heads, tails])
More generally, we can simulate multiple draws
from any random variable with a finite number
of possible outcomes
(like the toss of a coin or roll of a die)
by calling the multinomial function,
setting the first argument
to the number of draws
and the second as a list of probabilities
associated with each of the possible outcomes.
To simulate 100 tosses of a fair coin,
we assign the probability vector `[0.5, 0.5]`,
interpreting index 0 as heads
and index 1 as tails.
The function returns a vector
with length equal to the number
of possible outcomes (here, 2),
where the first component tells us
the number of occurrences of heads
and the second component tells us
the number of occurrences of tails.
%%tab mxnet
fair_probs = [0.5, 0.5]
multinomial(100, fair_probs)
%%tab pytorch
fair_probs = torch.tensor([0.5, 0.5])
Multinomial(100, fair_probs).sample()
%%tab tensorflow
fair_probs = tf.ones(2) / 2
tfd.Multinomial(100, fair_probs).sample()
%%tab jax
fair_probs = [0.5, 0.5]
# jax.random does not have multinomial distribution implemented
np.random.multinomial(100, fair_probs)
Each time you run this sampling process,
you will receive a new random value
that may differ from the previous outcome.
Dividing by the number of tosses
gives us the frequency
of each outcome in our data.
Note that these frequencies,
like the probabilities
that they are intended
to estimate, sum to $1$.
%%tab mxnet
multinomial(100, fair_probs) / 100
%%tab pytorch
Multinomial(100, fair_probs).sample() / 100
%%tab tensorflow
tfd.Multinomial(100, fair_probs).sample() / 100
%%tab jax
np.random.multinomial(100, fair_probs) / 100
Here, even though our simulated coin is fair
(we set the probabilities [0.5, 0.5]
ourselves),
the counts of heads and tails may not be identical.
That is because we only drew a finite number of samples.
If we did not implement the simulation ourselves,
and only saw the outcome,
how would we know if the coin were slightly unfair
or if the possible deviation from $1/2$
was just an artifact of the small sample size?
Let's see what happens when we simulate 10,000 tosses.
%%tab mxnet
counts = multinomial(10000, fair_probs).astype(np.float32)
counts / 10000
%%tab pytorch
counts = Multinomial(10000, fair_probs).sample()
counts / 10000
%%tab tensorflow
counts = tfd.Multinomial(10000, fair_probs).sample()
counts / 10000
%%tab jax
counts = np.random.multinomial(10000, fair_probs).astype(np.float32)
counts / 10000
In general, for averages of repeated events (like coin tosses),
as the number of repetitions grows,
our estimates are guaranteed to converge
to the true underlying probabilities.
The mathematical formulation of this phenomenon
is called the *law of large numbers*,
and the *central limit theorem*
tells us that in many situations,
as the sample size $n$ grows,
these errors should go down
at a rate of $(1/\sqrt{n})$.
Let's get some more intuition by studying
how our estimate evolves as we grow
the number of tosses from 1 to 10,000.
%%tab pytorch
counts = Multinomial(1, fair_probs).sample((10000,))
cum_counts = counts.cumsum(dim=0)
estimates = cum_counts / cum_counts.sum(dim=1, keepdims=True)
estimates = estimates.numpy()
d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();
%%tab mxnet
counts = multinomial(1, fair_probs, size=10000)
cum_counts = counts.astype(np.float32).cumsum(axis=0)
estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)
%%tab tensorflow
counts = tfd.Multinomial(1, fair_probs).sample(10000)
cum_counts = tf.cumsum(counts, axis=0)
estimates = cum_counts / tf.reduce_sum(cum_counts, axis=1, keepdims=True)
estimates = estimates.numpy()
%%tab jax
counts = np.random.multinomial(1, fair_probs, size=10000).astype(np.float32)
cum_counts = counts.cumsum(axis=0)
estimates = cum_counts / cum_counts.sum(axis=1, keepdims=True)
%%tab mxnet, tensorflow, jax
d2l.set_figsize((4.5, 3.5))
d2l.plt.plot(estimates[:, 0], label=("P(coin=heads)"))
d2l.plt.plot(estimates[:, 1], label=("P(coin=tails)"))
d2l.plt.axhline(y=0.5, color='black', linestyle='dashed')
d2l.plt.gca().set_xlabel('Samples')
d2l.plt.gca().set_ylabel('Estimated probability')
d2l.plt.legend();
Each solid curve corresponds to one of the two values of the coin and gives our estimated probability that the coin turns up that value after each group of experiments. The dashed black line gives the true underlying probability. As we get more data by conducting more experiments, the curves converge towards the true probability. You might already begin to see the shape of some of the more advanced questions that preoccupy statisticians: How quickly does this convergence happen? If we had already tested many coins manufactured at the same plant, how might we incorporate this information?
We have already gotten pretty far: posing a probabilistic model, generating synthetic data, running a statistical estimator, empirically assessing convergence, and reporting error metrics (checking the deviation). However, to go much further, we will need to be more precise.
When dealing with randomness,
we denote the set of possible outcomes $\mathcal{S}$
and call it the *sample space* or *outcome space*.
Here, each element is a distinct possible *outcome*.
*Events* are subsets of the sample space.
For a single roll of a die, we could define
the events "seeing a $5$" ($\mathcal{A} = \{5\}$)
and "seeing an odd number" ($\mathcal{B} = \{1, 3, 5\}$).
In this case, if the die came up $5$,
we would say that both $\mathcal{A}$ and $\mathcal{B}$ occurred.
A *probability* function maps events
onto real values ${P: \mathcal{A} \subseteq \mathcal{S} \rightarrow [0, 1]}$.
The probability of an event $\mathcal{A}$ in the given sample space $\mathcal{S}$,
denoted $P(\mathcal{A})$, has the following properties:

- The probability of any event $\mathcal{A}$ is a non-negative real number, i.e., $P(\mathcal{A}) \geq 0$;
- The probability of the entire sample space is $1$, i.e., $P(\mathcal{S}) = 1$;
- For any countable sequence of events $\mathcal{A}_1, \mathcal{A}_2, \ldots$ that are *mutually exclusive* ($\mathcal{A}_i \cap \mathcal{A}_j = \emptyset$ for all $i \neq j$), the probability that any of them happens is equal to the sum of their individual probabilities, i.e., $P(\bigcup_{i=1}^{\infty} \mathcal{A}_i) = \sum_{i=1}^{\infty} P(\mathcal{A}_i)$.
These axioms of probability theory,
proposed by :citet:`Kolmogorov.1933`,
can be applied to rapidly derive a number of important consequences.
For instance, it follows immediately
that the probability of any event $\mathcal{A}$
*or* its complement $\mathcal{A}'$ occurring is $1$
(because $\mathcal{A} \cup \mathcal{A}' = \mathcal{S}$),
and that the probability of the empty event is $0$,
i.e., $P(\emptyset) = 0$.
When we spoke about events like the roll of a die
coming up odd or the first coin toss coming up heads,
we were invoking the idea of a random variable.
Formally, random variables are mappings
from an underlying sample space
to a set of (possibly many) values.
You might wonder how a random variable
is different from the sample space,
since both are collections of outcomes.
Importantly, random variables can be much coarser
than the raw sample space.
We can define a binary random variable like "greater than 0.5"
even when the underlying sample space is infinite,
e.g., points on the line segment between $0$ and $1$.
Every value taken by a random variable corresponds
to a subset of the underlying sample space.
Thus the occurrence where the random variable $X$
takes value $v$, denoted by $X = v$, is an event,
and $P(X = v)$ denotes its probability.
Note that there is a subtle difference between discrete random variables, like flips of a coin or tosses of a die, and continuous ones, like the weight and the height of a person sampled at random from the population. In the latter case, we seldom really care about someone's exact height. Moreover, if we took precise enough measurements, we would find that no two people on the planet have the exact same height. In fact, with fine enough measurements, you would never have the same height when you wake up and when you go to sleep. There's little point in asking about the exact probability that someone is 1.801392782910287192 meters tall. Instead, we typically care more about being able to say whether someone's height falls into a given interval, say between 1.79 and 1.81 meters. In these cases we work with probability densities. The height of exactly 1.80 meters has no probability, but nonzero density. To recover the probability assigned to an interval, we must take an integral of the density over that interval.
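To make this concrete, here is a minimal plain-Python sketch. The normal model of heights and its parameters below are made-up assumptions for illustration, not estimates from real data; the point is only that interval probabilities come from integrating (here, differencing the CDF of) a density, while any exact value has probability zero.

```python
import math

# Hypothetical model: heights ~ Normal(mu=1.75 m, sigma=0.07 m) (made-up numbers).
mu, sigma = 1.75, 0.07

def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a normal distribution."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(1.79 <= height <= 1.81) is the integral of the density over the interval,
# i.e., the difference of the CDF evaluated at the endpoints.
p_interval = normal_cdf(1.81, mu, sigma) - normal_cdf(1.79, mu, sigma)
print(f"P(1.79 <= height <= 1.81) = {p_interval:.4f}")
# The probability of any exact height, e.g., exactly 1.80 m, is zero,
# even though the density there is positive.
```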
You might have noticed that we couldn't even make it past the last section without making statements involving interactions among multiple random variables (recall $P(X,Y) = P(X) P(Y)$). Most of machine learning is concerned with such relationships. Here, the sample space would be the population of interest, say customers who transact with a business, photographs on the internet, or proteins known to biologists. Each random variable would represent the (unknown) value of a different attribute. Whenever we sample an individual from the population, we observe a realization of each of the random variables. Because the values taken by random variables correspond to subsets of the sample space that could be overlapping, partially overlapping, or entirely disjoint, knowing the value taken by one random variable can cause us to update our beliefs about what values of another random variable are likely. If a patient walks into a hospital and we observe that they are having trouble breathing and have lost their sense of smell, then we believe that they are more likely to have COVID-19 than we might if they had no trouble breathing and a perfectly ordinary sense of smell.
When working with multiple random variables,
we can construct events corresponding
to every combination of values
that the variables can jointly take.
The probability function that assigns
probabilities to each of these combinations
(e.g., $A = a$ and $B = b$)
is called the *joint probability* function
and is denoted $P(A = a, B = b)$,
where the comma indicates "and".
Note that for any values $a$ and $b$,
$P(A = a, B = b) \leq P(A = a)$
and $P(A = a, B = b) \leq P(B = b)$,
since for $A = a$ and $B = b$ to happen,
$A = a$ has to happen *and* $B = b$ also has to happen.
The ratio $\frac{P(A = a, B = b)}{P(A = a)}$,
which can be no greater than $1$,
turns out to be extremely important.
It is called the *conditional probability*
and is denoted via the "$\mid$" symbol:

$$P(B = b \mid A = a) = P(A = a, B = b) / P(A = a).$$
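As a minimal NumPy sketch (the joint table below is made up purely for illustration), we can store the joint distribution of two binary variables as a matrix, recover the marginals by summing, and compute conditionals as ratios:

```python
import numpy as np

# Hypothetical joint probability table P(A=a, B=b) for two binary variables
# (rows index a in {0, 1}, columns index b in {0, 1}); the numbers are made up.
P_AB = np.array([[0.1, 0.2],
                 [0.3, 0.4]])
assert np.isclose(P_AB.sum(), 1.0)  # a valid joint distribution sums to 1

# Marginals are obtained by summing out the other variable.
P_A = P_AB.sum(axis=1)  # P(A=a)
P_B = P_AB.sum(axis=0)  # P(B=b)

# Conditional probability P(B=b | A=a) = P(A=a, B=b) / P(A=a).
P_B_given_A = P_AB / P_A[:, None]
print(P_B_given_A)
print(P_B_given_A.sum(axis=1))  # each row sums to 1
```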
Using the definition of conditional probabilities,
we can derive the famous result called Bayes' theorem.
By construction, we have that
$P(A, B) = P(B \mid A) P(A)$
and $P(A, B) = P(A \mid B) P(B)$.
Combining both equations yields

$$P(A \mid B) = \frac{P(B \mid A) P(A)}{P(B)}.$$

This simple equation has profound implications because
it allows us to reverse the order of conditioning.
If we know how to estimate $P(B \mid A)$, $P(A)$, and $P(B)$,
then we can estimate $P(A \mid B)$.
Since we know that $P(A \mid B)$ must be normalized to $1$,
i.e., $\sum_a P(A = a \mid B) = 1$,
we can also work with the unnormalized form
$P(A \mid B) \propto P(B \mid A) P(A)$
and recover the normalization afterwards.

In Bayesian statistics, we think of an observer
as possessing some (subjective) prior beliefs
about the plausibility of the available hypotheses
encoded in the *prior* $P(H)$,
together with a *likelihood* $P(E \mid H)$ that tells us
how probable the observed evidence is under each hypothesis.
Bayes' theorem then yields the *posterior* $P(H \mid E)$,
i.e., our updated beliefs after seeing the evidence.

Note that the same normalization also allows us
to *marginalize* over random variables,
i.e., to drop variables from a joint distribution:
$\sum_a P(A = a, B) = P(B)$.
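The following small sketch (with an arbitrary, made-up prior and likelihood) shows how Bayes' theorem is typically applied numerically: multiply the prior by the likelihood and normalize so that the posterior sums to one.

```python
import numpy as np

# Hypothetical prior over two hypotheses and likelihood of an observed event
# under each hypothesis (the numbers are invented for illustration).
prior = np.array([0.8, 0.2])        # P(A=a)
likelihood = np.array([0.1, 0.7])   # P(B=1 | A=a)

# Bayes' theorem: P(A=a | B=1) is proportional to P(B=1 | A=a) P(A=a),
# normalized by the evidence P(B=1) = sum_a P(B=1 | A=a) P(A=a).
unnormalized = likelihood * prior
posterior = unnormalized / unnormalized.sum()
print(posterior, posterior.sum())   # the posterior sums to 1
```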
Independence is another fundamentally important concept
that forms the backbone of
many important ideas in statistics.
In short, two variables are *independent*
if conditioning on the value of $A$
does not cause any change to the probability distribution
associated with $B$, and vice versa.
More formally, independence, denoted $A \perp B$,
requires that $P(A \mid B) = P(A)$ and, consequently,
that $P(A, B) = P(A \mid B) P(B) = P(A) P(B)$.
Independence is especially useful when it holds among the successive draws of our data from some underlying distribution (allowing us to make strong statistical conclusions) or when it holds among various variables in our data, allowing us to work with simpler models that encode this independence structure. On the other hand, estimating the dependencies among random variables is often the very aim of learning. We care to estimate the probability of disease given symptoms specifically because we believe that diseases and symptoms are not independent.
Note that because conditional probabilities are proper probabilities,
the concepts of independence and dependence also apply to them.
Two random variables $A$ and $B$
are *conditionally independent* given a third variable $C$
if and only if $P(A, B \mid C) = P(A \mid C) P(B \mid C)$.
Interestingly, two variables can be independent in general
but become dependent when conditioning on a third:
e.g., the outcomes of two fair dice are independent,
but they become dependent once we are told that their sum is 5.
And conversely, two dependent random variables can become independent upon conditioning on a third. This often happens when two otherwise unrelated events have a common cause. Shoe size and reading level are highly correlated among elementary school students, but this correlation disappears if we condition on age.
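We can see this effect in a toy simulation (all of the numbers below, such as how age relates to shoe size and reading level, are invented for illustration): marginally the two quantities are strongly correlated, but within a single age group the correlation largely disappears.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulation (all numbers made up): age drives both shoe size and
# reading level; given age, the two are generated independently.
n = 100_000
age = rng.integers(6, 12, size=n)                    # elementary school ages
shoe_size = 0.8 * age + rng.normal(0, 0.5, size=n)   # depends only on age
reading = 1.0 * age + rng.normal(0, 1.0, size=n)     # depends only on age

# Marginally, shoe size and reading level are strongly correlated...
print("overall corr:", np.corrcoef(shoe_size, reading)[0, 1])

# ...but conditioning on age makes the correlation (nearly) vanish.
mask = age == 9
print("corr given age=9:", np.corrcoef(shoe_size[mask], reading[mask])[0, 1])
```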
:label:`subsec_probability_hiv_app`
Let's put our skills to the test.
Assume that a doctor administers an HIV test to a patient.
This test is fairly accurate and fails only with 1% probability
if the patient is healthy but is reported as diseased,
i.e., healthy patients test positive in 1% of cases.
Moreover, it never fails to detect HIV if the patient actually has it.
We use $D_1 \in \{0, 1\}$ to denote the diagnosis
($0$ if negative and $1$ if positive)
and $H \in \{0, 1\}$ to denote the HIV status.

| Conditional probability | $H=1$ | $H=0$ |
|---|---|---|
| $P(D_1 = 1 \mid H)$ | 1 | 0.01 |
| $P(D_1 = 0 \mid H)$ | 0 | 0.99 |
Note that the column sums are all 1 (but the row sums are not),
since they are conditional probabilities.
Let's compute the probability of the patient having HIV
if the test comes back positive, i.e., $P(H = 1 \mid D_1 = 1)$.
Intuitively this is going to depend on how common the disease is,
since it affects the number of false alarms.
Assume that the population is fairly healthy, e.g., $P(H = 1) = 0.0015$.
To apply Bayes' theorem, we need to apply marginalization to determine

$$P(D_1 = 1) = P(D_1 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1 \mid H = 1) P(H = 1) = 0.011485.$$

This leads us to

$$P(H = 1 \mid D_1 = 1) = \frac{P(D_1 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1)} = \frac{1 \times 0.0015}{0.011485} = 0.1306.$$
In other words, there is only a 13.06% chance that the patient actually has HIV, despite using a very accurate test. As we can see, probability can be counterintuitive. What should a patient do upon receiving such terrifying news? Likely, the patient would ask the physician to administer another test to get clarity. The second test has different characteristics and it is not as good as the first one.
| Conditional probability | $H=1$ | $H=0$ |
|---|---|---|
| $P(D_2 = 1 \mid H)$ | 0.98 | 0.03 |
| $P(D_2 = 0 \mid H)$ | 0.02 | 0.97 |
Unfortunately, the second test comes back positive, too. Let's calculate the requisite probabilities to invoke Bayes' theorem by assuming conditional independence:

$$P(D_1 = 1, D_2 = 1 \mid H = 0) = P(D_1 = 1 \mid H = 0) P(D_2 = 1 \mid H = 0) = 0.0003,$$

$$P(D_1 = 1, D_2 = 1 \mid H = 1) = P(D_1 = 1 \mid H = 1) P(D_2 = 1 \mid H = 1) = 0.98.$$
Now we can apply marginalization to obtain the probability that both tests come back positive:

$$P(D_1 = 1, D_2 = 1) = P(D_1 = 1, D_2 = 1 \mid H = 0) P(H = 0) + P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1) = 0.00176955.$$
Finally, the probability of the patient having HIV given that both tests are positive is

$$P(H = 1 \mid D_1 = 1, D_2 = 1) = \frac{P(D_1 = 1, D_2 = 1 \mid H = 1) P(H = 1)}{P(D_1 = 1, D_2 = 1)} = 0.831.$$
That is, the second test allowed us to gain much higher confidence that not all is well. Despite the second test being considerably less accurate than the first one, it still significantly improved our estimate. The assumption of the two tests being conditionally independent of each other was crucial for our ability to generate a more accurate estimate. Take the extreme case where we run the same test twice. In this situation we would expect the same outcome both times, and hence no additional insight would be gained from running the same test again. The astute reader might have noticed that the diagnosis behaved like a classifier hiding in plain sight, where our ability to decide whether a patient is healthy increases as we obtain more features (test outcomes).
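If you want to check these numbers yourself, the following short sketch reproduces both posteriors directly from the two conditional probability tables and the assumed prevalence:

```python
# Reproduce the calculation above: probabilities are taken from the two
# tables and the assumed prevalence P(H=1) = 0.0015.
p_h1 = 0.0015                                # prior P(H=1)
p_h0 = 1 - p_h1

p_d1_given_h1, p_d1_given_h0 = 1.00, 0.01    # first test
p_d2_given_h1, p_d2_given_h0 = 0.98, 0.03    # second test

# Posterior after the first positive test.
evidence1 = p_d1_given_h1 * p_h1 + p_d1_given_h0 * p_h0
print("P(H=1 | D1=1) =", p_d1_given_h1 * p_h1 / evidence1)        # ~0.1306

# Posterior after both tests are positive (assuming conditional independence).
joint_h1 = p_d1_given_h1 * p_d2_given_h1 * p_h1
joint_h0 = p_d1_given_h0 * p_d2_given_h0 * p_h0
print("P(H=1 | D1=1, D2=1) =", joint_h1 / (joint_h1 + joint_h0))  # ~0.831
```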
Often, making decisions requires not just looking
at the probabilities assigned to individual events
but composing them together into useful aggregates
that can provide us with guidance.
For example, when random variables take continuous scalar values,
we often care about knowing what value to expect on average.
This quantity is formally called an expectation.
If we are making investments,
the first quantity of interest
might be the return we can expect,
averaging over all the possible outcomes
(and weighting by the appropriate probabilities).
For instance, say that with 50% probability,
an investment might fail altogether,
with 40% probability it might provide a 2$\times$ return,
and with 10% probability it might provide a 10$\times$ return.
To calculate the expected return,
we sum over all returns, multiplying each
by the probability that they will occur.
This yields the expectation

$$0.5 \cdot 0 + 0.4 \cdot 2 + 0.1 \cdot 10 = 1.8.$$

Hence the expected return is 1.8$\times$ the original investment.
In general, the expectation (or average)
of the random variable $X$ is defined as

$$E[X] = E_{x \sim P}[x] = \sum_{x} x P(X = x).$$

Likewise, for densities we obtain $E[X] = \int x \, dp(x)$.
Sometimes we are interested in the expected value
of some function of $x$.
We can calculate these expectations as

$$E_{x \sim P}[f(x)] = \sum_x f(x) P(x) \quad \textrm{and} \quad E_{x \sim P}[f(x)] = \int f(x) p(x) \, dx$$

for discrete probabilities and densities, respectively.
Returning to the investment example from above,
$f$ might be the *utility* (happiness)
associated with the return.
If the utility associated with a total loss were $-1$,
and the utilities associated with returns of $1$, $2$, and $10$
were $1$, $2$ and $4$, respectively,
then the expected happiness of investing
would be

$$0.5 \cdot (-1) + 0.4 \cdot 2 + 0.1 \cdot 4 = 0.7.$$
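A quick sketch confirms both numbers for this toy investment (the probabilities, returns, and utilities are exactly the ones assumed above):

```python
# Expected return and expected utility for the toy investment above.
probs   = [0.5, 0.4, 0.1]    # fail, 2x return, 10x return
returns = [0.0, 2.0, 10.0]
utility = [-1.0, 2.0, 4.0]   # utilities from the text: -1 for a loss, 2 and 4 otherwise

expected_return  = sum(p * r for p, r in zip(probs, returns))
expected_utility = sum(p * u for p, u in zip(probs, utility))
print(expected_return, expected_utility)   # 1.8 and 0.7
```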
For financial decisions,
we might also want to measure
how risky an investment is.
Here, we care not just about the expected value
but how much the actual values tend to vary
relative to this value.
Note that we cannot just take
the expectation of the difference
between the actual and expected values.
That is because the expectation of a difference
is the difference of the expectations,
and thus $E[X - E[X]] = E[X] - E[E[X]] = 0$.
Instead, we look at the expectation
of the *squared* difference, called the *variance*:

$$\mathrm{Var}[X] = E\left[(X - E[X])^2\right] = E[X^2] - E[X]^2.$$

Here the equality follows by expanding
$(X - E[X])^2 = X^2 - 2 X E[X] + E[X]^2$
and taking expectations of each term.
Lastly, the variance of a function of a random variable is defined analogously as

$$\mathrm{Var}_{x \sim P}[f(x)] = E_{x \sim P}[f^2(x)] - E_{x \sim P}[f(x)]^2.$$
Returning to our investment example,
we can now compute the variance of the investment.
It is given by

$$0.5 \cdot 0^2 + 0.4 \cdot 2^2 + 0.1 \cdot 10^2 - 1.8^2 = 8.36.$$
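Again, a few lines of Python confirm the computation for the toy investment above:

```python
# Variance of the toy investment: E[X^2] - E[X]^2.
probs   = [0.5, 0.4, 0.1]
returns = [0.0, 2.0, 10.0]

mean = sum(p * r for p, r in zip(probs, returns))
second_moment = sum(p * r**2 for p, r in zip(probs, returns))
variance = second_moment - mean**2
print(variance)   # 11.6 - 1.8**2 = 8.36
```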
In the same way as we introduced expectations
and variance for scalar random variables,
we can do so for vector-valued ones.
Expectations are easy, since we can apply them elementwise.
For instance, $\boldsymbol{\mu} \stackrel{\mathrm{def}}{=} E_{\mathbf{x} \sim P}[\mathbf{x}]$
has coordinates $\mu_i = E_{\mathbf{x} \sim P}[x_i]$.
*Covariances* are more complicated.
We define them by taking expectations of the outer product
of the difference between random variables and their mean:

$$\boldsymbol{\Sigma} \stackrel{\mathrm{def}}{=} \mathrm{Cov}_{\mathbf{x} \sim P}[\mathbf{x}] = E_{\mathbf{x} \sim P}\left[(\mathbf{x} - \boldsymbol{\mu}) (\mathbf{x} - \boldsymbol{\mu})^\top\right].$$

This matrix $\boldsymbol{\Sigma}$ is referred to as the *covariance matrix*.
An easy way to see its effect is to consider some vector $\mathbf{v}$
of the same size as $\mathbf{x}$. It follows that

$$\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = E_{\mathbf{x} \sim P}\left[\mathbf{v}^\top (\mathbf{x} - \boldsymbol{\mu}) (\mathbf{x} - \boldsymbol{\mu})^\top \mathbf{v}\right] = \mathrm{Var}_{\mathbf{x} \sim P}[\mathbf{v}^\top \mathbf{x}].$$

As such, $\boldsymbol{\Sigma}$ allows us to compute the variance
for any linear function of $\mathbf{x}$
by a simple matrix multiplication.
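The identity $\mathbf{v}^\top \boldsymbol{\Sigma} \mathbf{v} = \mathrm{Var}[\mathbf{v}^\top \mathbf{x}]$ is easy to check empirically. The sketch below draws samples from an arbitrary, made-up two-dimensional Gaussian, estimates the covariance matrix, and compares the quadratic form with the sample variance of the projection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Draw correlated 2-d samples from an arbitrary toy distribution.
x = rng.multivariate_normal(mean=[0.0, 0.0],
                            cov=[[2.0, 0.8], [0.8, 1.0]], size=100_000)

mu = x.mean(axis=0)
Sigma = (x - mu).T @ (x - mu) / len(x)     # empirical covariance matrix

# For any vector v, v^T Sigma v equals the variance of the scalar v^T x.
v = np.array([0.5, -1.0])
print(v @ Sigma @ v, (x @ v).var())        # the two numbers should match closely
```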
In machine learning, there are many things to be uncertain about! We can be uncertain about the value of a label given an input. We can be uncertain about the estimated value of a parameter. We can even be uncertain about whether data arriving at deployment is even from the same distribution as the training data.
By aleatoric uncertainty, we denote that uncertainty
that is intrinsic to the problem,
and due to genuine randomness
unaccounted for by the observed variables.
By epistemic uncertainty, we denote uncertainty
over a model's parameters, the sort of uncertainty
that we can hope to reduce by collecting more data.
We might have epistemic uncertainty
concerning the probability
that a coin turns up heads,
but even once we know this probability,
we are left with aleatoric uncertainty
about the outcome of any future toss.
No matter how long we watch someone tossing a fair coin,
we will never be more or less than 50% certain
that the next toss will come up heads.
These terms owe to the literature on mechanical modeling
(see e.g., :citet:`Der-Kiureghian.Ditlevsen.2009`
for a review on this aspect of uncertainty quantification).
It is worth noting that these terms constitute a slight abuse of language.
The term epistemic refers to anything concerning knowledge
and thus in the philosophical sense, all uncertainty is epistemic.
We saw that sampling data from some unknown probability distribution
can provide us with information that can be used to estimate
the parameters of the data generating distribution.
That said, the rate at which this is possible can be quite slow.
In our coin tossing example (and many others)
we can do no better than to design estimators
that converge at a rate of $1/\sqrt{n}$,
where $n$ is the sample size, e.g., the number of tosses
(see, e.g., :citet:`Revels.Lubin.Papamarkou.2016`).
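The $1/\sqrt{n}$ rate is easy to observe in simulation. The sketch below (a rough empirical check, not a proof) estimates $P(\mathrm{heads})$ from $n$ simulated tosses, repeats the experiment many times, and reports how the average absolute error shrinks as $n$ grows:

```python
import numpy as np

rng = np.random.default_rng(0)

# Error of the frequency estimate of P(heads) = 0.5 as a function of sample size.
# The average absolute error should shrink roughly like 1/sqrt(n).
for n in [10, 100, 1_000, 10_000]:
    tosses = rng.integers(0, 2, size=(500, n))   # 500 repetitions of n fair tosses
    errors = np.abs(tosses.mean(axis=1) - 0.5)
    print(f"n={n:>6}  mean |error| = {errors.mean():.4f}  "
          f"1/sqrt(n) = {1 / np.sqrt(n):.4f}")
```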
We also sharpened our language and tools for statistical modeling.
In the process we learned about conditional probabilities
and about one of the most important equations in statistics---Bayes' theorem.
It is an effective tool for decoupling information conveyed by data
through a likelihood term $P(B \mid A)$,
which captures how well the observations $B$ match a hypothesis $A$,
and a prior $P(A)$,
which governs how plausible a particular hypothesis was in the first place.
Lastly, we introduced a first set of nontrivial questions
about the effect of a specific probability distribution,
namely expectations and variances.
While there are many more than just linear and quadratic
expectations for a probability distribution,
these two already provide a good deal of knowledge
about the possible behavior of the distribution.
For instance, Chebyshev's inequality states that
$P(|X - \mu| \geq k \sigma) \leq 1/k^2$,
where $\mu$ is the expectation, $\sigma^2$ is the variance of the distribution,
and $k > 1$ is a confidence parameter of our choosing.
It tells us that draws from a distribution lie,
with at least 50% probability,
within a $[\mu - \sqrt{2} \sigma, \mu + \sqrt{2} \sigma]$
interval centered on the expectation.
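Chebyshev's bound is loose but fully general. As a quick empirical check (using an arbitrarily chosen exponential distribution), we can compare the actual tail probability with the bound $1/k^2$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Empirical check of Chebyshev's inequality P(|X - mu| >= k*sigma) <= 1/k^2
# for an arbitrary distribution (here: exponential with rate 1).
x = rng.exponential(scale=1.0, size=1_000_000)
mu, sigma = x.mean(), x.std()

for k in [1.5, 2.0, 3.0]:
    empirical = np.mean(np.abs(x - mu) >= k * sigma)
    print(f"k={k}:  empirical tail prob = {empirical:.4f}  "
          f"<=  bound 1/k^2 = {1 / k**2:.4f}")
```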
- Give an example where observing more data can reduce the amount of uncertainty about the outcome to an arbitrarily low level.
- Give an example where observing more data will only reduce the amount of uncertainty up to a point and then no further. Explain why this is the case and where you expect this point to occur.
- We empirically demonstrated convergence to the mean for the toss of a coin. Calculate the variance of the estimate of the probability that we see a head after drawing $n$ samples.
  - How does the variance scale with the number of observations?
  - Use Chebyshev's inequality to bound the deviation from the expectation.
  - How does it relate to the central limit theorem?
- Assume that we draw $m$ samples $x_i$ from a probability distribution with zero mean and unit variance. Compute the averages $z_m \stackrel{\mathrm{def}}{=} m^{-1} \sum_{i=1}^m x_i$. Can we apply Chebyshev's inequality for every $z_m$ independently? Why not?
- Given two events with probability $P(\mathcal{A})$ and $P(\mathcal{B})$, compute upper and lower bounds on $P(\mathcal{A} \cup \mathcal{B})$ and $P(\mathcal{A} \cap \mathcal{B})$. Hint: graph the situation using a Venn diagram.
- Assume that we have a sequence of random variables, say $A$, $B$, and $C$, where $B$ only depends on $A$, and $C$ only depends on $B$. Can you simplify the joint probability $P(A, B, C)$? Hint: this is a Markov chain.
- In :numref:`subsec_probability_hiv_app`, assume that the outcomes of the two tests are not independent. In particular assume that either test on its own has a false positive rate of 10% and a false negative rate of 1%. That is, assume that $P(D = 1 \mid H = 0) = 0.1$ and that $P(D = 0 \mid H = 1) = 0.01$. Moreover, assume that for $H = 1$ (infected) the test outcomes are conditionally independent, i.e., that $P(D_1, D_2 \mid H = 1) = P(D_1 \mid H = 1) P(D_2 \mid H = 1)$, but that for healthy patients the outcomes are coupled via $P(D_1 = D_2 = 1 \mid H = 0) = 0.02$.
  - Work out the joint probability table for $D_1$ and $D_2$, given $H = 0$, based on the information you have so far.
  - Derive the probability of the patient being positive ($H = 1$) after one test returns positive. You can assume the same baseline probability $P(H = 1) = 0.0015$ as before.
  - Derive the probability of the patient being positive ($H = 1$) after both tests return positive.
- Assume that you are an asset manager for an investment bank and you have a choice of stocks $s_i$ to invest in. Your portfolio needs to add up to $1$ with weights $\alpha_i$ for each stock. The stocks have an average return $\boldsymbol{\mu} = E_{\mathbf{s} \sim P}[\mathbf{s}]$ and covariance $\boldsymbol{\Sigma} = \mathrm{Cov}_{\mathbf{s} \sim P}[\mathbf{s}]$.
  - Compute the expected return for a given portfolio $\boldsymbol{\alpha}$.
  - If you wanted to maximize the return of the portfolio, how should you choose your investment?
  - Compute the variance of the portfolio.
  - Formulate an optimization problem of maximizing the return while keeping the variance constrained to an upper bound. This is the Nobel-Prize-winning Markovitz portfolio :cite:`Mangram.2013`. To solve it you will need a quadratic programming solver, something way beyond the scope of this book.
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: