Skip to content

Commit

Permalink
Rename event-->outcome, work on multinomial section
Browse files Browse the repository at this point in the history
  • Loading branch information
jhamrick committed Jul 6, 2014
1 parent 07addaf commit 90f14bc
Show file tree
Hide file tree
Showing 2 changed files with 270 additions and 65 deletions.
275 changes: 238 additions & 37 deletions sampler/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -131,42 +131,119 @@ instances of the *multinomial distribution*.

## The multinomial distribution

The multinomial distribution is used when you have several categories
of events, and you want to characterize the probability of some
combination of events happening. The classic example used to describe
the multinomial distribution is the *ball and urn* example. The idea
is that you have an urn with different colored balls in it (for
example, 30% red, 20% blue, and 50% green). You pull out a ball,
record its color, put it back in the urn, and then repeat this
multiple times. The multinomial distribution then tells you what the
probability is of selecting the balls that you did.
The multinomial distribution is used when you have several possible
outcomes, and you want to characterize the probability of each of
those outcomes occurring. The classic example used to describe the
multinomial distribution is the *ball and urn* example. The idea is
that you have an urn with different colored balls in it (for example,
30% red, 20% blue, and 50% green). You pull out a ball, record its
color, put it back in the urn, and then repeat this multiple times. In
this case, an *outcome* corresponds to drawing a ball of a particular
color, and the probability of each outcome corresponds to the
proportion of balls of that color (e.g., for the outcome of drawing a
blue ball, the probability is $p_{blue}=0.20$). The multinomial
distribution is then used to describe the possible combinations of
outcomes when multiple balls are drawn (e.g., two green and one blue).

Note: the code in this section is also located in the file
`multinomial.py`.

### The `MultinomialDistribution` class

In general, there are two functions that we might want to do with a
distribution: we might want to *sample* from that distribution, and we
might want to *evaluate the probability* of a sample (or samples)
under that distribution's PMF or PDF. While the actual computations
needed to perform these two functions are pretty different, they rely
on a common piece of information: what the *parameters* of the
distribution is. In the case of the multinomial distribution, the
parameters are the event probabilities, $p$, which correspond to the
proportions of the different colored balls in the urn example above.
In general, there are two use cases for a distribution: we might want
to *sample* from that distribution, and we might want to *evaluate the
probability* of a sample (or samples) under that distribution's PMF or
PDF. While the actual computations needed to perform these two
functions are fairly different, they rely on a common piece of
information: what the *parameters* of the distribution are. In the
case of the multinomial distribution, the parameters are the event
probabilities, $p$ (which correspond to the proportions of the
different colored balls in the urn example above).

The simplest solution would be to simply create two functions that
both take the same parameters, but are otherwise independent. However,
I will usually opt to use a class for representing my
distributions. There are several advantages to doing so:

1. You only need to pass in the parameters once, when creating the
class.
2. There are additional attributes we might want to know about a
distribution: the mean, variance, derivative, etc. Once we have
even a handful of functions that operate on a common object, it is
even more convenient to use a class rather than passing the same
parameters to many different functions.
3. It is usually a good idea to check that the parameter values are
valid (for example, in the case of the multinomial distribution,
the vector $p$ of event probabilities should sum to 1). It is much
more efficient to do this check once, in the constructor of the
class, rather than every time one of the functions is called.
4. Sometimes computing the PMF or PDF involves computing constant
values (given the parameters). With a class, we can pre-compute
these constants in the constructor, rather than having to compute
them every time the PMF or PDF function is called.

Here is the constructor code for the class:

```python
class MultinomialDistribution(object):

def __init__(self, p, rso=None):
"""Initialize the multinomial random variable.
Parameters
----------
p: numpy array with shape (k,)
The event probabilities
rso: numpy RandomState object (default: None)
The random number generator
"""
# Check that the probabilities sum to 1 -- if they don't, then
# something is wrong.
if not np.isclose(np.sum(p), 1.0):
raise ValueError("event probabilities do not sum to 1")

# Store the parameters that were passed in
self.p = p
self.rso = rso

# Precompute log probabilities, for use by the log-PMF.
self.logp = np.log(self.p)

# Get the appropriate function for generating the random
# samples, depending on whether we're using a RandomState
# object or not.
if self.rso:
self._sample_func = self.rso.multinomial
else:
self._sample_func = np.random.multinomial
```

The class takes as arguments the event probabilities, $p$, and a
variable called `rso`. First, the constructor checks that the
parameters are valid (i.e., that `p` sums to 1). It then stores the
arguments that were passed in, and uses the event probabilities to
compute the event *log* probabilities (we'll go into why this is
necessary in a bit). Finally, it determines which function to use for
sampling according to the value of `rso` (we'll talk about what the
`rso` object is and why we need to choose different sampling functions
in the next section).

TODO: include discussion about variable names (descriptive vs math?)

Because of this shared information, we opt to use a class to represent
a multinomial distribution. The class takes as parameters the event
probabilities, $p$, and a variable called `rso`. This variable is a
`RandomState` object, which is essentially a random number
generator.
### Sampling from a multinomial distribution

Taking a sample from a multinomial distribution is actuall fairly
straightforward, because NumPy provides us with a function that
already does it: `np.random.multinomial`.

TODO: talk about checking that parameter values are correct
> NumPy includes functions to draw samples from many different types
> of distributions. For a full list, take a look at the
> [random sampling module](http://docs.scipy.org/doc/numpy/reference/routines.random.html),
> `np.random`.
TODO: talk about precomputing values, like the log probabilities
Despite the fact that this function already exists, there are a few
design decisions surrounding it that we can make.

#### Seeding the random number generator

Expand All @@ -181,7 +258,7 @@ random numbers. We can accomplish this through use of a NumPy
`RandomState` object, which is essentially a random number generator
object that can be passed around. It has most of the same functions as
`np.random`; the difference is that we get to control where the random
numbers come from.
numbers come from. We create it as follows:

```python
>>> import numpy as np
Expand All @@ -191,7 +268,33 @@ numbers come from.
where the number passed to the `RandomState` constructor is the *seed*
for the random number generator. As long as we instantiate it with the
same seed, a `RandomState` object will produce the same "random"
numbers in the same order, thus ensuring replicability.
numbers in the same order, thus ensuring replicability:

```python
>>> rso.rand()
0.5356709186237074
>>> rso.rand()
0.6190581888276206
>>> rso.rand()
0.23143573416770336
>>> rso.seed(230489)
>>> rso.rand()
0.5356709186237074
>>> rso.rand()
0.6190581888276206
```

Earlier, we saw that the constructor took an argument called `rso` and
used it to determine the sampling function. This `rso` variable is a
`RandomState` object that has already been initialized. I like to make
the `RandomState` object an optional parameter: it is occasionally
convenient to not be *forced* to use it, but I do want to have the
*option* of using it (which, if I were to just use the `np.random`
function, I would not be able to do).

So, if the `rso` variable is not given, then the constructor defaults
to using `np.random.multinomial`. Otherwise, it uses the multinomial
sampler from the `RandomState` object itself.

> Aside: the functions in `np.random` actually do rely on a random
> number generator that we can control -- NumPy's global random number
Expand All @@ -204,15 +307,57 @@ numbers in the same order, thus ensuring replicability.
> is easier to find out whether there is nondeterminism coming from
> somewhere other than your own code.
### Sampling from a multinomial distribution
#### What's a parameter?

Once we've decided whether to use `np.random.multinomial` or
`rso.multinomial`, sampling is just a matter of calling the
appropriate function. However, there is one other decision that we
might consider: what counts as a parameter.

Earlier, I said that the outcome probabilities, $p$, were the
parameters of the multinomial distribution. However, depending on who
you ask, the number of events, $n$, can *also* be a parameter of the
multinomial distribution. So, why didn't we include $n$ as an argument
to the constructor?

This question, while relatively specific to the multinomial
distribution, actually comes up fairly frequently when dealing with
probability distributions, and the answer really depends on the use
case. For a multinomial, can you make the assumption that the number
of events is always the same? If so, then it might be better to pass
in $n$ as an argument to the constructor. If not, then requiring $n$
to be specified at object creation time could be very restrictive, and
might even require you to create a new distribution object every time
you need to draw a sample!

I usually don't like to be that restricted by my code, and thus choose
to have `n` be an argument to the `sample` function, rather than
having it be an argument to the constructor. An alternate solution
could be to have `n` be an argument to the constructor, but also
include methods to allow for the value of `n` to be changed, without
having to create an entirely new object. For our purposes, though,
this solution is probably overkill, so we'll stick to just having it
be an argument to `sample`:

So, to actually implement a method that samples from the multinomial
distribution, we can use the multinomial sampler from NumPy, and pass
in a `RandomState` object for reproducibility. I like to make the
`RandomState` object an optional parameter: it is occasionally
convenient to not be *forced* to use it, but I do want to have the
*option* of using it (which, if I were to just use the `np.random`
function, I would not be able to do).
```python
def sample(self, n):
"""Samples draws of `n` events from a multinomial distribution with
outcome probabilities `self.p`.
Parameters
----------
n: integer
The number of total events
Returns
-------
numpy array with shape (k,)
The sampled number of occurrences for each outcome
"""
x = self._sample_func(n, self.p)
return x
```

### Evaluating the multinomial PMF

Expand Down Expand Up @@ -351,10 +496,66 @@ otherwise be lost.

TODO

```python
def logpmf(self, x):
"""Evaluates the log-probability mass function (log-PMF) of a
multinomial with outcome probabilities `self.p` for a draw `x`.
Parameters
----------
x: numpy array with shape (k,)
The number of occurrences of each outcome
Returns
-------
The evaluated log-PMF for draw `x`
"""
# Get the total number of events.
n = np.sum(x)

# equivalent to log(n!)
numerator = gammaln(n + 1)
# equivalent to log(x1! * ... * xk!)
denominator = np.sum(gammaln(x + 1))

# If one of the values of self.p is 0, then the corresponding
# value of self.logp will be -inf. If the corresponding value
# of x is 0, then multiplying them together will give nan, but
# we want it to just be 0.
all_weights = self.logp * x
all_weights[x == 0] = 0
# equivalent to log(p1^x1 * ... * pk^xk)
weights = np.sum(all_weights)

# Put it all together.
log_pmf = numerator - denominator + weights
return log_pmf
```

If we really do need the PMF, and not the log-PMF, we can still
compute it. However, it is generally better to *first* compute it in
log-space, and then exponentiate it if we need to take it out of
log-space.
log-space:

```python
def pmf(self, x):
"""Evaluates the probability mass function (PMF) of a multinomial
with outcome probabilities `self.p` for a draw `x`.
Parameters
----------
x: numpy array with shape (k,)
The number of occurrences of each outcome
Returns
-------
The evaluated PMF for draw `x`
"""
pmf = np.exp(self.logpmf(x))
return pmf
```

To further drive home the point of why working in log-space is so
important, we can look at an example just with the multinomial:
Expand All @@ -379,7 +580,7 @@ inf
>>> gammaln(1000 + 1)
5912.1281784881639
```

## Sampling magical items, revisited

Now that we have written our multinomial functions, we can put them to
Expand Down
Loading

0 comments on commit 90f14bc

Please sign in to comment.