Merge pull request #409 from hyanwong/vgamma-default
Fill out more rescaling explanation
hyanwong authored Jun 11, 2024
2 parents cc617de + c12d90a commit 8d8bf20
Showing 3 changed files with 37 additions and 20 deletions.
14 changes: 12 additions & 2 deletions docs/methods.md
@@ -68,6 +68,8 @@ Pros
: Old nodes do not suffer from time-discretisation issues caused by forcing
bounds on the oldest times
: Iterative updating properly accounts for cycles in the genealogy
: No need to specify prior times
: Can account for variable population sizes using rescaling

Cons
: Assumes posterior times can be reasonably modelled by gamma distributions
@@ -89,6 +91,7 @@ local estimates to each gamma distribution are iteratively refined until
they converge to a stable solution. This comes under a class of approaches
sometimes known as "loopy belief propagation".
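As a loose illustration of what it means to model each posterior with a gamma distribution, one can match a gamma to a target mean and variance by moment matching. This is a hypothetical sketch, not tsdate's actual implementation; the function name and interface are invented for illustration.

```python
# Hypothetical sketch: fit a gamma distribution to a target mean and
# variance by moment matching. Not part of the tsdate API.

def gamma_from_moments(mean, var):
    # For a Gamma(shape, rate): mean = shape / rate, var = shape / rate**2
    shape = mean ** 2 / var
    rate = mean / var
    return shape, rate

# e.g. a node age with mean 10 and variance 5 (in generations)
shape, rate = gamma_from_moments(10.0, 5.0)
```

Iterative refinement then amounts to repeatedly updating each node's (shape, rate) pair in the light of its neighbours' current estimates.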

(sec_methods_expectation_propagation)=
#### Expectation propagation

We are in the process of writing a formal description of the algorithm, but in
@@ -115,8 +118,15 @@ ts = tsdate.date(input_ts, mutation_rate=1e-8, progress=True)
(sec_rescaling)=
#### Rescaling

The `variational_gamma` method implements a step that we call *rescaling*, which can account for
the effects of variable population size through time.
During each EP step, the `variational_gamma` method applies an additional process
that we call *rescaling*, which helps to account for the effects of variable population
size through time. Time is divided into a number of intervals, and times within each
interval are simultaneously scaled such that the expected density of mutations along each
path from a sample to the root best matches the mutational density predicted from the
user-provided mutation rate. The number of intervals can be specified using the
`rescaling_intervals` parameter. If set to 0, no rescaling is performed; dates may then
be inaccurately estimated if the dataset comes from a set of samples with a complex
demographic history.
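The interval-scaling idea can be sketched as follows. This is a simplified illustration, not tsdate's actual code: the function name and inputs are invented, and the real algorithm operates on the full tree sequence rather than precomputed per-interval summaries.

```python
# Hypothetical sketch of rescaling: stretch or shrink each time interval
# so that the expected number of mutations in it matches the observed
# number. All names here are illustrative, not the tsdate API.

def rescale_intervals(breaks, observed_muts, lineage_bp, mutation_rate):
    """Return rescaled time breakpoints.

    breaks        : sorted time breakpoints, e.g. [0, 10, 20]
    observed_muts : mutations falling in each interval
    lineage_bp    : base pairs of lineage present in each interval
                    (edge spans summed over branches crossing it)
    mutation_rate : per-bp, per-generation mutation rate
    """
    new_breaks = [breaks[0]]
    for i in range(len(breaks) - 1):
        duration = breaks[i + 1] - breaks[i]
        expected = mutation_rate * lineage_bp[i] * duration
        # scale factor matching observed to expected mutational density
        factor = observed_muts[i] / expected if expected > 0 else 1.0
        new_breaks.append(new_breaks[-1] + duration * factor)
    return new_breaks

# Twice as many mutations as expected in the second interval,
# so that interval is stretched to twice its original duration
rescaled = rescale_intervals([0, 10, 20], [100, 200], [1e6, 1e6], 1e-5)
```

Intervals with a surfeit of mutations (e.g. spanning a population bottleneck, where coalescence is rapid but mutations still accumulate) are stretched; mutation-poor intervals are compressed.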

TODO: describe the rescaling step in more detail. Could also link to [the population size docs](sec_popsize)

30 changes: 18 additions & 12 deletions docs/popsize.md
@@ -37,10 +37,11 @@ import msprime
import demesdraw
from matplotlib import pyplot as plt
bottleneck_time = 10000
demography = msprime.Demography()
demography.add_population(name="Population", initial_size=5e4)
demography.add_population_parameters_change(time=10000, initial_size=100)
demography.add_population_parameters_change(time=10080, initial_size=2e3)
demography.add_population_parameters_change(time=bottleneck_time, initial_size=100)
demography.add_population_parameters_change(time=bottleneck_time + 80, initial_size=2e3)
mutation_rate = 1e-8
# Simulate a short tree sequence with a population size history.
@@ -49,7 +50,11 @@ ts = msprime.sim_ancestry(
ts = msprime.sim_mutations(ts, rate=mutation_rate, random_seed=321)
fig, ax = plt.subplots(1, 1, figsize=(4, 6))
demesdraw.tubes(demography.to_demes(), ax, scale_bar=True)
ax.annotate("bottleneck", xy=(0, 8e3), xytext=(1e4, 8.2e3), arrowprops=dict(arrowstyle="->"))
ax.annotate(
"bottleneck",
xy=(0, bottleneck_time),
xytext=(1e4, bottleneck_time * 1.04),
arrowprops=dict(arrowstyle="->"))
```

To test how well tsdate does in this situation, we can redate the known (true) tree sequence topology,
@@ -119,17 +124,18 @@ ax.set_ylabel("Population size", rotation=90);
## Misspecified priors

The flat prior for the default `variational_gamma` [method](sec_methods) is robust to
deviations from neutrality and panmixia. However, approaches such as the `inside_outside`
method by default use a coalescent prior which assumes a fixed population size, and hence
these perform very poorly on such data:
deviations from neutrality and panmixia. However, alternative approaches such as the
`inside_outside` method default to a coalescent prior that assumes a fixed population size.
Hence these approaches currently perform very poorly on such data:

```{code-cell} ipython3
import tsdate
est_pop_size = ts.diversity() / (4 * mutation_rate) # calculate av Ne from data
redated_ts = tsdate.inside_outside(ts, mutation_rate=mutation_rate, population_size=est_pop_size)
unconstr_times = [nd.metadata.get("mn", nd.time) for nd in redated_ts.nodes()]
fig, ax = plt.subplots(1, 1, figsize=(15, 3))
plot_real_vs_tsdate_times(ax, ts.nodes_time, unconstr_times, ts, redated_ts, alpha=0.1)
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
title = "inside_outside method; prior based on fixed Ne"
plot_real_vs_tsdate_times(ax, ts.nodes_time, unconstr_times, ts, redated_ts, alpha=0.1, title=title)
```

If you cannot use the `variational_gamma` method,
@@ -142,15 +148,15 @@ for its use and interpretation.

### Estimating Ne from data

If you are constructing a coalescent prior, but don't have an established estimate
for the effective population size of your data,
a rough approximation is to use the (sitewise) genetic diversity divided by
four-times the mutation rate:
In the example above, in the absence of an established effective population size for use with the
`inside_outside` method, we approximated a value from the data. The standard way to do this
is to divide the (sitewise) genetic diversity by four times the mutation rate:

```{code-cell} ipython3
print("A rough estimate of the effective population size is", ts.diversity() / (4 * mutation_rate))
```
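This estimator simply rearranges the neutral-theory expectation that pairwise diversity π ≈ 4·Ne·μ. As a minimal sketch (illustrative only, not a tsdate function):

```python
# Minimal sketch: under neutrality, pairwise diversity pi ~= 4 * Ne * mu,
# so Ne can be approximated from observed diversity. Illustrative only.

def estimate_ne(pi, mu):
    return pi / (4 * mu)

# e.g. diversity of 4e-4 per bp with mu = 1e-8 implies Ne of about 10,000
ne = estimate_ne(4e-4, 1e-8)
```

Note this gives a single long-term average: under a complex demographic history (such as the bottleneck simulated above) no single Ne describes the data well.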


<!--
## Setting variable population sizes
13 changes: 7 additions & 6 deletions docs/usage.md
@@ -282,12 +282,13 @@ has not been parallelised to allow running on many CPU cores.

#### Continuous time optimisations

TODO: Explain the `max_iterations` parameter

:::{note}
Rather than using a fixed number of iterations, future updates to _tsdate_ may
carry out iterations until some convergence criterion has been reached.
:::
The [expectation propagation](sec_methods_expectation_propagation) approach
used by the `variational_gamma` method iteratively refines the
local time estimates. The number of rounds of iteration can be
set via the `max_iterations` parameter. Reducing this will speed up _tsdate_
inference, but may produce worse date estimates. Future updates to _tsdate_ may
optimise this step so that iteration halts once an appropriate convergence
criterion has been reached.
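The trade-off between a fixed iteration cap and a convergence criterion can be illustrated with a generic fixed-point iteration. This is a toy example, not tsdate's EP update:

```python
# Toy illustration of capping iterative refinement: run an update
# function until convergence or until max_iterations is hit.
# Not tsdate code; names are illustrative.

def iterate(update, x0, max_iterations, tol=1e-9):
    x = x0
    for n in range(1, max_iterations + 1):
        x_new = update(x)
        if abs(x_new - x) < tol:
            return x_new, n  # converged early
        x = x_new
    return x, max_iterations  # hit the iteration cap

# Newton's update for sqrt(2): fewer iterations are faster but less accurate
update = lambda x: 0.5 * (x + 2.0 / x)
rough, _ = iterate(update, 1.0, max_iterations=2)
precise, _ = iterate(update, 1.0, max_iterations=50)
```

With a small cap the loop returns a rough answer quickly; with a generous cap it stops as soon as successive estimates agree to within `tol`, which is the behaviour a convergence criterion would provide.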


