Merge pull request #409 from hyanwong/vgamma-default
Fill out more rescaling explanation
hyanwong authored Jun 11, 2024
2 parents cc617de + c12d90a commit 8d8bf20
Showing 3 changed files with 37 additions and 20 deletions.
14 changes: 12 additions & 2 deletions docs/methods.md
@@ -68,6 +68,8 @@ Pros
: Old nodes do not suffer from time-discretisation issues caused by forcing
bounds on the oldest times
: Iterative updating properly accounts for cycles in the genealogy
: No need to specify prior times
: Can account for variable population sizes using rescaling

Cons
: Assumes posterior times can be reasonably modelled by gamma distributions
@@ -89,6 +91,7 @@ local estimates to each gamma distribution are iteratively refined until
they converge to a stable solution. This comes under a class of approaches
sometimes known as "loopy belief propagation".
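As a loose illustration of what it means to model each posterior with a gamma distribution, one can match a gamma to a target mean and variance by moment matching. This is a hypothetical sketch, not tsdate's actual implementation; the function name and interface are invented for illustration.

```python
# Hypothetical sketch: fit a gamma distribution to a target mean and
# variance by moment matching. Not part of the tsdate API.

def gamma_from_moments(mean, var):
    # For a Gamma(shape, rate): mean = shape / rate, var = shape / rate**2
    shape = mean ** 2 / var
    rate = mean / var
    return shape, rate

# e.g. a node age with mean 10 and variance 5 (in generations)
shape, rate = gamma_from_moments(10.0, 5.0)
```

Iterative refinement then amounts to repeatedly updating each node's (shape, rate) pair in the light of its neighbours' current estimates.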

(sec_methods_expectation_propagation)=
#### Expectation propagation

We are in the process of writing a formal description of the algorithm, but in
@@ -115,8 +118,15 @@ ts = tsdate.date(input_ts, mutation_rate=1e-8, progress=True)
(sec_rescaling)=
#### Rescaling

The `variational_gamma` method implements a step that we call *rescaling*, which can account for
the effects of variable population size through time.
During each EP step, the `variational_gamma` method applies an additional process
that we call *rescaling*, which helps to account for the effects of variable population
size through time. Time is divided into a number of intervals, and times within each
interval are simultaneously scaled such that the expected density of mutations along each
path from a sample to the root best matches the mutational density predicted from the
user-provided mutation rate. The number of intervals can be specified using the
`rescaling_intervals` parameter. If set to 0, no rescaling is performed; dates may then
be inaccurately estimated if the dataset comes from a set of samples with a complex
demographic history.
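The interval-scaling idea can be sketched as follows. This is a simplified illustration, not tsdate's actual code: the function name and inputs are invented, and the real algorithm operates on the full tree sequence rather than precomputed per-interval summaries.

```python
# Hypothetical sketch of rescaling: stretch or shrink each time interval
# so that the expected number of mutations in it matches the observed
# number. All names here are illustrative, not the tsdate API.

def rescale_intervals(breaks, observed_muts, lineage_bp, mutation_rate):
    """Return rescaled time breakpoints.

    breaks        : sorted time breakpoints, e.g. [0, 10, 20]
    observed_muts : mutations falling in each interval
    lineage_bp    : base pairs of lineage present in each interval
                    (edge spans summed over branches crossing it)
    mutation_rate : per-bp, per-generation mutation rate
    """
    new_breaks = [breaks[0]]
    for i in range(len(breaks) - 1):
        duration = breaks[i + 1] - breaks[i]
        expected = mutation_rate * lineage_bp[i] * duration
        # scale factor matching observed to expected mutational density
        factor = observed_muts[i] / expected if expected > 0 else 1.0
        new_breaks.append(new_breaks[-1] + duration * factor)
    return new_breaks

# Twice as many mutations as expected in the second interval,
# so that interval is stretched to twice its original duration
rescaled = rescale_intervals([0, 10, 20], [100, 200], [1e6, 1e6], 1e-5)
```

Intervals with a surfeit of mutations (e.g. spanning a population bottleneck, where coalescence is rapid but mutations still accumulate) are stretched; mutation-poor intervals are compressed.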

TODO: describe the rescaling step in more detail. Could also link to [the population size docs](sec_popsize)

30 changes: 18 additions & 12 deletions docs/popsize.md
@@ -37,10 +37,11 @@ import msprime
import demesdraw
from matplotlib import pyplot as plt
bottleneck_time = 10000
demography = msprime.Demography()
demography.add_population(name="Population", initial_size=5e4)
demography.add_population_parameters_change(time=10000, initial_size=100)
demography.add_population_parameters_change(time=10080, initial_size=2e3)
demography.add_population_parameters_change(time=bottleneck_time, initial_size=100)
demography.add_population_parameters_change(time=bottleneck_time + 80, initial_size=2e3)
mutation_rate = 1e-8
# Simulate a short tree sequence with a population size history.
@@ -49,7 +50,11 @@ ts = msprime.sim_ancestry(
ts = msprime.sim_mutations(ts, rate=mutation_rate, random_seed=321)
fig, ax = plt.subplots(1, 1, figsize=(4, 6))
demesdraw.tubes(demography.to_demes(), ax, scale_bar=True)
ax.annotate("bottleneck", xy=(0, 8e3), xytext=(1e4, 8.2e3), arrowprops=dict(arrowstyle="->"))
ax.annotate(
"bottleneck",
xy=(0, bottleneck_time),
xytext=(1e4, bottleneck_time * 1.04),
arrowprops=dict(arrowstyle="->"))
```

To test how well tsdate does in this situation, we can redate the known (true) tree sequence topology,
@@ -119,17 +124,18 @@ ax.set_ylabel("Population size", rotation=90);
## Misspecified priors

The flat prior for the default `variational_gamma` [method](sec_methods) is robust to
deviations from neutrality and panmixia. However, approaches such as the `inside_outside`
method by default use a coalescent prior which assumes a fixed population size, and hence
these perform very poorly on such data:
deviations from neutrality and panmixia. However, alternative approaches such as the
`inside_outside` method default to a coalescent prior that assumes a fixed population size.
Hence these approaches currently perform very poorly on such data:

```{code-cell} ipython3
import tsdate
est_pop_size = ts.diversity() / (4 * mutation_rate) # calculate av Ne from data
redated_ts = tsdate.inside_outside(ts, mutation_rate=mutation_rate, population_size=est_pop_size)
unconstr_times = [nd.metadata.get("mn", nd.time) for nd in redated_ts.nodes()]
fig, ax = plt.subplots(1, 1, figsize=(15, 3))
plot_real_vs_tsdate_times(ax, ts.nodes_time, unconstr_times, ts, redated_ts, alpha=0.1)
fig, ax = plt.subplots(1, 1, figsize=(5, 5))
title = "inside_outside method; prior based on fixed Ne"
plot_real_vs_tsdate_times(ax, ts.nodes_time, unconstr_times, ts, redated_ts, alpha=0.1, title=title)
```

If you cannot use the `variational_gamma` method,
@@ -142,15 +148,15 @@ for its use and interpretation.

### Estimating Ne from data

If you are constructing a coalescent prior, but don't have an established estimate
for the effective population size of your data,
a rough approximation is to use the (sitewise) genetic diversity divided by
four-times the mutation rate:
In the example above, in the absence of an established effective population size for use with the
`inside_outside` method, we approximated a value from the data. The standard way to do this
is to divide the (sitewise) genetic diversity by four times the mutation rate:

```{code-cell} ipython3
print("A rough estimate of the effective population size is", ts.diversity() / (4 * mutation_rate))
```
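This estimator simply rearranges the neutral-theory expectation that pairwise diversity π ≈ 4·Ne·μ. As a minimal sketch (illustrative only, not a tsdate function):

```python
# Minimal sketch: under neutrality, pairwise diversity pi ~= 4 * Ne * mu,
# so Ne can be approximated from observed diversity. Illustrative only.

def estimate_ne(pi, mu):
    return pi / (4 * mu)

# e.g. diversity of 4e-4 per bp with mu = 1e-8 implies Ne of about 10,000
ne = estimate_ne(4e-4, 1e-8)
```

Note this gives a single long-term average: under a complex demographic history (such as the bottleneck simulated above) no single Ne describes the data well.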


<!--
## Setting variable population sizes
13 changes: 7 additions & 6 deletions docs/usage.md
@@ -282,12 +282,13 @@ has not been parallelised to allow running on many CPU cores.

#### Continuous time optimisations

TODO: Explain the `max_iterations` parameter

:::{note}
Rather than using a fixed number of iterations, future updates to _tsdate_ may
carry out iterations until some convergence criterion has been reached.
:::
The [expectation propagation](sec_methods_expectation_propagation) approach
used by the `variational_gamma` method iteratively refines the
local time estimates. The number of rounds of iteration can be
set via the `max_iterations` parameter. Reducing this will speed up _tsdate_
inference, but may produce worse date estimates. Future updates to _tsdate_ may
optimise this step so that iteration halts once an appropriate convergence
criterion has been reached.
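The trade-off between a fixed iteration cap and a convergence criterion can be illustrated with a generic fixed-point iteration. This is a toy example, not tsdate's EP update:

```python
# Toy illustration of capping iterative refinement: run an update
# function until convergence or until max_iterations is hit.
# Not tsdate code; names are illustrative.

def iterate(update, x0, max_iterations, tol=1e-9):
    x = x0
    for n in range(1, max_iterations + 1):
        x_new = update(x)
        if abs(x_new - x) < tol:
            return x_new, n  # converged early
        x = x_new
    return x, max_iterations  # hit the iteration cap

# Newton's update for sqrt(2): fewer iterations are faster but less accurate
update = lambda x: 0.5 * (x + 2.0 / x)
rough, _ = iterate(update, 1.0, max_iterations=2)
precise, _ = iterate(update, 1.0, max_iterations=50)
```

With a small cap the loop returns a rough answer quickly; with a generous cap it stops as soon as successive estimates agree to within `tol`, which is the behaviour a convergence criterion would provide.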


