Skip to content

Commit 2008cb3

Browse files
Merge pull request #176 from hyanwong/document-path-compression-mc
Set keep_unary on simplify, and document (including path compression docs)
2 parents 4d13b81 + 899ebf5 commit 2008cb3

File tree

8 files changed

+110
-60
lines changed

8 files changed

+110
-60
lines changed

docs/api.rst

Lines changed: 1 addition & 2 deletions
Original file line numberDiff line numberDiff line change
@@ -42,8 +42,7 @@ Running inference
4242

4343
.. autofunction:: tsinfer.match_samples
4444

45-
.. todo::
46-
1. Add documentation for path compression here.
45+
.. autofunction:: tsinfer.augment_ancestors
4746

4847
++++++++++
4948
Exceptions

docs/inference.rst

Lines changed: 25 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -161,6 +161,19 @@ given time, and an ancestor can copy from any older ancestor. For each
161161
ancestor, we find a path through older ancestors that minimises the
162162
number of recombination events.
163163

164+
As well as minimising recombination events by finding the best path, we can also
165+
minimise events by looking for *shared recombination breakpoints*. A shared
166+
breakpoint exists if a set of children share a breakpoint in the same position,
167+
and they also have identical parents to the left of the breakpoint and identical
168+
parents to the right of the breakpoint. Rather than supposing that these
169+
children experienced multiple identical recombination events in parallel, we can
170+
reduce the number of ancestral recombination events by postulating a "synthetic
171+
ancestor" with this breakpoint, existing at a slightly older point
172+
in time, from whom all the children are descended at this genomic position. We
173+
call the algorithm used to implement this addition to the ancestral copying
174+
paths, "path compression".
175+
176+
164177
.. todo:: Schematic of the ancestors copying process.
165178

166179
The copying path for each ancestor then describes its ancestry at every
@@ -183,7 +196,9 @@ The final phase of a ``tsinfer`` inference consists of a number steps:
183196
1. The first (and usually most time-consuming) is to find copying paths
184197
for our sample haplotypes through the ancestors. Each copying path
185198
corresponds to a set of tree sequence edges in precisely the same
186-
way as for ancestors.
199+
way as for ancestors, and the path compression algorithm can be equally
200+
applied here.
201+
187202

188203
2. As we only use a subset of the available sites for inference
189204
(excluding by default any sites that are fixed or singletons)
@@ -198,13 +213,18 @@ The final phase of a ``tsinfer`` inference consists of a number steps:
198213
may be clades of ancestral state samples which would allow us
199214
to encode the data with fewer back mutations.**
200215

201-
3. Reduce the resulting tree sequence to a canonical form by
202-
`simplifying it
203-
<https://tskit.readthedocs.io/en/latest/python-api.html#tskit.TreeSequence.simplify>`_.
216+
3. Remove ancestral paths that do not lead to any of the samples by
217+
`simplifying
218+
<https://tskit.readthedocs.io/en/latest/python-api.html#tskit.TreeSequence.simplify>`_
219+
the final tree sequence. When simplifying, we keep non-branching ("unary")
220+
nodes, as they indicate ancestors which we have actively inferred, and
221+
for technical reasons keeping unary ancestors can also lead to better
222+
compression. Note that this means that not every internal node in the
223+
inferred tree sequence will correspond to a coalescent event.
204224

205225
.. todo::
206226
1. Describe path compression here and above in the ancestors
207227
section
208-
2. Describe the structure of the outpt tree sequences; how the
228+
2. Describe the structure of the output tree sequences; how the
209229
nodes are mapped, what the time values mean, etc.
210230

docs/installation.rst

Lines changed: 2 additions & 4 deletions
Original file line numberDiff line numberDiff line change
@@ -4,8 +4,8 @@
44
Installation
55
############
66

7-
Python 3.3 or newer is required for ``tsinfer``. Any Unix-like platform should work
8-
(``tsinfer`` is tested on Linux and OSX).
7+
Python 3.5 or newer is required for ``tsinfer``. Any Unix-like platform should
8+
work (``tsinfer`` is tested on Linux, OS X, and Windows).
99

1010
Please use ``pip`` to install,
1111
e.g.::
@@ -33,8 +33,6 @@ first using `venv <https://docs.python.org/3/library/venv.html>`_::
3333
(tsinfer-venv) $ pip install tsinfer
3434
(tsinfer-venv) $ tsinfer --help
3535

36-
.. _sec_installation_installation_problems:
37-
3836
****************
3937
Potential issues
4038
****************

evaluation.py

Lines changed: 7 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -373,9 +373,13 @@ def edge_plot(ts, filename):
373373
for tree in ts.trees():
374374
left, right = tree.interval
375375
for u in tree.nodes():
376-
for c in tree.children(u):
377-
lines.append([(left, c), (right, c)])
378-
colours.append(pallete[unrank(tree.samples(c), n)])
376+
children = tree.children(u)
377+
# Don't bother plotting unary nodes, which will all have the same
378+
# samples under them as their next non-unary descendant
379+
if len(children) > 1:
380+
for c in children:
381+
lines.append([(left, c), (right, c)])
382+
colours.append(pallete[unrank(tree.samples(c), n)])
379383

380384
lc = mc.LineCollection(lines, linewidths=2, colors=colours)
381385
fig, ax = plt.subplots()

setup.py

Lines changed: 1 addition & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -49,7 +49,7 @@ def finalize_options(self):
4949
"tqdm",
5050
"humanize",
5151
"daiquiri",
52-
"tskit",
52+
"tskit>=0.2.1",
5353
"numcodecs>=0.6",
5454
"zarr>=2.2",
5555
"lmdb",

tests/test_inference.py

Lines changed: 12 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -610,13 +610,13 @@ def verify_inserted_ancestors(self, ts):
610610
with tsinfer.SampleData(sequence_length=ts.sequence_length) as sample_data:
611611
for v in ts.variants():
612612
sample_data.add_site(v.position, v.genotypes, v.alleles)
613-
614613
ancestor_data = tsinfer.AncestorData(sample_data)
615614
tsinfer.build_simulated_ancestors(sample_data, ancestor_data, ts)
616615
ancestor_data.finalise()
617616

618-
A = np.zeros(
619-
(ancestor_data.num_sites, ancestor_data.num_ancestors), dtype=np.uint8)
617+
A = np.full(
618+
(ancestor_data.num_sites, ancestor_data.num_ancestors),
619+
tskit.MISSING_DATA, dtype=np.int8)
620620
start = ancestor_data.ancestors_start[:]
621621
end = ancestor_data.ancestors_end[:]
622622
ancestors = ancestor_data.ancestors_haplotype[:]
@@ -628,8 +628,7 @@ def verify_inserted_ancestors(self, ts):
628628
tsinfer.check_ancestors_ts(ancestors_ts)
629629
self.assertEqual(ancestor_data.num_sites, ancestors_ts.num_sites)
630630
self.assertEqual(ancestor_data.num_ancestors, ancestors_ts.num_samples)
631-
self.assertTrue(np.array_equal(
632-
ancestors_ts.genotype_matrix(impute_missing_data=True), A))
631+
self.assertTrue(np.array_equal(ancestors_ts.genotype_matrix(), A))
633632
inferred_ts = tsinfer.match_samples(
634633
sample_data, ancestors_ts, engine=engine)
635634
self.assertTrue(np.array_equal(
@@ -1230,6 +1229,14 @@ def verify(self, ts):
12301229
list(ts2.samples()),
12311230
list(range(ts2.num_nodes - n, ts2.num_nodes)))
12321231

1232+
# Check that we're calling simplify with the correct arguments.
1233+
ts2 = tsinfer.infer(sd, simplify=False).simplify(keep_unary=True)
1234+
t1 = ts1.dump_tables()
1235+
t2 = ts2.dump_tables()
1236+
t1.provenances.clear()
1237+
t2.provenances.clear()
1238+
self.assertEqual(t1, t2)
1239+
12331240
def test_single_tree(self):
12341241
ts = msprime.simulate(5, random_seed=1, mutation_rate=2)
12351242
self.verify(ts)

tsinfer/eval_util.py

Lines changed: 2 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -705,6 +705,8 @@ def run_perfect_inference(
705705
num_threads=num_threads, extended_checks=extended_checks,
706706
progress_monitor=progress_monitor,
707707
stabilise_node_ordering=time_chunking and not path_compression)
708+
# to compare against the original, we need to remove unary nodes from the inferred TS
709+
inferred_ts = inferred_ts.simplify(keep_unary=False, filter_sites=False)
708710
return ts, inferred_ts
709711

710712

tsinfer/inference.py

Lines changed: 60 additions & 40 deletions
Original file line numberDiff line numberDiff line change
@@ -139,10 +139,10 @@ def verify(samples, tree_sequence, progress_monitor=None):
139139

140140

141141
def infer(
142-
sample_data, progress_monitor=None, num_threads=0, path_compression=True,
143-
simplify=True, engine=constants.C_ENGINE):
142+
sample_data, num_threads=0, path_compression=True, simplify=True,
143+
engine=constants.C_ENGINE, progress_monitor=None):
144144
"""
145-
infer(sample_data, num_threads=0)
145+
infer(sample_data, num_threads=0, path_compression=True, simplify=True)
146146
147147
Runs the full :ref:`inference pipeline <sec_inference>` on the specified
148148
:class:`SampleData` instance and returns the inferred
@@ -153,13 +153,19 @@ def infer(
153153
:param int num_threads: The number of worker threads to use in parallelised
154154
sections of the algorithm. If <= 0, do not spawn any threads and
155155
use simpler sequential algorithms (default).
156+
:param bool path_compression: Whether to merge edges that share identical
157+
paths (essentially taking advantage of shared recombination breakpoints).
158+
:param bool simplify: Whether to remove extra tree nodes and edges that are not
159+
on a path between the root and any of the samples. To do so, the final tree
160+
sequence is simplified by appling the :meth:`tskit.TreeSequence.simplify` method
161+
with ``keep_unary`` set to True (default = ``True``).
156162
:returns: The :class:`tskit.TreeSequence` object inferred from the
157163
input sample data.
158164
:rtype: tskit.TreeSequence
159165
"""
160166
ancestor_data = generate_ancestors(
161-
sample_data, engine=engine, progress_monitor=progress_monitor,
162-
num_threads=num_threads)
167+
sample_data, num_threads=num_threads,
168+
engine=engine, progress_monitor=progress_monitor,)
163169
ancestors_ts = match_ancestors(
164170
sample_data, ancestor_data, engine=engine, num_threads=num_threads,
165171
path_compression=path_compression, progress_monitor=progress_monitor)
@@ -171,8 +177,8 @@ def infer(
171177

172178

173179
def generate_ancestors(
174-
sample_data, num_threads=0, progress_monitor=None, engine=constants.C_ENGINE,
175-
**kwargs):
180+
sample_data, num_threads=0, path=None,
181+
engine=constants.C_ENGINE, progress_monitor=None, **kwargs):
176182
"""
177183
generate_ancestors(sample_data, num_threads=0, path=None, **kwargs)
178184
@@ -186,30 +192,32 @@ def generate_ancestors(
186192
187193
ancestor_data = tsinfer.generate_ancestors(sample_data, path="mydata.ancestors")
188194
189-
All other keyword arguments are passed to the :class:`AncestorData` constructor,
195+
Other keyword arguments are passed to the :class:`AncestorData` constructor,
190196
which may be used to control the storage properties of the generated file.
191197
192198
:param SampleData sample_data: The :class:`SampleData` instance that we are
193199
genering putative ancestors from.
194200
:param int num_threads: The number of worker threads to use. If < 1, use a
195201
simpler synchronous algorithm.
202+
:param str path: The path of the file to store the sample data. If None,
203+
the information is stored in memory and not persistent.
196204
:rtype: AncestorData
197205
:returns: The inferred ancestors stored in an :class:`AncestorData` instance.
198206
"""
199207
progress_monitor = _get_progress_monitor(progress_monitor)
200-
with formats.AncestorData(sample_data, **kwargs) as ancestor_data:
208+
with formats.AncestorData(sample_data, path=path, **kwargs) as ancestor_data:
201209
generator = AncestorsGenerator(
202-
sample_data, ancestor_data, progress_monitor, engine=engine,
203-
num_threads=num_threads)
210+
sample_data, ancestor_data, num_threads=num_threads,
211+
engine=engine, progress_monitor=progress_monitor)
204212
generator.add_sites()
205213
generator.run()
206214
ancestor_data.record_provenance("generate-ancestors")
207215
return ancestor_data
208216

209217

210218
def match_ancestors(
211-
sample_data, ancestor_data, progress_monitor=None, num_threads=0,
212-
path_compression=True, extended_checks=False, engine=constants.C_ENGINE):
219+
sample_data, ancestor_data, num_threads=0, path_compression=True,
220+
extended_checks=False, engine=constants.C_ENGINE, progress_monitor=None):
213221
"""
214222
match_ancestors(sample_data, ancestor_data, num_threads=0, path_compression=True)
215223
@@ -225,24 +233,25 @@ def match_ancestors(
225233
a history for.
226234
:param int num_threads: The number of match worker threads to use. If
227235
this is <= 0 then a simpler sequential algorithm is used (default).
228-
:param bool path_compression: Should we try to merge edges that share identical
229-
paths (essentially taking advantage of shared recombination breakpoints)
236+
:param bool path_compression: Whether to merge edges that share identical
237+
paths (essentially taking advantage of shared recombination breakpoints).
230238
:return: The ancestors tree sequence representing the inferred history
231239
of the set of ancestors.
232240
:rtype: tskit.TreeSequence
233241
"""
234242
matcher = AncestorMatcher(
235-
sample_data, ancestor_data, engine=engine,
236-
progress_monitor=progress_monitor, path_compression=path_compression,
237-
num_threads=num_threads, extended_checks=extended_checks)
243+
sample_data, ancestor_data, num_threads=num_threads,
244+
path_compression=path_compression, extended_checks=extended_checks,
245+
engine=engine, progress_monitor=progress_monitor)
238246
return matcher.match_ancestors()
239247

240248

241249
def augment_ancestors(
242-
sample_data, ancestors_ts, indexes, progress_monitor=None, num_threads=0,
243-
path_compression=True, extended_checks=False, engine=constants.C_ENGINE):
250+
sample_data, ancestors_ts, indexes, num_threads=0, path_compression=True,
251+
extended_checks=False, engine=constants.C_ENGINE, progress_monitor=None):
244252
"""
245-
augment_ancestors(sample_data, ancestors_ts, indexes, num_threads=0, simplify=True)
253+
augment_ancestors(sample_data, ancestors_ts, indexes, num_threads=0,\
254+
path_compression=True)
246255
247256
Runs the sample matching :ref:`algorithm <sec_inference_match_samples>`
248257
on the specified :class:`SampleData` instance and ancestors tree sequence,
@@ -260,32 +269,34 @@ def augment_ancestors(
260269
tree sequence.
261270
:param int num_threads: The number of match worker threads to use. If
262271
this is <= 0 then a simpler sequential algorithm is used (default).
272+
:param bool path_compression: Whether to merge edges that share identical
273+
paths (essentially taking advantage of shared recombination breakpoints).
263274
:return: The specified ancestors tree sequence augmented with copying
264275
paths for the specified sample.
265276
:rtype: tskit.TreeSequence
266277
"""
267278
manager = SampleMatcher(
268-
sample_data, ancestors_ts, path_compression=path_compression,
269-
engine=engine, progress_monitor=progress_monitor, num_threads=num_threads,
270-
extended_checks=extended_checks)
279+
sample_data, ancestors_ts, num_threads=num_threads,
280+
path_compression=path_compression, extended_checks=extended_checks,
281+
engine=engine, progress_monitor=progress_monitor
282+
)
271283
manager.match_samples(indexes)
272284
ts = manager.get_augmented_ancestors_tree_sequence(indexes)
273285
return ts
274286

275287

276288
def match_samples(
277-
sample_data, ancestors_ts, progress_monitor=None, num_threads=0,
278-
path_compression=True, simplify=True, extended_checks=False,
279-
stabilise_node_ordering=False, engine=constants.C_ENGINE):
289+
sample_data, ancestors_ts, num_threads=0, path_compression=True, simplify=True,
290+
extended_checks=False, stabilise_node_ordering=False, engine=constants.C_ENGINE,
291+
progress_monitor=None):
280292
"""
281-
match_samples(sample_data, ancestors_ts, num_threads=0, simplify=True)
293+
match_samples(sample_data, ancestors_ts, num_threads=0, path_compression=True,\
294+
simplify=True)
282295
283296
Runs the sample matching :ref:`algorithm <sec_inference_match_samples>`
284297
on the specified :class:`SampleData` instance and ancestors tree sequence,
285298
returning the final :class:`tskit.TreeSequence` instance containing
286-
the full inferred history for all samples and sites. If ``simplify`` is
287-
True (the default) run :meth:`tskit.TreeSequence.simplify` on the
288-
inferred tree sequence to ensure that it is in canonical form.
299+
the full inferred history for all samples and sites.
289300
290301
:param SampleData sample_data: The :class:`SampleData` instance
291302
representing the input data.
@@ -294,14 +305,20 @@ def match_samples(
294305
history among ancestral ancestral haplotypes.
295306
:param int num_threads: The number of match worker threads to use. If
296307
this is <= 0 then a simpler sequential algorithm is used (default).
308+
:param bool path_compression: Whether to merge edges that share identical
309+
paths (essentially taking advantage of shared recombination breakpoints).
310+
:param bool simplify: Whether to remove extra tree nodes and edges that are not
311+
on a path between the root and any of the samples. To do so, the final tree
312+
sequence is simplified by appling the :meth:`tskit.TreeSequence.simplify` method
313+
with ``keep_unary`` set to True (default = ``True``).
297314
:return: The tree sequence representing the inferred history
298315
of the sample.
299316
:rtype: tskit.TreeSequence
300317
"""
301318
manager = SampleMatcher(
302-
sample_data, ancestors_ts, path_compression=path_compression,
303-
engine=engine, progress_monitor=progress_monitor, num_threads=num_threads,
304-
extended_checks=extended_checks)
319+
sample_data, ancestors_ts, num_threads=num_threads,
320+
path_compression=path_compression, extended_checks=extended_checks,
321+
engine=engine, progress_monitor=progress_monitor)
305322
manager.match_samples()
306323
ts = manager.finalise(
307324
simplify=simplify, stabilise_node_ordering=stabilise_node_ordering)
@@ -313,7 +330,8 @@ class AncestorsGenerator(object):
313330
Manages the process of building ancestors.
314331
"""
315332
def __init__(
316-
self, sample_data, ancestor_data, progress_monitor, engine, num_threads=0):
333+
self, sample_data, ancestor_data, num_threads=0,
334+
engine=constants.C_ENGINE, progress_monitor=None):
317335
self.sample_data = sample_data
318336
self.ancestor_data = ancestor_data
319337
self.progress_monitor = progress_monitor
@@ -453,8 +471,8 @@ def run(self):
453471
class Matcher(object):
454472

455473
def __init__(
456-
self, sample_data, num_threads=1, engine=constants.C_ENGINE,
457-
path_compression=True, progress_monitor=None, extended_checks=False):
474+
self, sample_data, num_threads=1, path_compression=True,
475+
extended_checks=False, engine=constants.C_ENGINE, progress_monitor=None):
458476
self.sample_data = sample_data
459477
self.num_threads = num_threads
460478
self.path_compression = path_compression
@@ -1106,8 +1124,9 @@ def finalise(self, simplify=True, stabilise_node_ordering=False):
11061124
logger.info("Finalising tree sequence")
11071125
ts = self.get_samples_tree_sequence()
11081126
if simplify:
1109-
logger.info("Running simplify on {} nodes and {} edges".format(
1110-
ts.num_nodes, ts.num_edges))
1127+
logger.info(
1128+
"Running simplify(keep_unary=True) on {} nodes and {} edges".format(
1129+
ts.num_nodes, ts.num_edges))
11111130
if stabilise_node_ordering:
11121131
# Ensure all the node times are distinct so that they will have
11131132
# stable IDs after simplifying. This could possibly also be done
@@ -1122,7 +1141,8 @@ def finalise(self, simplify=True, stabilise_node_ordering=False):
11221141
tables.nodes.set_columns(flags=tables.nodes.flags, time=time)
11231142
tables.sort()
11241143
ts = tables.tree_sequence()
1125-
ts = ts.simplify(samples=self.sample_ids, filter_sites=False)
1144+
ts = ts.simplify(
1145+
samples=self.sample_ids, filter_sites=False, keep_unary=True)
11261146
logger.info("Finished simplify; now have {} nodes and {} edges".format(
11271147
ts.num_nodes, ts.num_edges))
11281148
return ts

0 commit comments

Comments
 (0)