Allow missing data in samples #169

hyanwong · 2019-07-19T15:18:49Z

For brief review, but not merging yet. Addresses #153

hyanwong · 2019-07-19T15:22:35Z

Ping @marianne-aspbury - this is not quite ready yet, as the Li & Stephens matching algorithm doesn't deal with missing data properly, but it's a good start, and the test suite passes. Most of the fundamental work is in algorithm.py.

codecov · 2019-07-19T15:31:12Z

Codecov Report

Merging #169 into master will decrease coverage by 0.07%.
The diff coverage is 93.61%.

@@            Coverage Diff             @@
##           master     #169      +/-   ##
==========================================
- Coverage   91.86%   91.79%   -0.08%     
==========================================
  Files          15       15              
  Lines        4489     4523      +34     
  Branches      807      818      +11     
==========================================
+ Hits         4124     4152      +28     
- Misses        245      249       +4     
- Partials      120      122       +2

| Flag | Coverage Δ | |
|---|---|---|
| #C | `91.79% <93.61%> (-0.08%)` | ⬇️ |
| #python | `95.00% <94.23%> (-0.08%)` | ⬇️ |

| Impacted Files | Coverage Δ | |
|---|---|---|
| tsinfer/eval_util.py | `87.84% <ø> (ø)` | |
| tsinfer/inference.py | `98.27% <50.00%> (-0.47%)` | ⬇️ |
| lib/ancestor_builder.c | `87.79% <92.85%> (-0.34%)` | ⬇️ |
| tsinfer/algorithm.py | `98.28% <100.00%> (+<0.01%)` | ⬆️ |
| tsinfer/formats.py | `96.97% <100.00%> (+0.04%)` | ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update e01b15f...5bb2089.

hyanwong · 2019-08-07T17:44:06Z

NB: currently failing tests because it relies on tskit v0.2.0. Tests pass fine on my machine with the new tskit, so I'm simply waiting for the next tskit release before seeking review.

tsinfer/inference.py

hyanwong · 2019-08-12T11:09:46Z

Currently the locate_mutations_on_tree() functionality is failing on some inferred tree sequences, because samples that are missing all inference sites get added to a new "artificial root", which is not attached to anything else. If the other samples present at that genomic location already fall in a tree, this creates a location with two roots, which isn't properly handled by the parsimony code in locate_mutations_on_tree().

I have added in a test case, test_non_inference_samples, to help solve this.
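
The two-root situation described above can be illustrated with a minimal pure-Python sketch. It does not call tsinfer or tskit; the `parent` mapping and the `roots_of_samples` helper are hypothetical names chosen for this example only. A sample with no usable inference sites hangs from its own isolated root, so walking up from the samples reaches two distinct roots.

```python
def roots_of_samples(parent, samples):
    """Return the set of roots reached by walking up from each sample.

    `parent` maps node -> parent node; a node that is absent from the
    mapping (or mapped to None) is treated as a root.
    """
    roots = set()
    for u in samples:
        while parent.get(u) is not None:
            u = parent[u]
        roots.add(u)
    return roots


# Samples 0-2 coalesce under root 5. Sample 3 is missing all inference
# sites, so it is attached to an isolated "artificial root" (node 6).
parent = {0: 4, 1: 4, 2: 5, 4: 5, 3: 6}
samples = [0, 1, 2, 3]

roots = roots_of_samples(parent, samples)
print(sorted(roots))
```

Parsimony code that assumes one root per tree would need either to treat each root separately or to reject such trees.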

hyanwong · 2019-08-31T13:32:10Z

@jeromekelleher I think this is ready to merge. It contains a few separate fixes, so we might want to split them apart. Quite a few are interdependent, however - e.g. upgrading to tskit 0.2.1 means fixing the uint8/int8 haplotype storage, and fixing the appveyor build, then dealing with the missing data in int8 haplotypes means fixing the map_mutations method, so I haven't tried to disentangle them all.
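
Why missing data forces the move from uint8 to int8 can be shown with a small standard-library sketch. tskit marks missing genotypes with the value -1 (`tskit.MISSING_DATA`), which fits in a signed 8-bit integer but not in an unsigned one. The `MISSING` constant below is a local stand-in rather than an import from tskit, and the haplotype is made up.

```python
import ctypes

MISSING = -1  # local stand-in for tskit.MISSING_DATA

# The same value stored in a signed and an unsigned 8-bit integer.
as_int8 = ctypes.c_int8(MISSING).value
as_uint8 = ctypes.c_uint8(MISSING).value

# A made-up haplotype with one missing genotype.
haplotype = [0, 1, MISSING, 1]
stored_signed = [ctypes.c_int8(g).value for g in haplotype]
stored_unsigned = [ctypes.c_uint8(g).value for g in haplotype]

print(stored_signed, stored_unsigned)
```

With unsigned storage the missing marker silently turns into 255 and can no longer be compared against -1.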

jeromekelleher · 2019-09-02T11:47:37Z

Can you bring this up to date please @hyanwong? Probably the simplest thing to do is rebase, dropping the commits before 47a0153. It's probably easiest to drop the last commit also, as appveyor has been fixed up on master. You may want to separate out the map_mutations code from missing data handling, since this is quite a big change in itself, worthy of its own PR.

hyanwong · 2019-09-02T13:40:31Z

I forgot that the map_mutations code needs committing before this will pass tests. Could you merge #185 first @jeromekelleher ?

jeromekelleher · 2019-09-02T13:53:37Z

Looks like they're interdependent @hyanwong --- see comments over in #185 for the options.

awohns · 2019-09-09T19:34:49Z

Just to confirm, this will handle missing data in samples, but not missing sites, correct? So if not all the samples have information at a given site this would work, but if the sample_data file does not contain all the sites in an ancestors_tree_sequence the code still would break.

hyanwong · 2019-09-09T19:53:48Z

Yes, a comment in the function restore_tree_sequence_builder says:

        # Make sure that the set of positions in the ancestors tree sequence is
        # identical to the inference sites in the sample data file.

So I guess it's not set up for changing the inference sites between the ancestors TS and the match_samples() routine.

awohns · 2019-09-09T19:59:01Z

My workaround for that check was to remove sites in the ancestors_ts which aren't also present in the sample_data. This passes the np.array_equal(position, sample_data_position) check but will still break on

        if np.any(pos_map[left] != edges.left):
            raise ValueError("Invalid left coordinates")
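
A simplified, pure-Python stand-in for that coordinate check shows why dropping sites is not enough. Only the final comparison and the error message come from the thread; the way `pos_map` is built here, the `check_left_coordinates` helper, and the example positions are assumptions made for illustration.

```python
import bisect


def check_left_coordinates(site_positions, sequence_length, edge_lefts):
    """Raise ValueError if an edge's left coordinate is not in the map.

    Assumed map: coordinate 0 in place of the first site, every later
    site position, and the sequence length.
    """
    pos_map = list(site_positions) + [sequence_length]
    pos_map[0] = 0
    for left in edge_lefts:
        i = bisect.bisect_left(pos_map, left)
        if i == len(pos_map) or pos_map[i] != left:
            raise ValueError("Invalid left coordinates")


ancestor_sites = [10, 20, 30, 40]
edge_lefts = [0, 20, 30]  # edges built against all four sites

check_left_coordinates(ancestor_sites, 50, edge_lefts)  # passes

# Dropping the site at position 20 makes the position arrays agree with a
# reduced sample data file, but an existing edge still starts at 20.
reduced_sites = [10, 30, 40]
try:
    check_left_coordinates(reduced_sites, 50, edge_lefts)
    failed = False
except ValueError:
    failed = True

print(failed)
```

Removing sites leaves the edge coordinates untouched, so any edge that begins at a removed position no longer maps to a site index.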

hyanwong · 2019-09-09T20:04:33Z

Yes, it's not so simple to change the inference sites in the middle of the pathway, as the indexes will be all up the spout. I'll open another issue about this.

hyanwong · 2020-02-10T13:26:03Z

The current implementation allows sites with missing data to be used for inference, but doesn't change the matching algorithm, which is why test_missing_inference_sites in tests/test_inference.py currently fails (NB: this is only an issue when matching samples with missing data). There's no point altering the current sample-matching algorithm to skip missing data sites, as this is going to be obsoleted by tskit-dev/tskit#452 and friends. So after discussion with @jeromekelleher, this PR is being punted down the line until after the generalised tskit L&S matching algorithm is incorporated into tsinfer.

All of the work, apart from the incorporation of missing data into matching, is now in this PR, so it should be relatively easy to squash all my commits and merge once the L&S plumbing is done.
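
For readers unfamiliar with the idea, here is a rough sketch of what "skipping" missing sites could mean in a Li & Stephens-style matcher: a missing allele gets an emission probability of 1, so it does not favour any reference haplotype. This is a textbook-style Viterbi pass written for illustration, not tsinfer's matching code and not the generalised tskit implementation; the function name and the `rho` and `mu` parameters are assumptions.

```python
MISSING = -1  # local stand-in for tskit.MISSING_DATA


def viterbi_match(query, panel, rho=0.1, mu=0.01):
    """Return a most likely copying path of `query` through `panel`.

    query: list of alleles (0, 1 or MISSING); panel: list of haplotypes
    of the same length. rho is the per-site switch probability and mu
    the per-site mismatch probability.
    """
    n = len(panel)

    def emission(j, site):
        if query[site] == MISSING:
            return 1.0  # uninformative: every haplotype fits equally well
        return 1 - mu if panel[j][site] == query[site] else mu

    V = [emission(j, 0) / n for j in range(n)]
    back = []
    for site in range(1, len(query)):
        best_prev = max(range(n), key=lambda k: V[k])
        new_V, pointers = [], []
        for j in range(n):
            stay = V[j] * (1 - rho + rho / n)
            switch = V[best_prev] * rho / n
            if stay >= switch:
                p, prev = stay, j
            else:
                p, prev = switch, best_prev
            new_V.append(p * emission(j, site))
            pointers.append(prev)
        scale = max(new_V)  # rescale to avoid underflow on long sequences
        V = [v / scale for v in new_V]
        back.append(pointers)

    # Trace back from the best final state.
    j = max(range(n), key=lambda k: V[k])
    path = [j]
    for pointers in reversed(back):
        j = pointers[j]
        path.append(j)
    path.reverse()
    return path


panel = [[0, 0, 0, 0, 0], [1, 1, 1, 1, 1]]
query = [1, 1, MISSING, 1, 1]
path = viterbi_match(query, panel)
print(path)
```

In this toy case the missing site inherits the copying parent of its neighbours, which amounts to imputing the missing allele from the matched haplotype.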

jeromekelleher · 2020-03-19T14:26:36Z

@hyanwong, I've rebased this and made minimal changes to make the tests pass. I suggest we merge this and then I'll finish up the process of implementing missing data.

hyanwong · 2020-03-19T15:36:08Z

Perfect, thanks

hyanwong mentioned this pull request Jul 21, 2019

Ancestral states when sample data is missing tskit-dev/tskit#270

Closed

hyanwong force-pushed the allow-missing-data-in-samples branch 4 times, most recently from a8ad239 to 7dd5061, on August 7, 2019 16:30

hyanwong force-pushed the allow-missing-data-in-samples branch 4 times, most recently from 4217438 to 9c26fc0, on August 9, 2019 16:27

hyanwong commented Aug 9, 2019

tsinfer/inference.py (outdated)

hyanwong force-pushed the allow-missing-data-in-samples branch 2 times, most recently from c493191 to 0622d85, on August 14, 2019 15:41

hyanwong force-pushed the allow-missing-data-in-samples branch from 0622d85 to fddef64 on August 31, 2019 12:28

hyanwong mentioned this pull request Aug 31, 2019

Map mutations for noninference sites #179

Closed

hyanwong force-pushed the allow-missing-data-in-samples branch from d3e7bdf to 942fa8d on September 2, 2019 13:23

hyanwong mentioned this pull request Sep 2, 2019

Switch to (signed) int8 for haplotype storage #168

Closed

hyanwong force-pushed the allow-missing-data-in-samples branch 2 times, most recently from 7835114 to 4beec4e, on September 2, 2019 15:26

This was referenced Feb 7, 2020

Options for dealing with missing flanking data in samples #224

Open

Document that all missing data is imputed #225

Closed

Should probably switch to true frequencies for variant ages, rather than using counts #227

Closed

hyanwong force-pushed the allow-missing-data-in-samples branch from 4bf9085 to 478f05f on February 8, 2020 22:22

jeromekelleher mentioned this pull request Mar 18, 2020

Support missing data in LS copying process #248

Closed

hyanwong added 17 commits March 19, 2020 14:11

Initial pass at allowing missing data (particularly truncated) in samples

d01dd1e

Add intersphinx mappings

dfea651

Remove comment and place in tskit-dev#226

798b9ca

Address PR comments

7fcbfa7

Document imputing everything

b7667c2

Fixes tskit-dev#225

Remove ability to output truncated inferences

99c51aa

Linting

a62b987

Test ancestor builders equal with missing

ee8e6e9

Make C ancestor builder equivalent

fd1bebf

Add test for all missing data at an adjacent site

0e58aa0

Account for missing data for consensus

f5f78df

Take account of missing data when breaking

19b4322

Make C ancestor builder the same as py

38cb683

Add between-sites missing data

55f7f41

Explicit exclusion for all missing sites

78800fa

Remove currently unused tsutil

d04fc50

Change inference tests to check missing data imputation

092e7cc

jeromekelleher force-pushed the allow-missing-data-in-samples branch from d645c63 to db7f360 on March 19, 2020 14:24

Fix up merge issues and disable test.

5bb2089

jeromekelleher force-pushed the allow-missing-data-in-samples branch from db7f360 to 5bb2089 on March 19, 2020 14:44

jeromekelleher merged commit cac9a35 into tskit-dev:master Mar 19, 2020

hyanwong deleted the allow-missing-data-in-samples branch March 19, 2020 15:42
