Skip to content

Conversation

@hyanwong
Copy link
Member

@hyanwong hyanwong commented Jul 19, 2019

For brief review, but not merging yet. Addresses #153

@hyanwong
Copy link
Member Author

Ping @marianne-aspbury - this is not quite ready yet, as the Li & Stevens matching algorithm doesn't deal with missing data properly, but it's a good start, and the test suite passes. Most of the fundamental work is in algorithm.py

@codecov
Copy link

codecov bot commented Jul 19, 2019

Codecov Report

Merging #169 into master will decrease coverage by 0.07%.
The diff coverage is 93.61%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #169      +/-   ##
==========================================
- Coverage   91.86%   91.79%   -0.08%     
==========================================
  Files          15       15              
  Lines        4489     4523      +34     
  Branches      807      818      +11     
==========================================
+ Hits         4124     4152      +28     
- Misses        245      249       +4     
- Partials      120      122       +2     
Flag Coverage Δ
#C 91.79% <93.61%> (-0.08%) ⬇️
#python 95.00% <94.23%> (-0.08%) ⬇️
Impacted Files Coverage Δ
tsinfer/eval_util.py 87.84% <ø> (ø)
tsinfer/inference.py 98.27% <50.00%> (-0.47%) ⬇️
lib/ancestor_builder.c 87.79% <92.85%> (-0.34%) ⬇️
tsinfer/algorithm.py 98.28% <100.00%> (+<0.01%) ⬆️
tsinfer/formats.py 96.97% <100.00%> (+0.04%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e01b15f...5bb2089. Read the comment docs.

@hyanwong hyanwong force-pushed the allow-missing-data-in-samples branch 4 times, most recently from a8ad239 to 7dd5061 Compare August 7, 2019 16:30
@hyanwong
Copy link
Member Author

hyanwong commented Aug 7, 2019

NB: currently failing tests because it relies on tskit v0.2.0. Tests pass fine on my machine with the new tskit, so I'm simply waiting for the next tskit release before seeking review.

@hyanwong hyanwong force-pushed the allow-missing-data-in-samples branch 4 times, most recently from 4217438 to 9c26fc0 Compare August 9, 2019 16:27
@hyanwong
Copy link
Member Author

hyanwong commented Aug 12, 2019

Currently the locate_mutations_on_tree() functionality is failing on some inferred tree sequences, because samples that are missing all inference sites get added to a new "artificial root", which is not attached to anywhere else. If the other samples present at that genomic location already fall in a tree, this creates a location with 2 roots, which isn't properly handled by the parsimony code in locate_mutations_on_tree().

I have added in a test case, test_non_inference_samples, to help solve this.

@hyanwong hyanwong force-pushed the allow-missing-data-in-samples branch 2 times, most recently from c493191 to 0622d85 Compare August 14, 2019 15:41
@hyanwong hyanwong force-pushed the allow-missing-data-in-samples branch from 0622d85 to fddef64 Compare August 31, 2019 12:28
@hyanwong
Copy link
Member Author

@jeromekelleher I think this is ready to merge. It contains a few separate fixes, so we might want to split them apart. Quite a few are interdependent, however - e.g. upgrading to tskit 0.2.1 means fixing the uint8/int8 haplotype storage, and fixing the appveyor build, then dealing with the missing data in int8 haplotypes means fixing the map_mutations method, so I haven't tried to disentangle them all.

@jeromekelleher
Copy link
Member

Can you bring this up to date please @hyanwong? Probably the simplest thing to do is rebase, dropping the commits before 47a0153. It's probably easiest to drop the last commit also, as appveyor has been fixed up on master. You may want to separate out the map_mutations code from missing data handling, since this is quite a bit change in itself worthy of its own PR.

@hyanwong hyanwong force-pushed the allow-missing-data-in-samples branch from d3e7bdf to 942fa8d Compare September 2, 2019 13:23
@hyanwong
Copy link
Member Author

hyanwong commented Sep 2, 2019

I forgot that the map_mutations code needs committing before this will pass tests. Could you merge #185 first @jeromekelleher ?

@jeromekelleher
Copy link
Member

Looks like they're interdependant @hyanwong --- see comments over in #185 for the options.

@hyanwong hyanwong force-pushed the allow-missing-data-in-samples branch 2 times, most recently from 7835114 to 4beec4e Compare September 2, 2019 15:26
@awohns
Copy link
Member

awohns commented Sep 9, 2019

Just to confirm, this will handle missing data in samples, but not missing sites, correct? So if not all the samples have information at a given site this would work, but if the sample_data file does not contain all the sites in an ancestors_tree_sequence the code still would break.

@hyanwong
Copy link
Member Author

hyanwong commented Sep 9, 2019

Yes, comments in the function restore_tree_sequence_builder says:

        # Make sure that the set of positions in the ancestors tree sequence is
        # identical to the inference sites in the sample data file.

So I guess it's not set up for changing the inference sites between the ancestors TS and the match_samples() routine.

@awohns
Copy link
Member

awohns commented Sep 9, 2019

My work around for that check was to remove sites in the ancestors_ts which aren't also present in the sample_data. This passes the np.array_equal(position, sample_data_position) but will still break on
if np.any(pos_map[left] != edges.left): raise ValueError("Invalid left coordinates")

@hyanwong
Copy link
Member Author

hyanwong commented Sep 9, 2019

Yes, it's not so simple to change the inference sites in the middle of the pathway, as the indexes will be all up the spout. I'll open another issue about this.

@hyanwong
Copy link
Member Author

hyanwong commented Feb 10, 2020

The current implementation allows sites with missing data to be used for inference, but doesn't change the matching algorithm, which is why test_missing_inference_sites in tests/test_inference.py currently fails (NB: this is only an issue when matching samples with missing data). There's no point altering the current sample-matching algorithm to skip missing data sites, as this is going to be obsoleted by tskit-dev/tskit#452 and friends. So after discussion with @jeromekelleher, this PR is being punted down the line until after the generalised tskit L&S matching algorithm is incorporated into tsinfer.

All of the work, apart from the incorporation of missing data into matching, is now in this PR, so it should be relatively easy to squash all my commits and merge once the L&S plumbing is done.

@jeromekelleher jeromekelleher force-pushed the allow-missing-data-in-samples branch from d645c63 to db7f360 Compare March 19, 2020 14:24
@jeromekelleher
Copy link
Member

@hyanwong, I've rebased this and made minimal changes to make the tests pass. I suggest we merge this and then I'll finish up the process of implementing missing data.

@jeromekelleher jeromekelleher force-pushed the allow-missing-data-in-samples branch from db7f360 to 5bb2089 Compare March 19, 2020 14:44
@hyanwong
Copy link
Member Author

Perfect, thanks

@jeromekelleher jeromekelleher merged commit cac9a35 into tskit-dev:master Mar 19, 2020
@hyanwong hyanwong deleted the allow-missing-data-in-samples branch March 19, 2020 15:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants