We currently have no simple objective way to determine what mismatch ratios (in `match_samples` and `match_ancestors`) to use in real data. This is quite limiting when building tree sequences from real data.
However, a natural metric to use is the accuracy of imputation (e.g. as currently being worked on by @szhan). This would be an excellent way to determine the correct mismatch parameters to use. Here's how I think we would do it:
Take a real dataset and mark (say) 1% of the genotype data as missing, at random. Run `tsinfer` under different mismatch ratios (e.g. as here). Impute the missing data and calculate how well we do at imputation. Repeat for different random selections of 1% missing. A sketch of one round of this is below.
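Roughly, I imagine one round looking something like the following sketch. To be clear about what's assumed here: the `sites_genotypes`/`sites_position`/`sites_alleles` array access and the `recombination_rate`/`mismatch_ratio` keyword plumbing are my reading of the current `tsinfer` API, `chr20.samples` is a hypothetical input file, the ratio grid is illustrative, and the accuracy function assumes the output tree sequence keeps every input site in order with the same allele encoding:

```python
import numpy as np
import tskit
import tsinfer


def mask_genotypes(sd, fraction=0.01, seed=None):
    # Copy the SampleData, setting a random `fraction` of genotype calls to
    # tskit.MISSING_DATA (-1), which tsinfer then imputes during matching.
    rng = np.random.default_rng(seed)
    truth = sd.sites_genotypes[:]  # shape (num_sites, num_samples)
    mask = rng.random(truth.shape) < fraction
    masked_G = truth.copy()
    masked_G[mask] = tskit.MISSING_DATA
    with tsinfer.SampleData(sequence_length=sd.sequence_length) as masked:
        for pos, alleles, g in zip(
            sd.sites_position[:], sd.sites_alleles[:], masked_G
        ):
            masked.add_site(pos, g, alleles=list(alleles))
    return masked, truth, mask


def infer(sd, rho, mismatch_ratio):
    # Standard three-step pipeline; here the same ratio is used in both
    # matching steps, but the two could also be varied independently.
    ancestors = tsinfer.generate_ancestors(sd)
    anc_ts = tsinfer.match_ancestors(
        sd, ancestors, recombination_rate=rho, mismatch_ratio=mismatch_ratio
    )
    return tsinfer.match_samples(
        sd, anc_ts, recombination_rate=rho, mismatch_ratio=mismatch_ratio
    )


def imputation_accuracy(ts, truth, mask):
    # Genotypes in the output tree sequence at the masked entries are the
    # imputed values; score them against the held-out truth.
    imputed = np.array([var.genotypes for var in ts.variants()])
    return np.mean(imputed[mask] == truth[mask])


sd = tsinfer.load("chr20.samples")  # hypothetical input file
masked_sd, truth, mask = mask_genotypes(sd, fraction=0.01, seed=42)
for ratio in [0.001, 0.01, 0.1, 1]:  # illustrative grid of mismatch ratios
    ts = infer(masked_sd, rho=1e-8, mismatch_ratio=ratio)
    print(ratio, imputation_accuracy(ts, truth, mask))
```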
It would be helpful to do this both for different chromosomes and for different datasets (e.g. separately on 1000G, HGDP, and SGDP). Hopefully the different chromosomes would show the same pattern.
It would also be interesting to see how we do for "weird" regions of the genome, such as the MHC.
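For the "weird" regions question, the same per-entry mask makes it easy to score accuracy within a genomic interval only. A minimal sketch, reusing the names from the code above; the MHC coordinates are approximate and just for illustration:

```python
def regional_accuracy(ts, truth, mask, positions, interval):
    # Restrict the accuracy calculation to sites in [left, right);
    # `positions` is sd.sites_position[:] from the sketch above.
    left, right = interval
    in_region = (positions >= left) & (positions < right)
    imputed = np.array([var.genotypes for var in ts.variants()])
    sub = mask & in_region[:, None]
    return np.mean(imputed[sub] == truth[sub])


# e.g. the MHC on chr6 (approximate GRCh38 coordinates, for illustration)
print(
    regional_accuracy(
        ts, truth, mask, sd.sites_position[:], (28_500_000, 33_500_000)
    )
)
```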