We currently have no simple objective way to determine what mismatch ratios (in `match_samples` and `match_ancestors`) to use in real data. This is quite limiting when building tree sequences from real data.
However, a natural metric to use is the accuracy of imputation (e.g. as currently being worked on by @szhan). This would be an excellent way to determine the correct mismatch parameters to use. Here's how I think we would do it:
Take a real dataset and mark (say) 1% of the genotype data as missing, at random. Run `tsinfer` under different mismatch ratios (e.g. as here). Impute the missing data and calculate how well we do at imputation. Repeat for different random selections of 1% missing. A sketch of one round of this is below.
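Roughly, I imagine one round looking something like the following sketch. To be clear about what's assumed here: the `sites_genotypes`/`sites_position`/`sites_alleles` array access and the `recombination_rate`/`mismatch_ratio` keyword plumbing are my reading of the current `tsinfer` API, `chr20.samples` is a hypothetical input file, the ratio grid is illustrative, and the accuracy function assumes the output tree sequence keeps every input site in order with the same allele encoding:

```python
import numpy as np
import tskit
import tsinfer


def mask_genotypes(sd, fraction=0.01, seed=None):
    # Copy the SampleData, setting a random `fraction` of genotype calls to
    # tskit.MISSING_DATA (-1), which tsinfer then imputes during matching.
    rng = np.random.default_rng(seed)
    truth = sd.sites_genotypes[:]  # shape (num_sites, num_samples)
    mask = rng.random(truth.shape) < fraction
    masked_G = truth.copy()
    masked_G[mask] = tskit.MISSING_DATA
    with tsinfer.SampleData(sequence_length=sd.sequence_length) as masked:
        for pos, alleles, g in zip(
            sd.sites_position[:], sd.sites_alleles[:], masked_G
        ):
            masked.add_site(pos, g, alleles=list(alleles))
    return masked, truth, mask


def infer(sd, rho, mismatch_ratio):
    # Standard three-step pipeline; here the same ratio is used in both
    # matching steps, but the two could also be varied independently.
    ancestors = tsinfer.generate_ancestors(sd)
    anc_ts = tsinfer.match_ancestors(
        sd, ancestors, recombination_rate=rho, mismatch_ratio=mismatch_ratio
    )
    return tsinfer.match_samples(
        sd, anc_ts, recombination_rate=rho, mismatch_ratio=mismatch_ratio
    )


def imputation_accuracy(ts, truth, mask):
    # Genotypes in the output tree sequence at the masked entries are the
    # imputed values; score them against the held-out truth.
    imputed = np.array([var.genotypes for var in ts.variants()])
    return np.mean(imputed[mask] == truth[mask])


sd = tsinfer.load("chr20.samples")  # hypothetical input file
masked_sd, truth, mask = mask_genotypes(sd, fraction=0.01, seed=42)
for ratio in [0.001, 0.01, 0.1, 1]:  # illustrative grid of mismatch ratios
    ts = infer(masked_sd, rho=1e-8, mismatch_ratio=ratio)
    print(ratio, imputation_accuracy(ts, truth, mask))
```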
It would be helpful to do this both for different chromosomes and for different datasets (e.g. separately on 1000G, HGDP, and SGDP). Hopefully the different chromosomes would show the same pattern.
It would also be interesting to see how we do for "weird" regions of the genome, such as the MHC.
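For the "weird" regions question, the same per-entry mask makes it easy to score accuracy within a genomic interval only. A minimal sketch, reusing the names from the code above; the MHC coordinates are approximate and just for illustration:

```python
def regional_accuracy(ts, truth, mask, positions, interval):
    # Restrict the accuracy calculation to sites in [left, right);
    # `positions` is sd.sites_position[:] from the sketch above.
    left, right = interval
    in_region = (positions >= left) & (positions < right)
    imputed = np.array([var.genotypes for var in ts.variants()])
    sub = mask & in_region[:, None]
    return np.mean(imputed[sub] == truth[sub])


# e.g. the MHC on chr6 (approximate GRCh38 coordinates, for illustration)
print(
    regional_accuracy(
        ts, truth, mask, sd.sites_position[:], (28_500_000, 33_500_000)
    )
)
```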