Use imputation accuracy to determine the correct mismatch ratios #652

@hyanwong

We currently have no simple, objective way to determine which mismatch ratios (in match_samples and match_ancestors) to use for real data. This is quite limiting when building tree sequences from such data.

However, a natural metric to use is to look at the accuracy of imputation (e.g. as currently being worked on by @szhan). This would be an excellent way to determine the correct mismatch parameters to use. Here's how I think we would do it:

Take a real dataset, and mark (say) 1% of the genotype data as missing, at random. Run tsinfer under different mismatch ratios (e.g. as here). Impute the missing data and calculate how well we do at imputation, i.e. how often the imputed values match the genotypes that were masked out. Repeat for different random selections of the 1% missing.
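The masking and scoring steps could be sketched roughly as follows. This is only an illustration under stated assumptions: genotypes are held in a NumPy array of shape (sites, samples), missing data is coded as -1, and `impute` is a hypothetical callable standing in for "run tsinfer with this mismatch ratio and read back the imputed genotypes". None of the function names below are tsinfer API, and `majority_impute` is a toy placeholder that ignores the ratio entirely.

```python
import numpy as np

MISSING = -1  # assumed missing-data code for this sketch


def mask_genotypes(genotypes, fraction=0.01, seed=None):
    """Return a copy with a random `fraction` of entries set to MISSING, plus the mask."""
    rng = np.random.default_rng(seed)
    mask = rng.random(genotypes.shape) < fraction
    masked = genotypes.copy()
    masked[mask] = MISSING
    return masked, mask


def imputation_accuracy(truth, imputed, mask):
    """Fraction of masked entries whose imputed value equals the true value."""
    if not mask.any():
        return float("nan")
    return float(np.mean(truth[mask] == imputed[mask]))


def scan_mismatch_ratios(genotypes, impute, ratios, replicates=3, fraction=0.01, seed=0):
    """Mean imputation accuracy per mismatch ratio over several random maskings.

    `impute(masked_genotypes, ratio)` is a hypothetical callable that must
    return an array of the same shape with the missing entries filled in.
    The same masks are reused for every ratio so the comparison is paired.
    """
    results = {}
    for ratio in ratios:
        scores = []
        for rep in range(replicates):
            masked, mask = mask_genotypes(genotypes, fraction, seed + rep)
            imputed = impute(masked, ratio)
            scores.append(imputation_accuracy(genotypes, imputed, mask))
        results[ratio] = float(np.mean(scores))
    return results


def majority_impute(masked, ratio):
    """Toy stand-in for tsinfer: fill each site with its most common observed allele."""
    imputed = masked.copy()
    for row in imputed:  # each row is a view, so assignment modifies `imputed`
        observed = row[row != MISSING]
        fill = 0 if observed.size == 0 else np.bincount(observed).argmax()
        row[row == MISSING] = fill
    return imputed
```

In real use, `majority_impute` would be replaced by a function that builds a tree sequence from the masked data at the given mismatch ratio and extracts the imputed genotypes; the ratio with the highest mean accuracy would then be the candidate to use.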

It would be helpful to do this both for different chromosomes, and for different datasets (e.g. separately on 1000G, HGDP, and SGDP). Hopefully the different chromosomes would show the same pattern.

It would also be interesting to see how we do for "weird" regions of the genome, such as the MHC.
