document how to compute "per-variant-site" instead of "per-site" statistics #1914
Comments
Also we might want to point out that even with |
We need to make a distinction between "site" (entry in the site table, possibly with mutations, usually variable but might not be) and "position on the genome" (in the gaps between sites). Which one are you talking about @hyanwong? We can't call them "variable sites" because they're not necessarily variable and that would be even more confusing. |
I mean that our current stats, with |
This is (a) pretty straightforward and (b) rather confusing because of polyallelic sites. To do a "per variant" statistic we just need to do:
If all sites are biallelic then this does exactly what you're saying, Yan: it doesn't count sites that are not polymorphic in the sample sets. The only confusing thing is for more-than-two-allelic sites: if a site has 3 alleles present in the sample sets then the site counts twice. But, this is actually correct, because that's what |
It's been pointed out that other tools like
report statistics (e.g. f3) calculated from only the polymorphic sites, rather than across the whole genome as we do. We should clarify this in the docs (particularly for f2/f3/f4, maybe) and provide an example somewhere of doing the conversion.