document how to compute "per-variant-site" instead of "per-site" statistics #1914

Open
petrelharp opened this issue Nov 15, 2021 · 4 comments

@petrelharp
Contributor

It's been pointed out that other tools, like

  • ADMIXTOOLS (or, the frontend admixr)
  • treemix

report statistics (e.g., f3) calculated from only polymorphic sites instead of the whole genome, as we do. We should clarify this in the docs (particularly for f2/f3/f4, maybe) and provide an example somewhere of doing the conversion.

@jeromekelleher jeromekelleher added the documentation Documentation label Nov 15, 2021
@hyanwong
Member

Also, we might want to point out that even with mode="site" this would (I believe) compute the stats at each site defined in the tree sequence, regardless of whether that site was actually polymorphic or not. I wonder if it is worth having a mode="variable_site" setting, or whether this would just be confusing?

@jeromekelleher
Member

We need to make a distinction between "site" (entry in the site table, possibly with mutations, usually variable but might not be) and "position on the genome" (in the gaps between sites). Which one are you talking about @hyanwong? We can't call them "variable sites" because they're not necessarily variable and that would be even more confusing.

@hyanwong
Member

hyanwong commented Nov 17, 2021

I mean that our current stats, with mode="site", report stats for all sites defined in the sites table (I think). If other tools report statistics only at polymorphic sites, then they will not necessarily give the same answer, because even if a site is in the sites table, it is not guaranteed to be polymorphic. So perhaps we additionally want a mode="polymorphic/variable site" which returns stats only at explicitly variable sites? It presumably makes no difference if we are averaging along the genome, but maybe there are circumstances where it does?

@petrelharp
Contributor Author

This is (a) pretty straightforward and (b) rather confusing because of polyallelic sites. To do a "per variant" statistic we just need to do:

 ts.f4(sample_sets, windows=windows) / ts.segregating_sites(union_of_sample_sets, windows=windows)

If all sites are biallelic then this does exactly what you're saying, Yan: it doesn't count sites that are not polymorphic in the sample sets. The only confusing thing is sites with more than two alleles: if a site has 3 alleles present in the sample sets then the site counts twice. But this is actually correct, because that's what ts.f4() does also: it computes f4 for each allele (i.e., for presence/absence of each allele). And, to my knowledge, no other software computes these statistics for polyallelic sites at all, so there's no need to worry about agreeing with their definitions.


3 participants