Core operations in human GWAS workloads #696
hammer started this conversation in Discourse import
(Posted by @eric-czech)
This is a summary of the "core" functionality we've identified in human GWAS pipelines, along with the scaling characteristics of each class of operations (for `n` variants, `m` samples):

- [O(nm)]
- [O(nmd)]: `d` indicates variant density, measured as the number of variants in fixed-size bp windows (1000 kbp is common)
- LD matrices [O(n^2)]: very difficult to compute for imputed datasets
- [O(nm^2)]: `m^2` was not a problem in the past, but biobanks are making it one now
- [O(nm)]
- Computing `k` principal components [O(nmk)]: randomized approaches can reduce this further to O(nm log(k))
- [O(nm)]
- [O(nm^2)]
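As a rough illustration of the O(nmk) vs. O(nm log(k)) point for PCA, here is a minimal randomized-projection sketch in NumPy. The function name and toy data are my own for illustration, not taken from any GWAS library; the algorithm is the standard Halko-style randomized SVD.

```python
import numpy as np

def randomized_pcs(X, k, n_oversample=10, seed=0):
    """Approximate top-k singular structure of an (n x m) genotype
    matrix via random projection (Halko et al. style sketch)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    # Project the m-dimensional sample axis down to k + oversampling
    # columns: the dominant cost is O(nm(k + oversample)), versus a
    # full SVD which scales like O(nm * min(n, m)).
    Omega = rng.standard_normal((m, k + n_oversample))
    Y = X @ Omega
    Q, _ = np.linalg.qr(Y)            # orthonormal basis for range(X)
    B = Q.T @ X                       # small (k + oversample) x m matrix
    U_b, s, Vt = np.linalg.svd(B, full_matrices=False)
    return (Q @ U_b)[:, :k], s[:k], Vt[:k]

# Toy genotype matrix: n=1000 variants x m=200 samples, dosages in {0,1,2}
rng = np.random.default_rng(1)
X = rng.integers(0, 3, size=(1000, 200)).astype(float)
X -= X.mean(axis=1, keepdims=True)    # center per variant
U, s, Vt = randomized_pcs(X, k=5)
print(U.shape, s.shape, Vt.shape)     # (1000, 5) (5,) (5, 200)
```

The practical point is that only a k-dimensional sketch of the sample axis is ever decomposed, which is what lets biobank-scale PCA avoid the full O(nm^2) factorization.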
Notably, relatedness estimation / pruning is the most difficult of the operations above to scale: outside of using self-reported relatedness or external reference panels, it is the only one that remains superlinear in the number of variants or samples.
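To make the `m^2` term concrete, here is a minimal GRM-style kinship sketch. The standardization formula is the commonly used GCTA-style one and is my assumption for illustration, not a claim about any specific pipeline's implementation.

```python
import numpy as np

def grm(X):
    """X: (n_variants, m_samples) dosage matrix with values in [0, 2].
    Returns an (m x m) genetic relationship matrix estimate."""
    p = X.mean(axis=1, keepdims=True) / 2.0        # per-variant allele frequency
    Z = (X - 2 * p) / np.sqrt(2 * p * (1 - p))     # standardized genotypes
    n = X.shape[0]
    # The m x m product below is the O(nm^2) step: every pair of samples
    # is compared across all variants, so cost grows quadratically in m,
    # which is exactly what biobank-scale sample counts make painful.
    return (Z.T @ Z) / n

# Toy data: 500 variants x 50 samples (polymorphic with overwhelming
# probability, so the standardization above does not divide by zero)
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(500, 50)).astype(float)
K = grm(X)
print(K.shape)    # (50, 50)
```

The output matrix is symmetric by construction; doubling `m` quadruples both its size and the work to fill it, which is why this step dominates before any pruning can be applied.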
This list is based largely on the following: