
Ancestor generation requires entire genotype array to be in-memory. #806

Closed
@benjeffery

Description


As currently implemented, AncestorsGenerator.add_sites loads the entire sample data genotype array into memory. As one of our currently intended inference targets has 1.8 TB of genotypes, this is not feasible.

We can of course read the genotypes from the zarr-backed sample data as needed (in break_ancestor, make_ancestor and compute_ancestral_state), but care will need to be taken to add some kind of decoded-chunk caching mechanism. I thought this might be simple to do with zarr itself, but zarr-developers/zarr-python#306 has been open for a long time.
A simple FIFO or LRU cache shouldn't be too difficult to implement; as long as the cache is larger than the typical ancestor length, chunks shouldn't need to be loaded more than once.
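To sketch what I mean, something like the following LRU chunk cache could sit between the ancestor-building code and the zarr array. This is only an illustration, not a proposed implementation: the `ChunkCache` class, its `chunk_size`/`max_chunks` parameters, and the `genotypes(site)` accessor are all hypothetical names, and it assumes the genotype array is chunked along the sites dimension.

```python
from collections import OrderedDict


class ChunkCache:
    """Minimal LRU cache for decoded genotype chunks (sketch only).

    `array` is any 2D array-like chunked along axis 0 (sites), e.g. a
    zarr array. Reading a whole chunk slice decompresses it once and
    keeps the decoded result, working around zarr re-decoding chunks
    on every access (zarr-developers/zarr-python#306).
    """

    def __init__(self, array, chunk_size, max_chunks):
        self.array = array
        self.chunk_size = chunk_size
        self.max_chunks = max_chunks
        # Maps chunk index -> decoded ndarray; ordered oldest-first.
        self.cache = OrderedDict()

    def genotypes(self, site):
        chunk = site // self.chunk_size
        try:
            data = self.cache[chunk]
            self.cache.move_to_end(chunk)  # Mark as most recently used.
        except KeyError:
            start = chunk * self.chunk_size
            # Slicing a full chunk triggers a single decode.
            data = self.array[start : start + self.chunk_size][:]
            self.cache[chunk] = data
            if len(self.cache) > self.max_chunks:
                self.cache.popitem(last=False)  # Evict least recently used.
        return data[site - chunk * self.chunk_size]
```

With `max_chunks` sized so the cache spans a typical ancestor's length, the sequential site accesses in break_ancestor / make_ancestor / compute_ancestral_state would hit the same decoded chunk repeatedly rather than re-reading it from storage.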

@savitakartik This is what is OOM killing your jobs!
