
Ancestor generation requires entire genotype array to be in-memory. #806

Closed
@benjeffery

Description


As currently implemented, AncestorsGenerator.add_sites loads the entire sample data genotype array into memory. As one of our currently intended inference targets has 1.8 TB of genotypes, this is not feasible.

We can of course read the genotypes from the zarr-backed sample data as needed (in break_ancestor, make_ancestor and compute_ancestral_state), but care will need to be taken to add some kind of decoded-chunk caching mechanism. I thought this might be simple to do with zarr itself, but zarr-developers/zarr-python#306 has been open for a long time.
A simple FIFO or LRU cache shouldn't be too difficult to implement; as long as the cache is larger than the typical ancestor length, chunks shouldn't need to be loaded more than once.
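To sketch what I mean, something like the following LRU chunk cache could sit between the ancestor-building code and the zarr array. This is only an illustration, not a proposed implementation: the `ChunkCache` class, its `chunk_size`/`max_chunks` parameters, and the `genotypes(site)` accessor are all hypothetical names, and it assumes the genotype array is chunked along the sites dimension.

```python
from collections import OrderedDict


class ChunkCache:
    """Minimal LRU cache for decoded genotype chunks (sketch only).

    `array` is any 2D array-like chunked along axis 0 (sites), e.g. a
    zarr array. Reading a whole chunk slice decompresses it once and
    keeps the decoded result, working around zarr re-decoding chunks
    on every access (zarr-developers/zarr-python#306).
    """

    def __init__(self, array, chunk_size, max_chunks):
        self.array = array
        self.chunk_size = chunk_size
        self.max_chunks = max_chunks
        # Maps chunk index -> decoded ndarray; ordered oldest-first.
        self.cache = OrderedDict()

    def genotypes(self, site):
        chunk = site // self.chunk_size
        try:
            data = self.cache[chunk]
            self.cache.move_to_end(chunk)  # Mark as most recently used.
        except KeyError:
            start = chunk * self.chunk_size
            # Slicing a full chunk triggers a single decode.
            data = self.array[start : start + self.chunk_size][:]
            self.cache[chunk] = data
            if len(self.cache) > self.max_chunks:
                self.cache.popitem(last=False)  # Evict least recently used.
        return data[site - chunk * self.chunk_size]
```

With `max_chunks` sized so the cache spans a typical ancestor's length, the sequential site accesses in break_ancestor / make_ancestor / compute_ancestral_state would hit the same decoded chunk repeatedly rather than re-reading it from storage.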

@savitakartik This is what is OOM killing your jobs!
