Description
As currently implemented, `AncestorsGenerator.add_sites` loads the entire sample data genotype array into memory. Since one of our currently intended inference targets has 1.8TB of genotypes, this is not feasible.
We can of course read the genotypes from the zarr-backed sample data as needed (in `break_ancestor`, `make_ancestor` and `compute_ancestral_state`), but care will need to be taken to have some kind of decoded chunk caching mechanism. I thought this might be simple to do with zarr, but zarr-developers/zarr-python#306 has been open for a long time.
A simple FIFO or LRU cache might not be too difficult to implement; as long as the cache is larger than the typical ancestor length, chunks shouldn't need to be loaded more than once.
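For illustration, here is a minimal sketch of what such an LRU decoded-chunk cache over a zarr genotype array could look like. The class and method names (`CachedGenotypes`, `site_genotypes`, etc.) are hypothetical and not part of tsinfer's API; it just assumes a 2D zarr array chunked along the sites dimension:

```python
# Minimal sketch of an LRU decoded-chunk cache over a zarr genotype array.
# All names here are illustrative, not existing tsinfer code.
import collections


class CachedGenotypes:
    """Serve per-site genotype rows from a chunked zarr array, decoding each
    chunk at most once while it stays in a small LRU cache."""

    def __init__(self, genotypes, max_cached_chunks=8):
        self.genotypes = genotypes             # zarr array, shape (num_sites, num_samples)
        self.chunk_size = genotypes.chunks[0]  # number of sites per chunk along axis 0
        self.max_cached_chunks = max_cached_chunks
        self._cache = collections.OrderedDict()

    def _get_chunk(self, chunk_index):
        # Return the decoded chunk, loading it and evicting the least
        # recently used entry if the cache is full.
        if chunk_index in self._cache:
            self._cache.move_to_end(chunk_index)
            return self._cache[chunk_index]
        start = chunk_index * self.chunk_size
        stop = min(start + self.chunk_size, self.genotypes.shape[0])
        chunk = self.genotypes[start:stop]     # decodes one chunk into a numpy array
        self._cache[chunk_index] = chunk
        if len(self._cache) > self.max_cached_chunks:
            self._cache.popitem(last=False)    # drop least recently used chunk
        return chunk

    def site_genotypes(self, site_index):
        # Genotypes for a single site, without materialising the whole array.
        chunk_index, offset = divmod(site_index, self.chunk_size)
        return self._get_chunk(chunk_index)[offset]
```

With something like this, the loops in `break_ancestor`, `make_ancestor` and `compute_ancestral_state` could call `site_genotypes(site_index)` instead of indexing a fully materialised array, and memory use would be bounded by `max_cached_chunks` decoded chunks rather than the whole genotype matrix.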
@savitakartik This is what is OOM killing your jobs!