Description
I've noticed that count_call_alleles and some methods using cohorts create unnecessary dependencies between chunks in the variants dimension. For example, the current task graph for observed_heterozygosity on a dataset with 10 chunks in the variants dimension looks like this:
In count_call_alleles this is a result of using da.empty to indicate the number of alleles for a gufunc. In observed_heterozygosity (and also diversity) it is caused by forcing the sample_cohort array to be a dask array, which doesn't achieve much because we immediately call compute on that array to get the number of cohorts. Replacing the dask arrays with numpy arrays in both cases results in the following equivalent task graph:
My understanding is that the second task graph should be more efficient to schedule at larger scales (can any dask experts confirm?). Is there any reason not to make such a change? I guess it makes the use of the sample_cohort array a little bit opaque.
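To make the da.empty case concrete, here is a minimal sketch of the pattern, not the actual sgkit code. The function one_hot and the variable names are hypothetical stand-ins for the gufunc, and da.blockwise is used here only because it makes the difference easy to see: a dask placeholder shows up as an extra graph layer shared by every output block, while a numpy placeholder is passed as a literal and adds no layer.

```python
import dask.array as da
import numpy as np


def one_hot(codes, alleles):
    # Hypothetical stand-in for the gufunc: `alleles` is consulted
    # only for its length, never for its contents.
    return (codes[:, None] == np.arange(len(alleles))).astype(np.uint8)


n_alleles = 4
# 100 "variants" split into 10 chunks.
codes = da.from_array(np.arange(100) % n_alleles, chunks=10)

# Variant 1: a dask placeholder. Its single chunk becomes an input
# to every output block, so all blocks share one dependency.
placeholder = da.empty(n_alleles, dtype=np.uint8)
with_dask = da.blockwise(
    one_hot, "va", codes, "v", placeholder, "a", dtype=np.uint8
)

# Variant 2: a numpy placeholder. An index of None marks it as a
# literal argument, so the output axis must be declared via new_axes.
with_numpy = da.blockwise(
    one_hot,
    "va",
    codes,
    "v",
    np.empty(n_alleles, dtype=np.uint8),
    None,
    new_axes={"a": n_alleles},
    dtype=np.uint8,
)

# The placeholder layer only appears in the first graph.
print(placeholder.name in with_dask.dask.layers)
print(placeholder.name in with_numpy.dask.layers)
```

Both variants compute the same values; only the graph structure differs. Whether the shared dependency matters for scheduling at scale is exactly the open question above, so this sketch only shows that the dependency exists.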