Avoid dependencies between chunks in variants dimensions caused by unnecessary dask usage #871

Closed
@timothymillar

I've noticed that count_call_alleles and some methods that use cohorts create unnecessary dependencies between chunks in the variants dimension. For example, the current task graph for observed_heterozygosity on a dataset with 10 chunks in the variants dimension looks like this:

[Figure: task graph (obshet_old)]

In count_call_alleles this results from using da.empty to indicate the number of alleles to a gufunc. In observed_heterozygosity (and also diversity) it is caused by forcing the sample_cohort array to be a dask array, which achieves little because we immediately call compute on that array to get the number of cohorts. Replacing the dask arrays with numpy arrays in both cases results in the following equivalent task graph:

[Figure: task graph (obshet_new)]
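
As a rough illustration of the sample_cohort part of this change, here is a minimal sketch. It is not the actual library code: the cohort ids are made-up data, and it assumes cohorts are labelled with consecutive integers starting at 0, so that the number of cohorts is the maximum id plus one.

```python
import numpy as np
import dask.array as da

# Hypothetical cohort assignment: one integer cohort id per sample.
sample_cohort = np.array([0, 0, 1, 1, 2, 2])

# Pattern described above: wrap the array in dask, then immediately
# call compute just to learn how many cohorts there are.
n_cohorts_via_dask = int(da.asarray(sample_cohort).max().compute()) + 1

# Proposed pattern: keep sample_cohort as a numpy array and read the
# number of cohorts eagerly, without creating any dask tasks.
n_cohorts_eager = int(sample_cohort.max()) + 1

print(n_cohorts_via_dask, n_cohorts_eager)
```

Both approaches give the same count; the difference is only that the numpy version never introduces a dask array for later tasks to depend on.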

My understanding is that the second task graph should be more efficient to schedule at larger scales (can any dask experts confirm?). Is there any reason not to make such a change? I guess it makes the use of the sample_cohort array a little bit opaque.

Labels: question (Further information is requested)