Avoid dependencies between chunks in the variants dimension caused by unnecessary dask usage

I've noticed that `count_call_alleles` and some methods that use cohorts create unnecessary dependencies between chunks in the variants dimension. For example, the current task graph for `observed_heterozygosity` on a dataset with 10 chunks in the variants dimension looks like this:
<details>
<summary>Task graph</summary>
<br>

![obshet_old](https://user-images.githubusercontent.com/14065102/177886561-2fed2c6b-7c02-4599-9db1-ab1deeacae1c.png)

</details>



In `count_call_alleles`, this is a result of using `da.empty` to indicate the number of alleles to a gufunc. In `observed_heterozygosity` (and also `diversity`), it is caused by forcing the `sample_cohort` array to be a dask array, which doesn't achieve much because we immediately call `compute` on that array to get the number of cohorts. Replacing the dask arrays in both cases with numpy arrays results in the following equivalent task graph:
<details>
<summary>Task graph</summary>
<br>

![obshet_new](https://user-images.githubusercontent.com/14065102/177887189-41e8c297-8e76-47c8-8142-276530617821.png)

</details>
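
To illustrate the `da.empty` case outside of sgkit, here is a minimal sketch. It is not the actual `count_call_alleles` implementation: the block function `count_alleles_block`, the toy `calls` array, and the use of `da.map_blocks` in place of the real gufunc are all assumptions made for illustration. The only point is that a dask dummy array contributes a task that every output block depends on, while a numpy dummy is passed to each block as a plain argument.

```python
import dask.array as da
import numpy as np


def count_alleles_block(genotypes, allele_dummy):
    """Count alleles per variant within a single block (hypothetical helper).

    genotypes: (variants, samples) integer array of allele indices.
    allele_dummy: 1-D array whose length is the number of alleles;
    its values are never read.
    """
    n_alleles = allele_dummy.shape[0]
    counts = np.zeros((genotypes.shape[0], n_alleles), dtype=np.int64)
    for allele in range(n_alleles):
        counts[:, allele] = (genotypes == allele).sum(axis=1)
    return counts


n_alleles = 3
rng = np.random.default_rng(0)
calls = rng.integers(0, n_alleles, size=(100, 8))  # toy data, not real genotypes
G = da.from_array(calls, chunks=(10, 8))  # 10 chunks along the variants axis
out_chunks = (G.chunks[0], (n_alleles,))

# Dask dummy: its single chunk is one task that all output blocks share.
dummy_dask = da.empty(n_alleles, dtype=np.uint8)
with_dask = da.map_blocks(
    count_alleles_block, G, dummy_dask, chunks=out_chunks, dtype=np.int64
)

# NumPy dummy: handed to each block directly, so no shared task is created.
dummy_numpy = np.empty(n_alleles, dtype=np.uint8)
with_numpy = da.map_blocks(
    count_alleles_block, G, dummy_numpy, chunks=out_chunks, dtype=np.int64
)

# Both variants compute the same counts; only the graph structure differs.
assert np.array_equal(with_dask.compute(), with_numpy.compute())

# Graph sizes for comparison (exact numbers may vary between dask versions).
print(len(with_dask.__dask_graph__()), len(with_numpy.__dask_graph__()))
```

Calling `.visualize()` on `with_dask` and `with_numpy` (requires graphviz) should show the same kind of difference as the two task graphs above, although the exact layout will depend on the dask version.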

My understanding is that the second task graph should be more efficient to schedule at larger scales (can any dask experts confirm?). Is there any reason not to make this change? I guess it makes the use of the `sample_cohort` array a little bit opaque.
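
For the `sample_cohort` case, a rough sketch of what I mean is below. The toy array, the use of `-1` for a sample without a cohort, and deriving the cohort count as `max + 1` are assumptions for illustration rather than a copy of the sgkit code.

```python
import dask.array as da
import numpy as np

# Toy cohort assignments; -1 marks a sample without a cohort (assumption).
sample_cohort = np.array([0, 0, 1, 1, 2, -1])

# Dask route: wraps the array in a graph only to compute it straight away.
n_cohorts_dask = int(da.asarray(sample_cohort).max().compute()) + 1

# NumPy route: same value, with no graph involved.
n_cohorts_numpy = int(np.asarray(sample_cohort).max()) + 1

assert n_cohorts_dask == n_cohorts_numpy == 3
```

Since the number of cohorts has to be known eagerly either way, keeping `sample_cohort` as a numpy array loses nothing here.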

