Track and improve the performance of allele counting method #49

Open
@eric-czech

The solution to https://github.com/pystatgen/sgkit/issues/3 implemented in https://github.com/pystatgen/sgkit/pull/36 is naive and may be unacceptably slow. It will be slow unless Dask fuses the loop over allele indexes into a single pass over the genotypes array, which it probably won't.
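For illustration, the kind of per-allele loop described above might look roughly like the sketch below. This is not the code from the PR: the function name `count_alleles_loop`, the `(variants, samples, ploidy)` layout, and the use of negative values for missing calls are all assumptions, and plain NumPy is used to keep the example small. With a Dask array, each comparison and sum would presumably become its own reduction over the genotypes.

```python
import numpy as np


def count_alleles_loop(gt, n_alleles):
    """Count alleles per variant, scanning `gt` once per allele index.

    gt: integer array of shape (variants, samples, ploidy).
        Negative values are assumed to mark missing calls.
    """
    # One comparison-and-sum per allele index, i.e. n_alleles full scans.
    counts = [(gt == i).sum(axis=(1, 2)) for i in range(n_alleles)]
    # Stack into shape (variants, n_alleles).
    return np.stack(counts, axis=-1)


# Two variants, three diploid samples, up to three alleles (hypothetical data).
gt = np.array([
    [[0, 1], [1, 1], [-1, 0]],
    [[2, 2], [0, 2], [1, -1]],
])
print(count_alleles_loop(gt, n_alleles=3))
# [[2 3 0]
#  [1 1 3]]
```

The cost grows with the number of alleles because the whole array is re-read for every index.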

The extension proposed in https://github.com/pystatgen/sgkit/pull/36#issuecomment-656611356 would definitely solve the problem in a single pass if Dask supported counting rows the way NumPy does, but it currently doesn't.
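As a point of comparison, here is one way to count all alleles without looping over allele indexes in plain NumPy. It is not necessarily what the linked comment proposes; it is a sketch that shifts each variant into its own range of bin ids and calls `np.bincount` once. The function name and the missing-call convention (negative values) are assumptions. Whether this can be applied efficiently per chunk to a Dask array is left open here.

```python
import numpy as np


def count_alleles_single_pass(gt, n_alleles):
    """Count alleles per variant without a loop over allele indexes.

    gt: integer array of shape (variants, samples, ploidy).
        Negative values are assumed to mark missing calls.
    """
    n_variants = gt.shape[0]
    # Collapse samples and ploidy: one row of calls per variant.
    flat = gt.reshape(n_variants, -1)
    # Give each variant its own block of n_alleles bin ids.
    offsets = np.arange(n_variants)[:, None] * n_alleles
    # Drop missing calls and anything outside the expected allele range.
    valid = (flat >= 0) & (flat < n_alleles)
    bins = (flat + offsets)[valid]
    counts = np.bincount(bins, minlength=n_variants * n_alleles)
    return counts.reshape(n_variants, n_alleles)


gt = np.array([
    [[0, 1], [1, 1], [-1, 0]],
    [[2, 2], [0, 2], [1, -1]],
])
print(count_alleles_single_pass(gt, n_alleles=3))
# [[2 3 0]
#  [1 1 3]]
```

The amount of work here does not depend on `n_alleles`, at the cost of a temporary array of bin ids the same size as the input.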

There may be other efficient ways to do this without dropping down to writing custom kernels. In any case, we should track the performance of this implementation (and others) as part of a benchmark suite, as @alimanfoo mentioned in https://github.com/pystatgen/sgkit/pull/36#issuecomment-658893949, so that we can measure the impact of future iterations more passively and prevent regressions.
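A benchmark for this could be as small as the sketch below. It assumes an airspeed velocity (asv) style suite, which this issue does not specify, and uses a hypothetical NumPy stand-in (`count_alleles_loop`) for whichever implementation is being tracked. The array sizes and parameter grid are arbitrary.

```python
import numpy as np


def count_alleles_loop(gt, n_alleles):
    # Hypothetical stand-in for the implementation under test.
    return np.stack(
        [(gt == i).sum(axis=(1, 2)) for i in range(n_alleles)], axis=-1
    )


class TimeCountAlleles:
    # asv-style parameter grid: one timing per (n_variants, n_alleles) pair.
    params = ([1_000, 10_000], [2, 8])
    param_names = ["n_variants", "n_alleles"]

    def setup(self, n_variants, n_alleles):
        # Fixed seed so runs are comparable across commits.
        rng = np.random.default_rng(0)
        # 100 diploid samples; -1 is assumed to mark a missing call.
        self.gt = rng.integers(
            -1, n_alleles, size=(n_variants, 100, 2), dtype=np.int8
        )

    def time_count_alleles_loop(self, n_variants, n_alleles):
        # Methods prefixed with `time_` are what asv times.
        count_alleles_loop(self.gt, n_alleles)
```

Varying `n_alleles` in the grid should make it visible whether an implementation's runtime scales with the number of alleles.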

Labels: core operations, performance
