
Investigate ragged arrays to represent alleles #634

Open
@tomwhite

Alleles are a challenge to represent efficiently in fixed-length arrays. There are a couple of problems:

  1. the number of alleles is not known until the whole VCF file has been processed
  2. there can be a very wide variation in the number of alt alleles (most variants will have one, but a few could have thousands)

Both these problems could be solved by using ragged arrays.
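
To make the trade-off concrete, here is a minimal sketch in plain Python comparing a padded fixed-length layout with a ragged layout stored as a flat list of values plus row offsets. The sample data and the helper names (`pad_alleles`, `to_ragged`, `get_row`) are made up for illustration and are not part of any existing API; in practice the two ragged components would presumably be stored as arrays rather than lists.

```python
# Hypothetical sample data: the alleles (REF followed by ALTs) for three variants.
alleles_per_variant = [
    ["A", "T"],
    ["G", "C", "GA"],
    ["T", "TA", "TAA", "TAAA", "C"],
]


def pad_alleles(rows, fill=""):
    """Fixed-length layout: pad every row to the width of the widest row.

    The width is only known after all rows have been seen.
    """
    width = max(len(row) for row in rows)
    return [row + [fill] * (width - len(row)) for row in rows]


def to_ragged(rows):
    """Ragged layout: one flat list of values plus a list of row offsets.

    Row i occupies values[offsets[i]:offsets[i + 1]]. Rows are appended one
    at a time, so no maximum width has to be known in advance.
    """
    values, offsets = [], [0]
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return values, offsets


def get_row(values, offsets, i):
    """Recover the alleles of variant i from the ragged layout."""
    return values[offsets[i]:offsets[i + 1]]


padded = pad_alleles(alleles_per_variant)
values, offsets = to_ragged(alleles_per_variant)

padded_cells = sum(len(row) for row in padded)  # cells stored, including fill
ragged_cells = len(values)                      # cells stored, no fill
```

In this sketch the padded layout stores 15 cells for 10 real alleles, and the gap grows with the widest variant. The ragged layout stores only the real values, and because rows are appended incrementally it does not need the allele count up front.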

Zarr has support for ragged arrays, but these don't currently work with variable-length strings (needed for alleles), and they don't fit the Xarray data model, which assumes fixed-size dimensions. There is a good discussion of the problem in pydata/xarray#4285, in the context of Awkward Array.

Labels: data representation (issues related to how data is represented: data types, data structures, indexes, access methods, etc.)
