Open
Description
Alleles are a challenge to represent efficiently in fixed-length arrays. There are a couple of problems:
- the number of alleles is not known until the whole VCF file has been processed
- there can be a very wide variation in the number of alt alleles (most variants will have one, but a few could have thousands)
Both these problems could be solved by using ragged arrays.
Zarr has support for ragged arrays, but these don't currently work with variable-length strings (needed for alleles), and they don't fit the Xarray data model, which assumes fixed-size dimensions. There is a good discussion of the problem in pydata/xarray#4285, in the context of Awkward Array.
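
To make the trade-off concrete, here is a minimal sketch comparing the two layouts using only NumPy. It is not how any existing library stores alleles: the sample data and the helper names `to_padded`, `to_ragged` and `ragged_row` are hypothetical, and the ragged layout shown (a flat values array plus an offsets array) is just one common way to encode variable-length rows.

```python
import numpy as np

# Hypothetical alleles for four variants (REF first, then ALT alleles).
alleles_per_variant = [
    ["A", "T"],
    ["G", "C", "T"],
    ["C", "CA"],
    ["T", "A", "G", "C", "TA"],
]


def to_padded(rows, fill=""):
    """Fixed-length layout: pad every row to the length of the longest row.

    The width is only known after every row has been seen, which mirrors
    the first problem above.
    """
    width = max(len(r) for r in rows)
    out = np.full((len(rows), width), fill, dtype=object)
    for i, r in enumerate(rows):
        out[i, : len(r)] = r
    return out


def to_ragged(rows):
    """Ragged layout: one flat values array plus row start/end offsets."""
    values = np.array([a for r in rows for a in r], dtype=object)
    offsets = np.zeros(len(rows) + 1, dtype=np.int64)
    offsets[1:] = np.cumsum([len(r) for r in rows])
    return values, offsets


def ragged_row(values, offsets, i):
    """Return the alleles of variant i from the ragged layout."""
    return list(values[offsets[i] : offsets[i + 1]])


padded = to_padded(alleles_per_variant)
values, offsets = to_ragged(alleles_per_variant)

# Number of padding slots the fixed-length layout spends on fill values.
wasted = padded.size - values.size
```

With this sample data the padded array has shape `(4, 5)` because of the one variant with five alleles, while the ragged layout stores only the twelve real alleles. A single variant with thousands of alt alleles would widen every row of the padded array in the same way.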