
An alternative approach to variable-length chunks: array concatenation within Zarr #2536

@rabernat

We've discussed variable-length chunks at length (🥁🤣) in many places, and have an implementation based on ZEP3 in PR #1595. While the ZEP / extension process remains broken, I'd like to propose that we experiment with implementations of different approaches to the problem within Zarr Python. I'm deliberately avoiding framing this discussion in terms of how we should change / extend the spec, and instead framing it in terms of how we can solve our problem, with the idea that the spec would follow. If we like this idea, we can try to implement it and make it work first, and then figure out what sort of extension or spec might be required.

The main reason that many people in our community want variable-length chunks is so that we can use them together with kerchunk / VirtualiZarr to create virtual Zarr data on top of existing HDF5 files. However, HDF5, like Zarr, only supports fixed-shape chunks. So why do we need variable-length chunks?

Because, in addition to virtualizing a single file, we are also concatenating many individual files into one virtual Array.

So what if we just built the ability to address multiple arrays within a Zarr group as a single array?

Imagine we have two arrays, array-0 and array-1, with the same dtype and concatenable shapes, e.g. (2, 50, 10) and (3, 50, 10), stored in the same group:

my-store/
├─ my-array-group/
│  ├─ zarr.json
│  ├─ array-0/
│  │  ├─ zarr.json
│  │  ├─ c/0/0
│  │  ├─ c/0/2
│  ├─ array-1/
│  │  ├─ zarr.json
│  │  ├─ c/0/0
│  │  ├─ c/0/2
│  │  ├─ c/0/3

We can concatenate them along axis 0 and obtain an array of shape (5, 50, 10). This can happen completely lazily. All that's required is a bit of indexing logic to map requests against the concatenated array to the appropriate indices within the sub-arrays. I'm sure many of us have written code like this already (a sketch follows below). That code could live in Zarr Python as a new type of Array, implementing mostly the same interface as our existing Array. (Is it writeable? I don't see why it can't be, tbh.) Concatenation could be done for any Zarr arrays that have compatible dtypes and shapes. It's completely lazy and cheap.
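Here is a minimal sketch of that indexing logic. It assumes only that the sub-arrays expose .shape, .dtype, and numpy-style slice indexing (true of numpy arrays and zarr.Array alike); the class name is hypothetical, and steps / fancy indexing are left out for brevity:

import numpy as np

class LazyConcatenatedArray:
    """Hypothetical sketch: present several array-likes as one array
    concatenated along `axis`. Handles plain slice indexing only."""

    def __init__(self, arrays, axis=0):
        self.arrays = list(arrays)
        self.axis = axis
        # offsets[i] is where arrays[i] begins along the concatenation axis
        lengths = [a.shape[axis] for a in self.arrays]
        self.offsets = np.concatenate([[0], np.cumsum(lengths)])
        shape = list(self.arrays[0].shape)
        shape[axis] = int(self.offsets[-1])
        self.shape = tuple(shape)
        self.dtype = self.arrays[0].dtype

    def __getitem__(self, key):
        # pad the key with full slices so there is one slice per dimension
        if not isinstance(key, tuple):
            key = (key,)
        key = key + (slice(None),) * (len(self.shape) - len(key))
        start, stop, _ = key[self.axis].indices(self.shape[self.axis])
        pieces = []
        for a, off in zip(self.arrays, self.offsets[:-1]):
            lo = max(start - off, 0)
            hi = min(stop - off, a.shape[self.axis])
            if lo < hi:  # this sub-array overlaps the requested range
                sub_key = key[:self.axis] + (slice(lo, hi),) + key[self.axis + 1:]
                pieces.append(a[sub_key])
        return np.concatenate(pieces, axis=self.axis)

a0 = np.zeros((2, 50, 10))
a1 = np.ones((3, 50, 10))
arr = LazyConcatenatedArray([a0, a1], axis=0)
assert arr.shape == (5, 50, 10)
arr[1:4]  # spans the boundary, so it reads from both sub-arrays

Reads that span a boundary are split across the sub-arrays and stitched back together; a write path would invert the same mapping.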

The next step is to define some metadata in my-array-group that tells us to do this automatically for the arrays in the group. For example, something like:

{
  "extensions": {
    "name": "concatenation",
    "configuration": {
      "axis": 0,
      "arrays": ["array-0", "array-1"]
    }
  }
}

With an appropriate configuration system and API, when opening such a group, we could automatically present it as an Array to downstream libraries, e.g. Xarray.
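As a rough sketch of what that could look like, assuming the metadata layout proposed above and reusing the LazyConcatenatedArray sketch from earlier (reading zarr.json by hand here stands in for whatever configuration system Zarr Python would actually grow):

import json
import zarr

# read the proposed (hypothetical) extension metadata from the group's zarr.json
with open("my-store/my-array-group/zarr.json") as f:
    cfg = json.load(f)["extensions"]["configuration"]

group = zarr.open_group("my-store/my-array-group", mode="r")
subarrays = [group[name] for name in cfg["arrays"]]
arr = LazyConcatenatedArray(subarrays, axis=cfg["axis"])  # presents shape (5, 50, 10)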

There are many variants of this metadata, and many questions about how it should work:

  • Do we explicitly list all the array names we want to concat, as in my approach?
  • Or do we just use a naming convention for the arrays themselves, like we do for chunks?
  • Or do we use a template approach, like "array_name_template": "array-{i}" (sketched below)?
  • Do we store information about the dtype, shape, etc. in this metadata?
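Purely as an illustration, the template variant might look like this (the num_arrays field is my invention, added to make the template enumerable):

{
  "extensions": {
    "name": "concatenation",
    "configuration": {
      "axis": 0,
      "array_name_template": "array-{i}",
      "num_arrays": 2
    }
  }
}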

Compared to variable-length chunks, this proposal operates at a higher level of abstraction. Its advantages are:

  • No change required to the array spec itself
  • Less metadata to keep track of (enumerating the length of every chunk could be expensive)
  • Can concatenate arrays with different codecs, and potentially different dtypes, into one array
  • Potentially supports multidimensional concatenation
  • The sub-arrays are just regular Zarr arrays, providing a fallback mechanism for implementations that don't support concatenation.

The main downsides I see are:

  • The array shape is implicit; computing it requires reading all of the sub-array metadata
  • Indexing into the data is more expensive, because it requires more logic and more metadata reads to figure out where the data actually live

The ability to quickly load lots of metadata for different arrays, whether through async requests, consolidated metadata, or Icechunk, mitigates these issues considerably for high-latency stores.

For a scenario like zarr-developers/VirtualiZarr#330, we wouldn't necessarily have to do one file : one array. We could bundle as many regularly-sized chunks as we can into a single virtual array and then, when we hit a chunk discontinuity, start a new virtual array (see the sketch below). It's very flexible.
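As an illustration of that bundling rule, here is a small hypothetical helper (the name and the one-dimensional framing are mine) that splits a sequence of per-file chunk lengths into runs of uniform length, where each run could become one virtual array:

def split_at_discontinuities(chunk_lengths):
    """Group per-file chunk lengths into runs of equal length; each run
    could be virtualized as one fixed-chunk array."""
    runs, current = [], [chunk_lengths[0]]
    for n in chunk_lengths[1:]:
        if n == current[-1]:
            current.append(n)
        else:  # chunk discontinuity: close out the current run
            runs.append(current)
            current = [n]
    runs.append(current)
    return runs

print(split_at_discontinuities([10, 10, 10, 7, 10, 10]))
# [[10, 10, 10], [7], [10, 10]]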

What I like about this proposal is that we can implement the hard part--the actual concatenation logic--without any spec changes at all. We could manually concatenate arrays, check the performance, get comfortable with the tradeoffs, etc. before doing any spec work.

cc @TomNicholas, @abarciauskas-bgse.
