Control chunksize of the underlying zarrdata #406

Open
@chc-incom

I would appreciate having more control over how Kerchunk writes "refs", especially over the chunking.

Context:
I have been using fsspec and kerchunk to store my data while continuously expanding my dataset.
My data always has the same dimensionality, and even the same coordinates, except along one dimension: "release".

When using the SingleHdf5ToZarr class in Kerchunk, I have no control over the zarr group/store that is created.
I think this is because the following part of __init__ is hardcoded and cannot be changed through any method:

        self.store = {}
        self._zroot = zarr.group(store=self.store, overwrite=True)

My data files all have the same shape: datafile_shape=(1,n2,n3,n4)

When running SingleHdf5ToZarr(...).translate() on my old data, I get back references with a seemingly arbitrary chunk size (1,n_c2,n_c3,n_c4).

Now that I have updated some dependencies in my environment, I get a different, equally arbitrary chunk size (1,n_c2',n_c3',n_c4').

Ideally, I would simply have chunksize = datafile_shape here. The fatal issue, however, is that I can no longer combine new and old data with MultiZarrToZarr. When I try to combine my kerchunk metadata, I get:

ValueError: Found chunk size mismatch:
                        at prefix [my variable name] in iteration 544 (file None)
                        new chunk: [1, 63, 200, 261]
                        chunks so far: [1, 42, 133, 174]
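
To locate which files disagree before calling MultiZarrToZarr, the chunk shapes can be read straight out of the reference sets. The following is a minimal sketch, assuming each reference set is the dict returned by translate(), with zarr v2 array metadata stored as JSON under keys ending in ".zarray". The helper names and the variable name "temperature" are hypothetical; the chunk values are the ones from the error above.

```python
import json


def chunks_by_variable(refs):
    """Return {variable_path: chunk_shape} for one kerchunk reference set."""
    # Accept both the versioned layout {"version": 1, "refs": {...}}
    # and a bare key -> value mapping (assumption about the input).
    refs = refs.get("refs", refs)
    out = {}
    for key, value in refs.items():
        if key.endswith(".zarray"):
            # The array metadata is normally a JSON string; tolerate a dict too.
            meta = json.loads(value) if isinstance(value, (str, bytes)) else value
            var = key[: -len(".zarray")].rstrip("/")
            out[var] = tuple(meta["chunks"])
    return out


def find_chunk_mismatches(ref_sets):
    """Compare each reference set with the first one seen per variable.

    Returns a list of (index, variable, expected_chunks, found_chunks).
    """
    baseline = {}
    mismatches = []
    for i, refs in enumerate(ref_sets):
        for var, chunks in chunks_by_variable(refs).items():
            expected = baseline.setdefault(var, chunks)
            if chunks != expected:
                mismatches.append((i, var, expected, chunks))
    return mismatches


# Hypothetical reference sets reproducing the two chunk shapes from the error.
old = {"version": 1, "refs": {"temperature/.zarray": json.dumps({"chunks": [1, 42, 133, 174]})}}
new = {"version": 1, "refs": {"temperature/.zarray": json.dumps({"chunks": [1, 63, 200, 261]})}}

print(find_chunk_mismatches([old, new]))
# [(1, 'temperature', (1, 42, 133, 174), (1, 63, 200, 261))]
```

This only reports the mismatch; it does not change how the chunks are laid out in the underlying files.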

Problem in short:

  • I have one distributed dataset that is constantly expanding. Old data is no longer compatible with new data because my environment has changed slightly.
  • The seemingly arbitrary chunk size chosen by Kerchunk differs between old and new data, which prevents me from combining my dataset properly.
  • Ideally, I would like to choose the chunk size explicitly, as I can in xarray. In that case I would prefer my chunk sizes to equal my data files' shapes.
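
To illustrate what the requested layout would mean for the references, here is a small sketch that lists the chunk keys a variable would have for a given shape and chunk size. It assumes zarr v2 chunk keys with the default "." separator; the function name, the variable name "v", and the shape (1, 4, 6, 8) are hypothetical and chosen only for illustration.

```python
import itertools
import math


def chunk_keys(var, shape, chunks):
    """List the zarr v2 chunk keys for an array of `shape` split into `chunks`."""
    # Number of chunks along each dimension (the last chunk may be partial).
    counts = [math.ceil(s / c) for s, c in zip(shape, chunks)]
    return [
        f"{var}/" + ".".join(str(i) for i in idx)
        for idx in itertools.product(*(range(n) for n in counts))
    ]


shape = (1, 4, 6, 8)  # hypothetical datafile_shape

# Chunk size equal to the file shape: a single chunk per variable per file.
print(chunk_keys("v", shape, shape))
# ['v/0.0.0.0']

# A smaller chunk size splits the same file into several references.
print(len(chunk_keys("v", shape, (1, 2, 3, 4))))
# 8
```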
