Description
I would appreciate having more control over the way Kerchunk writes "refs", especially over the chunking.
Context:
I have been using fsspec and kerchunk to store my data while continuously expanding my dataset.
My data always has the same dimensionality, and even the same coordinates, except along one dimension: "release".
When using the class SingleHdf5ToZarr in Kerchunk, I have no control over the zarr group/store that is created.
I think this is because this part of `__init__` is hardcoded and not mutable through any method:

```python
self.store = {}
self._zroot = zarr.group(store=self.store, overwrite=True)
```
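For context, my reference-generation step looks roughly like this (the file path is a placeholder):

```python
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr

url = "data/file_000.h5"  # placeholder path
with fsspec.open(url, "rb") as f:
    # One reference set per data file; the chunking recorded in the refs
    # is decided internally by SingleHdf5ToZarr, with no way to override it.
    refs = SingleHdf5ToZarr(f, url).translate()
```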
My data files all have the same shape: `datafile_shape = (1, n2, n3, n4)`.
When running `SingleHdf5ToZarr(...).translate()` on my old data, I get back references with some arbitrary chunk size `(1, n_c2, n_c3, n_c4)`.
Now that I have updated some dependencies in my environment, I get a different arbitrary chunk size `(1, n_c2', n_c3', n_c4')`.
Ideally, I would simply have had `chunksize = datafile_shape` here. But the fatal issue is that I can no longer combine new and old data with MultiZarrToZarr. When I try to combine my kerchunk reference sets, I get:
```
ValueError: Found chunk size mismatch:
    at prefix [my variable name] in iteration 544 (file None)
    new chunk: [1, 63, 200, 261]
    chunks so far: [1, 42, 133, 174]
```
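For reference, the combine step looks roughly like this (`old_refs`/`new_refs` are lists of the reference dicts produced above):

```python
from kerchunk.combine import MultiZarrToZarr

# The error above is raised while MultiZarrToZarr scans the per-file chunk
# grids, because the old and new refs record different chunk sizes.
mzz = MultiZarrToZarr(
    old_refs + new_refs,
    concat_dims=["release"],  # the only dimension that grows
)
combined = mzz.translate()
```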
Problem in short:
- I have one distributed dataset that is constantly expanding. Old data is no longer compatible with new data because my environment has changed slightly.
- The arbitrary chunk size chosen by Kerchunk differs between old and new data, which prevents me from combining my dataset properly.
- Ideally, I would like to choose the chunk size explicitly, as I can in xarray. Then I would prefer my chunk sizes to equal my data files' shape (see the sketch below).
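As a minimal sketch of what I mean, assuming a hypothetical `chunks` keyword (it does not currently exist in SingleHdf5ToZarr), alongside the control xarray already offers when writing zarr directly:

```python
# Hypothetical: "chunks" is NOT an existing SingleHdf5ToZarr parameter.
refs = SingleHdf5ToZarr(f, url, chunks=(1, n2, n3, n4)).translate()

# The analogous control in xarray, via per-variable encoding:
import xarray as xr

ds = xr.open_dataset("data/file_000.h5")  # placeholder path
ds.to_zarr(
    "out.zarr",
    encoding={"myvar": {"chunks": (1, n2, n3, n4)}},  # "myvar" is a placeholder
)
```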