Open
Description
While experimenting with kerchunking some Icesat2 ATL08 data I noticed an issue where using MultiZarrToZarr
with non-dimension coordinates that had partial chunks resulted in empty values for those variables in the ouput kerchunk index.
A minimal example
import xarray as xr
import numpy as np
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr
# Set up fake netCDF files with latitude as a non-dimension coordinate.
time_1_3 = xr.Dataset(
{
"delta_time": [12, 13, 14]
},
coords={
'latitude': ('delta_time', [1, 2, 3]),
})
time_15_17 = xr.Dataset(
{
"delta_time": [15, 16, 17]
},
coords={
'latitude': ('delta_time', [4, 5, 6]),
})
time_18_22 = xr.Dataset(
{
"delta_time": [18, 19, 20, 21, 22]
},
coords={
'latitude': ('delta_time', [7, 8, 9, 10, 11]),
})
time_1_3
# Create netCDFs with chunksize aligned to data size
chunksize = 3
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_1_3.to_netcdf('time_1_3.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
def create_reference(files: list[str]):
single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=0).translate() for filepath in files]
mzz = MultiZarrToZarr(
single_jsons,
concat_dims=["delta_time"],
)
combined_test_json = mzz.translate()
combined_test = xr.open_dataset(
"reference://", engine="zarr",
backend_kwargs={
"storage_options": {
"fo": combined_test_json,
},
"consolidated": False,
}
)
return combined_test
# This works as expected
combined_test_3 = create_reference(["time_1_3.nc", "time_15_17.nc"])
combined_test_3.latitude.data
# Create netCDFs with a larger chunksize resulting in partially filled chunks not aligned to data size
chunksize = 10
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_18_22.to_netcdf('time_18_22.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
# This results in empty values for the resulting non-dimension coordinate variable.
combined_test_10 = create_reference(["time_15_17.nc", "time_18_22.nc"])
combined_test_10.latitude.data
This seems potentially related to some of the discussion in #305 (as it is also describing the case of data not aligned with chunk size).
If latitude
is promoted to a concat_dim
the output is correct (with all of the latitude
values included).
I may be misunderstanding the MultiZarrToZarr
logic in this case where we have regularly sized, partially filled chunks. Is it possible to have a non-dimension variable concatenated in a linear fashion in this situation?
Metadata
Metadata
Assignees
Labels
No labels