Unexpected behavior with MultiZarrToZarr with partial chunks. #400

Open
@sharkinsspatial

While experimenting with kerchunking some ICESat-2 ATL08 data, I noticed an issue: using MultiZarrToZarr with non-dimension coordinates that have partial chunks results in empty values for those variables in the output kerchunk index.

A minimal example

import xarray as xr
import numpy as np
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Set up fake netCDF files with latitude as a non-dimension coordinate.
time_1_3 = xr.Dataset(
    {
        "delta_time": [12, 13, 14]
    },
    coords={
        'latitude': ('delta_time', [1, 2, 3]),
 })
time_15_17 = xr.Dataset(
    {
        "delta_time": [15, 16, 17]
    },
    coords={
        'latitude': ('delta_time', [4, 5, 6]),
})
time_18_22 = xr.Dataset(
    {
        "delta_time": [18, 19, 20, 21, 22]
    },
    coords={
        'latitude': ('delta_time', [7, 8, 9, 10, 11]),
 })
time_1_3

[Screenshot: output of time_1_3]

# Create netCDFs with chunksize aligned to data size
chunksize = 3
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_1_3.to_netcdf('time_1_3.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

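def create_reference(files: list[str]):
    """Build a combined kerchunk reference for the files and open it with xarray."""
    # One reference set per file; inline_threshold=0 disables inlining of small chunks.
    single_jsons = [SingleHdf5ToZarr(filepath, inline_threshold=0).translate() for filepath in files]
    # Concatenate the per-file references along delta_time.
    mzz = MultiZarrToZarr(
        single_jsons,
        concat_dims=["delta_time"],
    )
    combined_test_json = mzz.translate()

    # Open the combined references as a Zarr store through the reference filesystem.
    combined_test = xr.open_dataset(
        "reference://", engine="zarr",
        backend_kwargs={
            "storage_options": {
                "fo": combined_test_json,
            },
            "consolidated": False,
        }
    )
    return combined_test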

# This works as expected
combined_test_3 = create_reference(["time_1_3.nc", "time_15_17.nc"])
combined_test_3.latitude.data

[Screenshot: output of combined_test_3.latitude.data]

# Create netCDFs with a larger chunksize resulting in partially filled chunks not aligned to data size
chunksize = 10
encoding = {"latitude": {"chunksizes": (chunksize,)},"delta_time": {"chunksizes": (chunksize,)}}
time_15_17.to_netcdf('time_15_17.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])
time_18_22.to_netcdf('time_18_22.nc', encoding=encoding, engine="h5netcdf", unlimited_dims=["delta_time"])

# This results in empty values for the resulting non-dimension coordinate variable.
combined_test_10 = create_reference(["time_15_17.nc", "time_18_22.nc"])
combined_test_10.latitude.data

[Screenshot: output of combined_test_10.latitude.data]
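To see which chunk references actually end up in the combined output, one option is to inspect the dictionary returned by `mzz.translate()` directly, rather than going through xarray. The helper below is only a sketch: `chunk_keys` and `example_refs` are hypothetical names, and the reference set is written by hand for illustration (the offsets and lengths are made up, not taken from the real files).

```python
def chunk_keys(refs, variable):
    """Return the chunk keys stored for one variable in a kerchunk reference dict."""
    # Version 1 reference sets nest their keys under "refs"; fall back to a flat dict.
    store = refs.get("refs", refs)
    prefix = variable + "/"
    # Keep keys under the variable, skipping metadata such as .zarray and .zattrs.
    return sorted(
        key for key in store
        if key.startswith(prefix) and not key[len(prefix):].startswith(".")
    )


# Hypothetical, hand-written reference set (offsets and lengths are invented).
example_refs = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        "latitude/.zarray": "{}",
        "latitude/.zattrs": "{}",
        "latitude/0": ["time_15_17.nc", 4096, 80],
        "delta_time/0": ["time_15_17.nc", 8192, 80],
    },
}

print(chunk_keys(example_refs, "latitude"))
```

Running the same helper on the real combined references for the chunksize 3 and chunksize 10 cases should show whether both files contribute a `latitude` chunk key in each case.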

This seems potentially related to some of the discussion in #305, which also describes data that is not aligned with the chunk size.

If latitude is promoted to a concat_dim the output is correct (with all of the latitude values included).

I may be misunderstanding the MultiZarrToZarr logic in this case where we have regularly sized, partially filled chunks. Is it possible to have a non-dimension variable concatenated in a linear fashion in this situation?
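For context on why alignment might matter here, the arithmetic of a regular chunk grid can be sketched in plain Python. This is not a description of what MultiZarrToZarr does internally; it only shows which index range each chunk key covers when every chunk has the same nominal size, as in a Zarr v2 array. The function name `chunk_ranges` is hypothetical.

```python
def chunk_ranges(length, chunksize):
    """Map each chunk index of a regular 1-D chunk grid to the index range it covers."""
    n_chunks = -(-length // chunksize)  # ceiling division
    return {
        i: (i * chunksize, min((i + 1) * chunksize, length))
        for i in range(n_chunks)
    }


# chunksize 3, two files of 3 values each: one chunk key per file.
print(chunk_ranges(3 + 3, 3))

# chunksize 10, files of 3 and 5 values: the combined length of 8 fits in one chunk key.
print(chunk_ranges(3 + 5, 10))
```

In the aligned case there are two chunk keys, so each file's stored chunk can map to its own key. In the unaligned case the combined array of length 8 has a single chunk key covering indices 0 to 8, while the two files each hold their own partially filled stored chunk, so there is no one-to-one mapping from stored chunks to keys.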
