fsspec v2022.10.0 breaks MultiZarrToZarr #246

Closed
fsspec/filesystem_spec
#1087
@arongergely

Description

My environment:
Ubuntu Jammy, Python 3.8

I am creating virtual zarr files from NetCDF4s and unifying them into a single virtual Zarr as follows:

import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr


def create_virtual_zarr(nc_filepath):
    """
    Parses NetCDF4 file into a virtual Zarr.
    """
    open_args = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first')
    with fsspec.open(nc_filepath, **open_args) as f:
        h5chunks = SingleHdf5ToZarr(f, nc_filepath, inline_threshold=-1)
        single_file_json = h5chunks.translate()
    return single_file_json


filepaths = ['./file1.nc', './file2.nc', './file3.nc']

# create virtual Zarrs
virtual_zarrs = list(map(create_virtual_zarr, filepaths))

# combine all virtual Zarr files into one
single_virtual_zarr = MultiZarrToZarr(
    virtual_zarrs,
    remote_protocol='file',
    concat_dims=['time'],
    identical_dims=['y', 'x']
).translate()

This worked with kerchunk v0.0.9 and fsspec 2022.8.2.

However, after upgrading fsspec to 2022.10.0, I get an error when calling MultiZarrToZarr(...).translate():

Traceback (most recent call last):
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 635, in <module>
    main()
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 632, in main
    forecast_statistics(None, proc_params)
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 373, in forecast_statistics
    single_virtual_zarr = MultiZarrToZarr(
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/kerchunk/combine.py", line 468, in translate
    self.second_pass()
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/kerchunk/combine.py", line 452, in second_pass
    bits = fs.cat(list(to_download.values()))
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/implementations/reference.py", line 345, in cat
    bytes_out = fs.cat_ranges(new_paths, new_starts, new_ends)
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/spec.py", line 765, in cat_ranges
    out.append(self.cat_file(p, s, e))
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/spec.py", line 719, in cat_file
    return f.read(end - f.tell())
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/implementations/local.py", line 337, in read
    return self.f.read(*args, **kwargs)
ValueError: read length must be non-negative or -1

Process finished with exit code 1
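
For anyone reading the traceback: the last fsspec frame calls f.read(end - f.tell()), so the ValueError appears whenever the requested end offset lies more than one byte before the current file position. Below is a minimal sketch of that failure mode using only the standard library. The helper names read_range and demo are hypothetical and only mirror the seek/read pattern visible in the traceback; this is not fsspec's code.

```python
import os
import tempfile


def read_range(path, start, end):
    """Read bytes [start, end) from a local file.

    Hypothetical helper that mirrors the seek-then-read pattern
    shown in the traceback (f.read(end - f.tell())).
    """
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - f.tell())


def demo():
    # Write a small scratch file to read from.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"0123456789")
        path = tmp.name
    try:
        # A well-formed range: start < end.
        ok = read_range(path, 2, 5)
        # A corrupted range: end < start gives a negative read length.
        try:
            read_range(path, 5, 2)
            error = None
        except ValueError as exc:
            error = str(exc)
    finally:
        os.remove(path)
    return ok, error


ok, error = demo()
print(ok)
print(error)
```

So any (start, end) pair that gets swapped or mismatched upstream is enough to trigger the error on a local file.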

I dug into the code, and the culprit seems to be the changes to ReferenceFileSystem in fsspec: fsspec/filesystem_spec#1063

When ReferenceFileSystem.cat() is called within MultiZarrToZarr.second_pass(), the start/end positions of the referenced byte ranges are not preserved correctly, so some ranges end up with zero length.

While debugging, I found that the incorrect starts/ends are introduced by this block, apparently due to a sorting error: https://github.com/fsspec/filesystem_spec/blob/2022.10.0/fsspec/implementations/reference.py#L337-L344
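
To illustrate the kind of sorting error I mean, here is a rough sketch of how reordering parallel lists can break (start, end) pairs. This is an assumption-laden toy example, not the actual fsspec code: the function names sort_partial, sort_together and lengths, and the file names and offsets, are all made up for illustration.

```python
def sort_partial(paths, starts, ends):
    """Hypothetical buggy pattern: paths and starts are reordered, ends are not."""
    pairs = sorted(zip(paths, starts))
    new_paths = [p for p, _ in pairs]
    new_starts = [s for _, s in pairs]
    return new_paths, new_starts, list(ends)


def sort_together(paths, starts, ends):
    """Sort (path, start, end) triples as a unit so each range stays intact."""
    triples = sorted(zip(paths, starts, ends))
    return (
        [t[0] for t in triples],
        [t[1] for t in triples],
        [t[2] for t in triples],
    )


def lengths(starts, ends):
    """Byte length of each range; a negative value means the pair is broken."""
    return [e - s for s, e in zip(starts, ends)]


# Made-up references: b.nc covers bytes 500-900, a.nc covers bytes 100-200.
paths = ["b.nc", "a.nc"]
starts = [500, 100]
ends = [900, 200]

_, bad_starts, bad_ends = sort_partial(paths, starts, ends)
_, good_starts, good_ends = sort_together(paths, starts, ends)

bad_lengths = lengths(bad_starts, bad_ends)
good_lengths = lengths(good_starts, good_ends)
print(bad_lengths)   # [800, -300]
print(good_lengths)  # [100, 400]
```

In the toy case the second range comes out with a negative length, which is the kind of value that would make a downstream read call fail as in the traceback above.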
