Description
My environment: Ubuntu Jammy, Python 3.8.
I am creating virtual Zarr files from NetCDF4 files and combining them into a single virtual Zarr as follows:
import fsspec
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr


def create_virtual_zarr(nc_filepath):
    """
    Parses NetCDF4 file into a virtual Zarr.
    """
    open_args = dict(mode='rb', anon=True, default_fill_cache=False, default_cache_type='first')
    with fsspec.open(nc_filepath, **open_args) as f:
        h5chunks = SingleHdf5ToZarr(f, nc_filepath, inline_threshold=-1)
        single_file_json = h5chunks.translate()
    return single_file_json


filepaths = ['./file1.nc', './file2.nc', './file3.nc']

# create virtual Zarrs
virtual_zarrs = list(map(create_virtual_zarr, filepaths))

# combine all virtual Zarr files into one
single_virtual_zarr = MultiZarrToZarr(
    virtual_zarrs,
    remote_protocol='file',
    concat_dims=['time'],
    identical_dims=['y', 'x']
).translate()
This worked with kerchunk v0.0.9 and fsspec 2022.8.2.
However, after upgrading fsspec to 2022.10.0, I get an error when calling MultiZarrToZarr(...).translate():
Traceback (most recent call last):
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 635, in <module>
    main()
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 632, in main
    forecast_statistics(None, proc_params)
  File "/home/aron/repos/ensemble-statistics/wrf_deterministic_spatial_statistics.py", line 373, in forecast_statistics
    single_virtual_zarr = MultiZarrToZarr(
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/kerchunk/combine.py", line 468, in translate
    self.second_pass()
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/kerchunk/combine.py", line 452, in second_pass
    bits = fs.cat(list(to_download.values()))
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/implementations/reference.py", line 345, in cat
    bytes_out = fs.cat_ranges(new_paths, new_starts, new_ends)
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/spec.py", line 765, in cat_ranges
    out.append(self.cat_file(p, s, e))
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/spec.py", line 719, in cat_file
    return f.read(end - f.tell())
  File "/home/aron/miniconda3/envs/ensemble-statistics/lib/python3.8/site-packages/fsspec/implementations/local.py", line 337, in read
    return self.f.read(*args, **kwargs)
ValueError: read length must be non-negative or -1

Process finished with exit code 1
I dug into the code, and the culprit seems to be some changes to ReferenceFileSystem in fsspec: fsspec/filesystem_spec#1063
When ReferenceFileSystem.cat() is called within MultiZarrToZarr.second_pass(), the start/end positions of the datasets are not preserved correctly, which leads to some zero-length data.
While debugging, I found that the incorrect starts/ends are introduced by this subroutine because of a sorting error: https://github.com/fsspec/filesystem_spec/blob/2022.10.0/fsspec/implementations/reference.py#L337-L344
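To illustrate the failure mode, here is a minimal standalone sketch. It is not the fsspec code. It assumes the problem amounts to the start and end offsets losing their pairing during a sort, and it reuses only the seek-then-read(end - tell()) pattern visible in the traceback. The helper names, offsets and scratch file are hypothetical.

```python
import os
import tempfile


def read_range(path, start, end):
    # Same pattern as the cat_file frame in the traceback:
    # seek to start, then read (end - current position) bytes.
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - f.tell())


def try_reads(pairs, size=500):
    """Read each (start, end) pair from a scratch file.

    Returns the number of bytes read, or the exception name on failure.
    """
    results = []
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "chunks.bin")
        with open(path, "wb") as f:
            f.write(bytes(size))
        for start, end in pairs:
            try:
                results.append(len(read_range(path, start, end)))
            except ValueError as exc:
                results.append(type(exc).__name__)
    return results


# Hypothetical byte ranges for three chunks (illustrative values only).
starts = [100, 400, 0]
ends = [400, 500, 100]

# Sorting the pairs together keeps every start matched with its own end.
good_pairs = sorted(zip(starts, ends))

# Sorting one list on its own breaks the pairing (assumed failure mode).
bad_pairs = list(zip(sorted(starts), ends))

print(try_reads(good_pairs))  # [100, 300, 100]
print(try_reads(bad_pairs))   # [400, 400, 'ValueError']
```

In the misaligned case the last pair is (400, 100), so the read length is -300 and the buffered file object rejects it with the same kind of ValueError as in the traceback above.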