xarray.merge virtual datasets fails because of missing chunk managers #141
Thanks for raising this @ghidalgo3! And for trying out this package :) A minimal reproducible example would be useful if you're up for that. Is this the same error as #114? If so then you also might need to pass … You've made me realise that using …
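For reference, the combine-related options that surface later in this thread (in the `combine_nested` traceback) can also be passed to `xr.concat` directly. A sketch; whether these match the suggestions above is an assumption, and the dataset names are hypothetical:

```python
import xarray as xr

# coords="minimal" only concatenates coordinates that vary along the
# concat dimension; compat="override" skips equality checks and takes
# the first dataset's value for everything else. dayl_1980 / dayl_1981
# are assumed virtual datasets from open_virtual_dataset.
dayl_concat = xr.concat(
    [dayl_1980, dayl_1981],
    dim="time",
    coords="minimal",
    compat="override",
)
```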
Yes, I didn't know where to get those daymet files publicly, but @TomAugspurger helped me out. Sorry! Here is a quick runnable repro:

```python
import virtualizarr
import xarray as xr

def open_daymet_dataset(path) -> xr.Dataset:
    print("Opening", path)
    return virtualizarr.open_virtual_dataset(
        path,
        filetype=virtualizarr.kerchunk.FileType.netcdf4,
        drop_variables=["lambert_conformal_conic"],
        reader_options={})

files = [
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_dayl_1981.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1980.nc",
    "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129/daymet_v4_daily_hi_prcp_1981.nc",
]

datasets = [open_daymet_dataset(path) for path in files]

print("concating")
dayl_concat = xr.concat([datasets[0], datasets[1]], dim="time")
prcp_concat = xr.concat([datasets[2], datasets[3]], dim="time")

print("merging")
merged = xr.merge([dayl_concat, prcp_concat])
```

I don't think it's the same as #114, because if I add … The catalog for these files is here; they are pretty small files, like 4MB each.
I tried all 3 suggestions (…).
So in theory you can actually virtualize and combine all the data in this archive with just a few lines, along the lines of the sketch below. (I blame kerchunk / thredds for the slowness of the ….)
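A sketch of what those few lines could look like: the `xr.combine_nested` call is copied from the traceback below, and it reuses the `open_daymet_dataset` helper from the repro above; the variable list and year range are assumptions (the real archive has more of both):

```python
import xarray as xr

# Assumed subset of the Daymet variables and years.
variables = ["dayl", "prcp"]
years = [1980, 1981]

base = "https://thredds.daac.ornl.gov/thredds/fileServer/ornldaac/2129"

# 2D grid of virtual datasets: the outer list runs over years (to be
# concatenated along "time"), the inner list over variables (to be merged).
vds_grid = [
    [
        open_daymet_dataset(f"{base}/daymet_v4_daily_hi_{var}_{year}.nc")
        for var in variables
    ]
    for year in years
]

# This call is copied from the traceback below.
combined = xr.combine_nested(
    vds_grid,
    concat_dim=["time", None],
    coords="minimal",
    compat="override",
)
```

But it looks like for this data you will run into #5:

```
---------------------------------------------------------------------------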
NotImplementedError Traceback (most recent call last)
Cell In[7], line 1
----> 1 combined = xr.combine_nested(
2 vds_grid,
3 concat_dim=['time', None],
4 coords='minimal',
5 compat='override',
6 )
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:577, in combine_nested(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
574 concat_dim = [concat_dim]
576 # The IDs argument tells _nested_combine that datasets aren't yet sorted
--> 577 return _nested_combine(
578 datasets,
579 concat_dims=concat_dim,
580 compat=compat,
581 data_vars=data_vars,
582 coords=coords,
583 ids=False,
584 fill_value=fill_value,
585 join=join,
586 combine_attrs=combine_attrs,
587 )
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:356, in _nested_combine(datasets, concat_dims, compat, data_vars, coords, ids, fill_value, join, combine_attrs)
353 _check_shape_tile_ids(combined_ids)
355 # Apply series of concatenate or merge operations along each dimension
--> 356 combined = _combine_nd(
357 combined_ids,
358 concat_dims,
359 compat=compat,
360 data_vars=data_vars,
361 coords=coords,
362 fill_value=fill_value,
363 join=join,
364 combine_attrs=combine_attrs,
365 )
366 return combined
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:232, in _combine_nd(combined_ids, concat_dims, data_vars, coords, compat, fill_value, join, combine_attrs)
228 # Each iteration of this loop reduces the length of the tile_ids tuples
229 # by one. It always combines along the first dimension, removing the first
230 # element of the tuple
231 for concat_dim in concat_dims:
--> 232 combined_ids = _combine_all_along_first_dim(
233 combined_ids,
234 dim=concat_dim,
235 data_vars=data_vars,
236 coords=coords,
237 compat=compat,
238 fill_value=fill_value,
239 join=join,
240 combine_attrs=combine_attrs,
241 )
242 (combined_ds,) = combined_ids.values()
243 return combined_ds
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:267, in _combine_all_along_first_dim(combined_ids, dim, data_vars, coords, compat, fill_value, join, combine_attrs)
265 combined_ids = dict(sorted(group))
266 datasets = combined_ids.values()
--> 267 new_combined_ids[new_id] = _combine_1d(
268 datasets, dim, compat, data_vars, coords, fill_value, join, combine_attrs
269 )
270 return new_combined_ids
File ~/Documents/Work/Code/xarray/xarray/core/combine.py:290, in _combine_1d(datasets, concat_dim, compat, data_vars, coords, fill_value, join, combine_attrs)
288 if concat_dim is not None:
289 try:
--> 290 combined = concat(
291 datasets,
292 dim=concat_dim,
293 data_vars=data_vars,
294 coords=coords,
295 compat=compat,
296 fill_value=fill_value,
297 join=join,
298 combine_attrs=combine_attrs,
299 )
300 except ValueError as err:
301 if "encountered unexpected variable" in str(err):
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:276, in concat(objs, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
263 return _dataarray_concat(
264 objs,
265 dim=dim,
(...)
273 create_index_for_new_dim=create_index_for_new_dim,
274 )
275 elif isinstance(first_obj, Dataset):
--> 276 return _dataset_concat(
277 objs,
278 dim=dim,
279 data_vars=data_vars,
280 coords=coords,
281 compat=compat,
282 positions=positions,
283 fill_value=fill_value,
284 join=join,
285 combine_attrs=combine_attrs,
286 create_index_for_new_dim=create_index_for_new_dim,
287 )
288 else:
289 raise TypeError(
290 "can only concatenate xarray Dataset and DataArray "
291 f"objects, got {type(first_obj)}"
292 )
File ~/Documents/Work/Code/xarray/xarray/core/concat.py:662, in _dataset_concat(datasets, dim, data_vars, coords, compat, positions, fill_value, join, combine_attrs, create_index_for_new_dim)
660 result_vars[k] = v
661 else:
--> 662 combined_var = concat_vars(
663 vars, dim, positions, combine_attrs=combine_attrs
664 )
665 # reindex if variable is not present in all datasets
666 if len(variable_index) < concat_index_size:
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:2986, in concat(variables, dim, positions, shortcut, combine_attrs)
2984 return IndexVariable.concat(variables, dim, positions, shortcut, combine_attrs)
2985 else:
-> 2986 return Variable.concat(variables, dim, positions, shortcut, combine_attrs)
File ~/Documents/Work/Code/xarray/xarray/core/variable.py:1737, in Variable.concat(cls, variables, dim, positions, shortcut, combine_attrs)
1735 axis = first_var.get_axis_num(dim)
1736 dims = first_var_dims
-> 1737 data = duck_array_ops.concatenate(arrays, axis=axis)
1738 if positions is not None:
1739 # TODO: deprecate this option -- we don't need it for groupby
1740 # any more.
1741 indices = nputils.inverse_permutation(np.concatenate(positions))
File ~/Documents/Work/Code/xarray/xarray/core/duck_array_ops.py:402, in concatenate(arrays, axis)
400 xp = get_array_namespace(arrays[0])
401 return xp.concat(as_shared_dtype(arrays, xp=xp), axis=axis)
--> 402 return _concatenate(as_shared_dtype(arrays), axis=axis)
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array.py:121, in ManifestArray.__array_function__(self, func, types, args, kwargs)
118 if not all(issubclass(t, ManifestArray) for t in types):
119 return NotImplemented
--> 121 return MANIFESTARRAY_HANDLED_ARRAY_FUNCTIONS[func](*args, **kwargs)
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:110, in concatenate(arrays, axis)
107 raise TypeError()
109 # ensure dtypes, shapes, codecs etc. are consistent
--> 110 _check_combineable_zarr_arrays(arrays)
112 _check_same_ndims([arr.ndim for arr in arrays])
114 # Ensure we handle axis being passed as a negative integer
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:38, in _check_combineable_zarr_arrays(arrays)
34 _check_same_dtypes([arr.dtype for arr in arrays])
36 # Can't combine different codecs in one manifest
37 # see https://github.com/zarr-developers/zarr-specs/issues/288
---> 38 _check_same_codecs([arr.zarray.codec for arr in arrays])
40 # Would require variable-length chunks ZEP
41 _check_same_chunk_shapes([arr.chunks for arr in arrays])
File ~/Documents/Work/Code/virtualizarr/virtualizarr/manifests/array_api.py:59, in _check_same_codecs(codecs)
57 for codec in other_codecs:
58 if codec != first_codec:
---> 59 raise NotImplementedError(
60 "The ManifestArray class cannot concatenate arrays which were stored using different codecs, "
61 f"But found codecs {first_codec} vs {codec} ."
62 "See https://github.com/zarr-developers/zarr-specs/issues/288"
63 )
NotImplementedError: The ManifestArray class cannot concatenate arrays which were stored using different codecs, But found codecs compressor=None filters=[{'id': 'zlib', 'level': 4}] vs compressor=None filters=[{'elementsize': 4, 'id': 'shuffle'}, {'id': 'zlib', 'level': 4}] .See https://github.com/zarr-developers/zarr-specs/issues/288
```
Thanks for looking into this, Tom. I'll stick to variables with the same codec if possible to avoid that issue.
@ghidalgo3 I believe the codec limitation issue is that Daymet slightly changed the compression scheme at some point in their production process and introduced a shuffle filter in the compression chain, so the codec chain varies between dates. I'd need to experiment a bit to see when this was changed, as it doesn't appear to be documented in a place I can find: https://daymet.ornl.gov/overview.
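One possible workaround, sketched under the assumption that every file exposes the variable of interest: inspect each virtual dataset's codec chain (the same `zarray.codec` attribute the error above compares) and only concatenate within groups that share it. The helper name and usage are hypothetical:

```python
from collections import defaultdict

import xarray as xr

def group_by_codec(datasets: list[xr.Dataset], var: str) -> dict:
    """Bucket virtual datasets by the codec chain of one variable."""
    groups = defaultdict(list)
    for ds in datasets:
        # ds[var].data is a ManifestArray; .zarray.codec is the codec
        # chain that _check_same_codecs compares in the traceback above.
        codec = ds[var].data.zarray.codec
        groups[str(codec)].append(ds)
    return groups

# Hypothetical usage: concat along time separately within each codec group.
# groups = group_by_codec(datasets, "dayl")
# per_codec = [xr.concat(group, dim="time") for group in groups.values()]
```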
I'm trying to build a single virtual datastore from a collection of NetCDF files with VirtualiZarr, using the files from the Daymet dataset on the Microsoft Planetary Computer. There is a NetCDF file for each (year, variable) combination, so I'm thinking that to build a single datastore I need to:

1. `xr.concat` on time
2. `xr.merge` on variables

Roughly here's what I'm doing (a condensed sketch follows):
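This mirrors the runnable repro posted earlier in the thread; the per-variable path lists are assumed:

```python
import virtualizarr
import xarray as xr

# One virtual dataset per (year, variable) NetCDF file; dayl_paths and
# prcp_paths are assumed lists of per-year file paths for each variable.
dayl = [virtualizarr.open_virtual_dataset(p) for p in dayl_paths]
prcp = [virtualizarr.open_virtual_dataset(p) for p in prcp_paths]

# Step 1: concat each variable along time.
dayl_concat = xr.concat(dayl, dim="time")
prcp_concat = xr.concat(prcp, dim="time")

# Step 2: merge the per-variable datasets into a single store.
merged = xr.merge([dayl_concat, prcp_concat])
```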
And here is the stack trace from `xr.merge`: …

If this is a known issue, feel free to close this and link to it. If this is new, what would it take to implement a new Chunk Manager? Is there a way to make `merge` work? I'm running `VirtualiZarr` from `main` at commit c3f630bfbb6c5.