Description
Is your feature request related to a problem?
In #9077 (comment) I suggested the idea of a function which could open any netCDF file with groups as a dictionary mapping group path strings to `xr.Dataset` objects.
The motivation is as follows:
- People want the new `xarray.DataTree` class to support inheriting coordinates from parent groups.
- This can only be done if the coordinates align with the variables in the child group (i.e. using `xr.align`).
- The best time to enforce this alignment is at `DataTree` construction time.
- This requirement is not enforced in the netCDF/Zarr model, so this would mean some files can no longer be opened by `open_datatree` directly, as doing so would raise an alignment error.
- But we still really want users to have some way to open an arbitrary file with xarray and see what's inside (including displaying all the groups; see "Opening a dataset doesn't display groups", #4840).
- A simpler intermediate structure of a dictionary mapping group paths to `xarray.Dataset` objects doesn't enforce alignment, so it can represent any file.
- We should add a new opening function to allow any file to be opened as this dict-of-datasets structure.
- Users can then use this to inspect "untidy" data, and make changes to the returned dict before creating an aligned `DataTree` object via `DataTree.from_dict` if they like.
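The workflow in the list above can be sketched with in-memory data. This is only an illustration under stated assumptions: the `groups` dict is a hand-built stand-in for what the proposed function might return, `is_aligned` is a hypothetical helper, and reindexing the child onto the parent's `x` coordinate is just one possible repair. The last step assumes an xarray version that exposes `xr.DataTree`.

```python
import xarray as xr

# Hand-built stand-in for a dict of datasets read from an "untidy" file:
# the child group's "x" coordinate does not match the parent's.
parent = xr.Dataset(coords={"x": [0, 1, 2]})
child = xr.Dataset({"t": ("x", [10.0, 20.0])}, coords={"x": [0, 1]})
groups = {"/": parent, "/child": child}


def is_aligned(parent_ds: xr.Dataset, child_ds: xr.Dataset) -> bool:
    """Hypothetical helper: True if both datasets align exactly."""
    try:
        xr.align(parent_ds, child_ds, join="exact")
    except ValueError:
        return False
    return True


# Inspect: which groups would block coordinate inheritance from the root?
misaligned = [
    path
    for path, ds in groups.items()
    if path != "/" and not is_aligned(groups["/"], ds)
]

# Repair (one option among many): reindex onto the root's "x" coordinate,
# which pads the missing position with NaN.
for path in misaligned:
    groups[path] = groups[path].reindex(x=groups["/"]["x"])

# Build the tree only once every group aligns (assumes xr.DataTree exists
# in the installed xarray version).
if hasattr(xr, "DataTree"):
    tree = xr.DataTree.from_dict(groups)
```

Whether to reindex, rename dimensions, or drop a group altogether depends on the data; the point is that the dict stage leaves that choice to the user.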
Describe the solution you'd like
Add a function like this:
```python
def open_dict_of_datasets(
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    engine: T_Engine = None,
    group: Optional[str] = None,
    **kwargs,
) -> dict[str, Dataset]:
    """
    Open and decode a file or file-like object, creating a dictionary containing one xarray Dataset for each group in the file.

    Useful when you have e.g. a netCDF file containing many groups, some of which are not alignable with their parents and so the file cannot be opened directly with ``open_datatree``.

    It is encouraged to use this function to inspect your data, then make the necessary changes to make the structure coercible to a `DataTree` object before calling `DataTree.from_dict()` and proceeding with your analysis.

    Parameters
    ----------
    filename_or_obj : str, Path, file-like, or DataStore
        Strings and Path objects are interpreted as a path to a netCDF file or Zarr store.
    engine : str, optional
        Xarray backend engine to use. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
    group : str, optional
        Group to use as the root group to start reading from. Groups above this root group will not be included in the output.
    **kwargs : dict
        Additional keyword arguments passed to :py:func:`~xarray.open_dataset` for each group.

    Returns
    -------
    dict[str, xarray.Dataset]

    See Also
    --------
    open_datatree()
    DataTree.from_dict()
    """
    ...
```
This would live inside `backends/api.py`, and be exposed publicly as a top-level function along with the rest of `open_datatree`/`DataTree` etc. as part of #9033.
The actual implementation could re-use the code for opening many groups of the same file performantly from #9014. Indeed we could add an `open_dict_of_datasets` method to the `BackendEntrypoint` class, which uses pretty much the same code as the existing `open_datatree` method added in #9014 but just doesn't actually create a `DataTree` object.
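For illustration only, here is a naive sketch of the dict-building step. It is not the code from #9014: it assumes the group paths are already known (the hypothetical `group_paths` argument) and calls `xarray.open_dataset` once per group, which is less performant than sharing one open file handle as a backend method could. The `opener` parameter is also an assumption, added so the sketch can be exercised without a real file.

```python
from collections.abc import Callable, Iterable
from typing import Any

import xarray as xr


def open_dict_of_datasets_sketch(
    filename_or_obj: Any,
    group_paths: Iterable[str],
    opener: Callable[..., xr.Dataset] = xr.open_dataset,
    **kwargs: Any,
) -> dict[str, xr.Dataset]:
    """Open each listed group independently; no alignment is enforced."""
    return {
        path: opener(filename_or_obj, group=path, **kwargs)
        for path in group_paths
    }
```

With a real file this might be called as `open_dict_of_datasets_sketch("example.nc", ["/", "/child"], engine="netcdf4")`, where the file name and group paths are placeholders. A real implementation would discover the groups itself rather than take them as an argument.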
Describe alternatives you've considered
Really the main alternative to this is not to have coordinate inheritance in `DataTree` at all (see #9077), in which case `open_datatree` would be sufficient to open any file.
The name of the function is up for debate. I prefer nothing with the word "datatree" in it, since this doesn't actually create a `DataTree` object at any point. (In fact we could and perhaps should have implemented this function years ago, even without the new `DataTree` class.) The reason for not calling it `open_as_dict_of_datasets` is that we don't use "as" in the existing `open_dataset`/`open_dataarray` etc.
Additional context
cc @eni-awowale @flamingbear @owenlittlejohns @keewis @shoyer @autydp