open_dict_of_datasets function to open any file containing nested groups #9137

Closed
@TomNicholas

Description

Is your feature request related to a problem?

In #9077 (comment) I suggested the idea of a function which could open any netCDF file with groups as a dictionary mapping group path strings to xr.Dataset objects.

The motivation is as follows:

  • People want the new xarray.DataTree class to support inheriting coordinates from parent groups,
  • This can only be done if the coordinates align with the variables in the child group (i.e. using xr.align),
  • The best time to enforce this alignment is at DataTree construction time,
  • This requirement is not enforced in the netCDF/Zarr model, so this would mean some files can no longer be opened by open_datatree directly, as doing so would raise an alignment error,
  • But we still really want users to have some way to open an arbitrary file with xarray and see what's inside, including displaying all the groups (see #4840, "Opening a dataset doesn't display groups").
  • A simpler intermediate structure of a dictionary mapping group paths to xarray.Dataset objects doesn't enforce alignment, so can represent any file.
  • We should add a new opening function to allow any file to be opened as this dict-of-datasets structure.
  • Users can then use this to inspect "untidy" data, and make changes to the dict returned before creating an aligned DataTree object via DataTree.from_dict if they like.
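To make the workflow above concrete, here is a minimal sketch of a dict-of-datasets holding a child group that does not align with its parent, followed by one possible fix. It assumes that "alignment" means strict alignment via `xr.align(..., join="exact")`; the helper `is_alignable` and the toy datasets are hypothetical names made up for illustration, not part of xarray.

```python
import numpy as np
import xarray as xr

# Two toy groups that share a dimension name "x" but have different labels.
parent = xr.Dataset({"a": ("x", np.arange(3.0))}, coords={"x": [0, 1, 2]})
child = xr.Dataset({"b": ("x", np.arange(2.0))}, coords={"x": [10, 20]})

# A plain dict enforces nothing, so it can hold both groups as they are.
groups = {"/": parent, "/child": child}


def is_alignable(parent_ds, child_ds):
    """Hypothetical helper: True if the two datasets align exactly."""
    try:
        # Assumption: inheritance requires exact alignment of shared indexes.
        xr.align(parent_ds, child_ds, join="exact")
    except ValueError:
        return False
    return True


before = is_alignable(groups["/"], groups["/child"])

# One possible fix: rename the conflicting dimension in the child group
# so that it no longer clashes with the parent's "x" coordinate.
groups["/child"] = groups["/child"].rename({"x": "x_child"})
after = is_alignable(groups["/"], groups["/child"])

print(before, after)
```

After a fix like this, the edited dict would be the thing to pass to `DataTree.from_dict`. Which fix is appropriate (renaming, reindexing, dropping a coordinate) depends on the data.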

Describe the solution you'd like

Add a function like this:

def open_dict_of_datasets(
    filename_or_obj: str | os.PathLike[Any] | BufferedIOBase | AbstractDataStore,
    engine: T_Engine = None,
    group: str | None = None,
    **kwargs,
) -> dict[str, Dataset]:
    """
    Open and decode a file or file-like object, creating a dictionary containing one xarray Dataset for each group in the file.

    Useful when you have e.g. a netCDF file containing many groups, some of which are not alignable with their parents and so the file cannot be opened directly with ``open_datatree``.

    It is encouraged to use this function to inspect your data, then make the necessary changes to make the structure coercible to a `DataTree` object before calling `DataTree.from_dict()` and proceeding with your analysis.

    Parameters
    ----------
    filename_or_obj : str, Path, file-like, or DataStore
        Strings and Path objects are interpreted as a path to a netCDF file or Zarr store.
    engine : str, optional
        Xarray backend engine to use. Valid options include `{"netcdf4", "h5netcdf", "zarr"}`.
    group : str, optional
        Group to use as the root group to start reading from. Groups above this root group will not be included in the output.
    **kwargs : dict
        Additional keyword arguments passed to :py:func:`~xarray.open_dataset` for each group.

    Returns
    -------
    dict[str, xarray.Dataset]

    See Also
    --------
    open_datatree()
    DataTree.from_dict()
    """
    ...

This would live inside backends/api.py, and be exposed publicly as a top-level function along with the rest of open_datatree/DataTree etc. as part of #9033.

The actual implementation could re-use the code for opening many groups of the same file performantly from #9014. Indeed we could add an open_dict_of_datasets method to the BackendEntrypoint class, which uses pretty much the same code as the existing open_datatree method added in #9014 but just doesn't actually create a DataTree object.

Describe alternatives you've considered

Really the main alternative to this is not to have coordinate inheritance in DataTree at all (see #9077), in which case open_datatree would be sufficient to open any file.


The name of the function is up for debate. I prefer nothing with the word "datatree" in it since this doesn't actually create a DataTree object at any point. (In fact we could and perhaps should have implemented this function years ago, even without the new DataTree class.) The reason for not calling it "open_as_dict_of_datasets" is that we don't use "as" in the existing open_dataset/open_dataarray etc.

Additional context

cc @eni-awowale @flamingbear @owenlittlejohns @keewis @shoyer @autydp
