What is your issue?
Should coordinate variables be inherited between different levels of an Xarray DataTree?
The DataTree object is intended to represent hierarchical groups of data in Xarray, similar to the role of sub-directories in a filesystem or HDF5/netCDF4 groups. A key design question is if/how to enforce coordinate consistency between different levels of a DataTree hierarchy.
As a concrete example of how enforcing coordinate consistency could be useful, consider the following hypothetical DataTree, representing a mix of weather data and satellite images:
Here there are four different coordinate variables, which apply to variables in the DataTree in different ways:
- time is a shared coordinate used by both weather and satellite variables
- station is used only for weather variables
- x and y are only used for satellite images
In this data model, coordinate variables are inherited by descendant nodes, which means that variables at different levels of a hierarchical DataTree are always aligned. Placing the time variable at the root node automatically indicates that it applies to all descendant nodes. Similarly, station is in the base weather_data node, because it applies to all weather variables, both directly in weather_data and in the temperature sub-tree. Accessing any of the lower level trees as an xarray.Dataset would automatically include coordinates from higher levels (e.g., time).
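To make the inheritance rule concrete, here is a minimal pure-Python sketch. It is a toy model, not the xarray or DataTree API: the Node class, its add_child and all_coords methods, the satellite_image group name, and the coordinate values are all assumptions made up for illustration.

```python
class Node:
    """Toy stand-in for one DataTree node; not the xarray API."""

    def __init__(self, coords=None):
        self.coords = dict(coords or {})  # coordinates defined at this level
        self.parent = None
        self.children = {}

    def add_child(self, name, coords=None):
        """Create a child node, attach it under `name`, and return it."""
        child = Node(coords)
        child.parent = self
        self.children[name] = child
        return child

    def all_coords(self):
        """Own coordinates plus everything inherited from ancestors."""
        inherited = self.parent.all_coords() if self.parent else {}
        return {**inherited, **self.coords}


# Hypothetical tree mirroring the example: time at the root,
# station under weather_data, x and y under the satellite group.
root = Node({"time": [0, 1, 2]})
weather = root.add_child("weather_data", {"station": ["a", "b"]})
temperature = weather.add_child("temperature")
satellite = root.add_child("satellite_image", {"x": [0, 1], "y": [0, 1]})

print(sorted(temperature.all_coords()))  # ['station', 'time']
print(sorted(satellite.all_coords()))    # ['time', 'x', 'y']
```

The temperature node defines no coordinates of its own, yet it sees both time and station, which is the behavior described above for accessing a lower-level tree as a dataset.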
In an alternative data model, coordinate variables at every level of a DataTree are independent. This is the model currently implemented in the experimental DataTree project. To represent the same data, coordinate variables would need to be duplicated alongside data variables at every level of the hierarchy:
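A rough sketch of what that duplication implies, again in plain Python rather than the DataTree API. The group paths, the coords_agree helper, and the values are hypothetical; the point is only that each group carries its own copy of time, so agreement has to be checked by hand.

```python
def coords_agree(groups, name):
    """Return True if every group that defines `name` holds identical values."""
    values = [coords[name] for coords in groups.values() if name in coords]
    return all(v == values[0] for v in values)


time = [0, 1, 2]

# Each group stores an independent copy of every coordinate it needs.
groups = {
    "/weather_data": {"time": list(time), "station": ["a", "b"]},
    "/weather_data/temperature": {"time": list(time), "station": ["a", "b"]},
    "/satellite_image": {"time": list(time), "x": [0, 1], "y": [0, 1]},
}
consistent_before = coords_agree(groups, "time")

# Nothing in the independent model stops one copy from drifting.
groups["/satellite_image"]["time"] = [0, 1, 2, 3]
consistent_after = coords_agree(groups, "time")

print(consistent_before, consistent_after)  # True False
```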
Which data model to prefer depends on which of two considerations we value more:
- Consistency: Automatically inherited coordinates will allow for DataTree objects with fewer redundant variables, which are easier to understand at a glance, similar to the role of the shared coordinate system on xarray.Dataset. You don’t need to separately check the time coordinates on the weather and satellite data to know that they are the same. Alignment, including matching coordinates and dimension sizes, is enforced by the data model.
- Flexibility: Enforcing consistency limits how you can organize data, because conflicting coordinates at different levels of a DataTree can no longer be represented in Xarray’s data model. In particular, some valid multi-group netCDF4 files or Zarr stores could not be loaded into a single DataTree object.
As a concrete example of what we lose in flexibility, consider the following two representations of a multiscale image pyramid, where each level of zoom has different x and y coordinates:

The version that places the base image at the root of the hierarchy would not be allowed in the inherited coordinates data model, because there would be conflicting x and y coordinates (or dimension sizes) between the root and child nodes. Instead, different levels of zoom would need to be placed under different groups (zoom_1x, zoom_2x, etc.).
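The restriction can be illustrated with a small validation sketch. This is an assumption about how such a check might look, not how Xarray implements it: the nested-dict layout, the check_inherited_sizes function, and the image sizes (1000 and 500) are invented for the example.

```python
def check_inherited_sizes(tree, inherited=None, path="/"):
    """Raise ValueError if a node's dimension size conflicts with an ancestor's.

    `tree` is a hypothetical nested dict: {"sizes": {...}, "children": {...}}.
    """
    inherited = dict(inherited or {})  # copy so siblings do not affect each other
    for dim, size in tree.get("sizes", {}).items():
        if dim in inherited and inherited[dim] != size:
            raise ValueError(
                f"{path}: {dim}={size} conflicts with inherited {dim}={inherited[dim]}"
            )
        inherited[dim] = size
    for name, child in tree.get("children", {}).items():
        check_inherited_sizes(child, inherited, f"{path.rstrip('/')}/{name}")


# Base image at the root: children inherit x and y, so smaller levels conflict.
base_at_root = {
    "sizes": {"x": 1000, "y": 1000},
    "children": {"zoom_2x": {"sizes": {"x": 500, "y": 500}}},
}

# Every zoom level in its own sibling group: nothing is inherited from the root.
siblings = {
    "children": {
        "zoom_1x": {"sizes": {"x": 1000, "y": 1000}},
        "zoom_2x": {"sizes": {"x": 500, "y": 500}},
    }
}

check_inherited_sizes(siblings)  # passes silently

try:
    check_inherited_sizes(base_at_root)
    rejected = False
except ValueError as err:
    rejected = True
    print(err)
```

Only the layout with sibling groups passes, because sibling nodes never constrain one another; a size mismatch matters only along a parent-to-child path.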
As we consider making this change to the (as yet unreleased) DataTree object in Xarray, I have two questions for prospective DataTree users:
- Do you agree that giving up the flexibility of independent coordinates in favor of a data model that bakes in more consistency guarantees is a good idea?
- Do you have existing uses for DataTree objects or multi-group netCDF/Zarr files that would be positively or negatively impacted by this change?
CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale