Coordinate inheritance for xarray.DataTree #9077

Closed
@shoyer

What is your issue?

Should coordinate variables be inherited between different levels of an Xarray DataTree?

The DataTree object is intended to represent hierarchical groups of data in Xarray, similar to the role of sub-directories in a filesystem or HDF5/netCDF4 groups. A key design question is if/how to enforce coordinate consistency between different levels of a DataTree hierarchy.

As a concrete example of how enforcing coordinate consistency could be useful, consider the following hypothetical DataTree, representing a mix of weather data and satellite images:
[Figure: example DataTree with inherited coordinates, mixing weather data and satellite images]

Here there are four different coordinate variables, which apply to variables in the DataTree in different ways:

  • time is a shared coordinate used by both weather and satellite variables
  • station is used only for weather variables
  • x and y are used only for satellite images

In this data model, coordinate variables are inherited by descendant nodes, which means that variables at different levels of a hierarchical DataTree are always aligned. Placing the time variable at the root node automatically indicates that it applies to all descendant nodes. Similarly, station is in the base weather_data node, because it applies to all weather variables, both directly in weather_data and in the temperature sub-tree. Accessing any of the lower-level trees as an xarray.Dataset would automatically include coordinates from higher levels (e.g., time).
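The lookup rule described above can be sketched with a toy model in plain Python. This is not xarray's API: the `Node` class, the `all_coords` method, the `satellite_image` group name, and all coordinate values are assumptions made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy stand-in for one DataTree group (hypothetical, not xarray's API)."""

    name: str
    coords: dict[str, list] = field(default_factory=dict)
    parent: Node | None = None

    def all_coords(self) -> dict[str, list]:
        """Return this node's coordinates plus those inherited from ancestors."""
        inherited = self.parent.all_coords() if self.parent else {}
        # Coordinates defined on the node itself are merged on top of inherited ones.
        return {**inherited, **self.coords}


# Illustrative values only; the real tree in the figure holds actual data.
root = Node("/", coords={"time": [0, 1, 2]})
weather = Node("weather_data", coords={"station": ["a", "b"]}, parent=root)
temperature = Node("temperature", parent=weather)
satellite = Node("satellite_image", coords={"x": [0, 1], "y": [0, 1]}, parent=root)

print(sorted(temperature.all_coords()))  # ['station', 'time']
print(sorted(satellite.all_coords()))    # ['time', 'x', 'y']
```

The temperature group defines no coordinates of its own, yet it sees both time (from the root) and station (from weather_data), while the satellite group never sees station.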

In an alternative data model, coordinate variables at every level of a DataTree are independent. This is the model currently implemented in the experimental DataTree project. To represent the same data, coordinate variables would need to be duplicated alongside data variables at every level of the hierarchy:
[Figure: the same DataTree with coordinate variables duplicated at every level]
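With independent coordinates, nothing in the data model guarantees that the duplicated copies agree, so a user would have to check that by hand. A rough sketch of such a check, again as a plain-Python toy rather than xarray code (the `inconsistent_coords` helper, the group paths, and the values are hypothetical):

```python
# Each group carries its own full copy of every coordinate it needs.
groups = {
    "/weather_data": {"time": [0, 1, 2], "station": ["a", "b"]},
    "/weather_data/temperature": {"time": [0, 1, 2], "station": ["a", "b"]},
    "/satellite_image": {"time": [0, 1, 2], "x": [0, 1], "y": [0, 1]},
}


def inconsistent_coords(groups):
    """Return the names of coordinates whose values differ between groups."""
    seen = {}
    conflicts = set()
    for coords in groups.values():
        for name, values in coords.items():
            if name in seen and seen[name] != values:
                conflicts.add(name)
            seen.setdefault(name, values)
    return conflicts


print(inconsistent_coords(groups))  # set()

# Nothing stops one copy from drifting away from the others.
groups["/satellite_image"]["time"] = [0, 1, 3]
print(inconsistent_coords(groups))  # {'time'}
```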

Which data model to prefer depends on which of two considerations we value more:

  1. Consistency: Automatically inherited coordinates allow DataTree objects to carry fewer redundant variables, which makes them easier to understand at a glance, much as the shared coordinate system does for xarray.Dataset. You don’t need to separately check the time coordinates on the weather and satellite data to know that they are the same. Alignment, including matching coordinates and dimension sizes, is enforced by the data model.
  2. Flexibility: Enforcing consistency limits how you can organize data, because conflicting coordinates at different levels of a DataTree can no longer be represented in Xarray’s data model. In particular, some valid multi-group netCDF4 files or Zarr stores could not be loaded into a single DataTree object.

As a concrete example of what we lose in flexibility, consider the following two representations of a multiscale image pyramid, where each level of zoom has different x and y coordinates:

[Figure: two representations of a multiscale image pyramid]

The version that places the base image at the root of the hierarchy would not be allowed in the inherited coordinates data model, because there would be conflicting x and y coordinates (or dimension sizes) between the root and child nodes. Instead, different levels of zoom would need to be placed under different groups (zoom_1x, zoom_2x, etc).
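To make the restriction concrete, here is a small plain-Python sketch of the kind of check an inherited-coordinates model implies. It is a toy, not xarray's implementation: the `is_ancestor` and `find_conflicts` helpers and the coordinate values are hypothetical, and it compares coordinate values only, not dimension sizes.

```python
def is_ancestor(ancestor: str, path: str) -> bool:
    """True if the group at `ancestor` lies strictly above the group at `path`."""
    if ancestor == path:
        return False
    prefix = ancestor if ancestor.endswith("/") else ancestor + "/"
    return path.startswith(prefix)


def find_conflicts(tree):
    """List (ancestor, descendant, coord) triples where a descendant
    redefines an ancestor's coordinate with different values."""
    conflicts = []
    for parent_path, parent_coords in tree.items():
        for child_path, child_coords in tree.items():
            if not is_ancestor(parent_path, child_path):
                continue
            for name, values in child_coords.items():
                if name in parent_coords and parent_coords[name] != values:
                    conflicts.append((parent_path, child_path, name))
    return conflicts


# Base image at the root: the child group redefines x and y.
base_at_root = {
    "/": {"x": [0, 1, 2, 3], "y": [0, 1, 2, 3]},
    "/zoom_2x": {"x": [0, 2], "y": [0, 2]},
}

# Every zoom level in its own sibling group: nothing is inherited.
grouped = {
    "/": {},
    "/zoom_1x": {"x": [0, 1, 2, 3], "y": [0, 1, 2, 3]},
    "/zoom_2x": {"x": [0, 2], "y": [0, 2]},
}

print(find_conflicts(base_at_root))  # [('/', '/zoom_2x', 'x'), ('/', '/zoom_2x', 'y')]
print(find_conflicts(grouped))       # []
```

Sibling groups are never compared with each other, which is why moving each zoom level into its own group avoids the conflict.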

As we consider making this change to the (as yet unreleased) DataTree object in Xarray, I have two questions for prospective DataTree users:

  1. Do you agree that giving up the flexibility of independent coordinates in favor of a data model that bakes in more consistency guarantees is a good idea?
  2. Do you have existing uses for DataTree objects or multi-group netCDF/Zarr files that would be positively or negatively impacted by this change?

xref: #9063, #9056

CC @TomNicholas, @keewis, @owenlittlejohns, @flamingbear, @eni-awowale
