Motivation
Accessing variables from parent groups in a tree would be useful. This has come up before in #1982 and xarray-contrib/datatree#297. Here I'm going to summarize some discussion from recent datatree meetings.
A use case is to have common coordinate variables between multiple sub-groups. For example, this multi-resolution datatree has a `time` coordinate that conceptually is common to two groups:
DataTree('None', parent=None)
│   Dimensions:  (time: 4)
│   Coordinates:
│     * time     (time) int64 32B 0 1 2 3
│   Data variables:
│       *empty*
├── DataTree('low')
│       Dimensions:  (x: 3, time: 4)
│       Coordinates:
│         * x        (x) float64 24B 1.0 5.0 9.0
│       Dimensions without coordinates: time
│       Data variables:
│           a        (x, time) int64 96B 0 1 2 3 4 5 6 7 8 9 10 11
└── DataTree('high')
        Dimensions:  (x: 9, time: 4)
        Coordinates:
          * x        (x) float64 72B 1.0 2.0 3.0 4.0 5.0 6.0 7.0 8.0 9.0
        Dimensions without coordinates: time
        Data variables:
            a        (x, time) int64 288B 0 1 2 3 4 5 6 7 8 ... 28 29 30 31 32 33 34 35
It would be useful to be able to access the `time` coordinate variable from either child group, i.e. `dt['/high'].time`.
Indeed, the CF conventions explicitly describe this type of behaviour, in terms of searching for variables outside of the current group:
Search by proximity
A variable or dimension specified with no path (for example, `lat`) refers to the variable or dimension of that name, if there is one, in the referring group. If not, the ancestors of the referring group are searched for it, starting from the direct ancestor and proceeding toward the root group, until it is found.
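To make the quoted rule concrete, here is a minimal sketch of a search-by-proximity lookup. The `Group` class and `find_by_proximity` function are hypothetical stand-ins invented for illustration; they are not part of xarray, datatree, or any CF tooling.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Group:
    """Hypothetical stand-in for a group; not the real DataTree class."""

    name: str
    variables: dict[str, Any] = field(default_factory=dict)
    parent: Optional[Group] = None


def find_by_proximity(group: Group, name: str) -> tuple[Group, Any]:
    """Return (owning group, variable) for the nearest definition of `name`."""
    current: Optional[Group] = group
    while current is not None:
        if name in current.variables:
            return current, current.variables[name]
        current = current.parent  # step one level toward the root
    raise KeyError(name)


root = Group("root", {"time": [0, 1, 2, 3]})
high = Group("high", {"x": [1.0, 2.0, 3.0]}, parent=root)

# "time" is not defined on "high", so the search falls through to the root.
owner, time = find_by_proximity(high, "time")
```

A name defined in the referring group always wins, so `find_by_proximity(high, "x")` returns `high` itself without consulting any ancestor.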
Problem
We could imagine changing the interface of `DataTree` to allow users to access any compatible variables on parent groups, where compatible means alignable.
There are three issues with this:
- Not all users will want to inherit all such variables,
- It would be a breaking change compared to the behaviour of the original datatree package,
- Mapping operations (e.g. `.mean()`) over multiple nodes become really confusing, because copies of the same variable would effectively be present in multiple nodes.
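The third issue can be illustrated with a toy count of how often each variable would be visited when mapping over every node. This is only a sketch: the `TREE` dictionary and both helper functions are made up for illustration, and the alignability check is skipped entirely.

```python
from __future__ import annotations

from collections import Counter

# Toy tree: group path -> (parent path or None, names of local variables)
TREE = {
    "/": (None, {"time"}),
    "/low": ("/", {"x", "a"}),
    "/high": ("/", {"x", "a"}),
}


def visible_names(path: str, inherit: bool) -> set[str]:
    """Names seen at `path`, optionally including those of all ancestors."""
    parent, local = TREE[path]
    names = set(local)
    while inherit and parent is not None:
        # Look up the parent: take its names, then continue with its own parent.
        parent, inherited = TREE[parent]
        names |= inherited
    return names


def count_visits(inherit: bool) -> Counter:
    """Count how many nodes expose each variable name during a full-tree map."""
    counts: Counter = Counter()
    for path in TREE:
        counts.update(visible_names(path, inherit))
    return counts


local_counts = count_visits(inherit=False)
inherited_counts = count_visits(inherit=True)
```

Without inheritance `time` is visited once, at the root. With inheritance it is visited once per node, so a reduction such as `.mean()` would be applied to the same data three times in this tree.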
Proposal
Let me make a concrete feature proposal for discussion, which has some specific features:
- Keep `.ds`, `.__getitem__` etc. on `DataTree` as-is. This means no breaking of backwards compatibility. This also means that we don't have to wait to implement all the details of this before releasing datatree in xarray `main`.
- A clear definition of "compatible variables" for inheritance. These are alignable variables that exist on a parent (or grandparent etc.). Q: Should these be just coordinate variables? Or all variables?
- Add additional API which allows access to inherited variables, via a new `.inherit` accessor on `DataTree` objects. (The name is not great, please feel free to suggest alternatives.)
  - Whilst `dt[...]` will never give access to inherited vars, `dt.inherit[...]` would allow `__getitem__` access to inherited vars
  - `dt.inherit.ds` would return a `DatasetView` of that node with extra inherited variables in it
  - `dt.inherit.to_dataset()` -> `xr.Dataset` containing inherited vars
  - Explicit API for propagating / shallow-copying all variables to child nodes? `dt.inherit()`? -> `DataTree`
- Don't change `map_over_subtree` (again for backwards compatibility)
  - `map_over_inherited_subtree` isolates the conceptual issues of mapping over a tree with inherited variables
  - issues: e.g. map over and see the same variable multiple times (in its "local" group and in all its child groups)
This will be a new feature, to be done in a separate release (i.e. no blocker right now)
Implementation
`dt.inherit` returns an `InheritedNode`, which at construction time creates and caches a mapping of all inherited variables (`._inherited_variables`). This then acts like a normal `DataTree` node except that it consults the inherited variables instead of the normal list of variables.
Creating the list of inherited variables is done by walking up the tree from the current node, examining new variables as they are encountered.
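A rough sketch of that design, using toy classes rather than real xarray objects, might look like the following. `Variable`, `Node` and `is_compatible` are hypothetical, and the compatibility rule is an assumption: it only compares dimension sizes, whereas real alignment would also compare index values.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Variable:
    """Toy variable: only dimension names and their sizes."""

    sizes: dict[str, int]


@dataclass
class Node:
    """Hypothetical stand-in for a DataTree node."""

    variables: dict[str, Variable] = field(default_factory=dict)
    parent: Optional[Node] = None

    @property
    def sizes(self) -> dict[str, int]:
        out: dict[str, int] = {}
        for var in self.variables.values():
            out.update(var.sizes)
        return out

    @property
    def inherit(self) -> InheritedNode:
        return InheritedNode(self)


def is_compatible(var: Variable, sizes: dict[str, int]) -> bool:
    """Assumed rule: every dimension shared with the node has the same size."""
    return all(sizes.get(dim, size) == size for dim, size in var.sizes.items())


class InheritedNode:
    """View of a node that also exposes compatible ancestor variables."""

    def __init__(self, node: Node) -> None:
        self._node = node
        self._inherited_variables: dict[str, Variable] = {}
        sizes = node.sizes
        seen = set(node.variables)  # nearer definitions shadow farther ones
        ancestor = node.parent
        while ancestor is not None:  # walk up toward the root
            for name, var in ancestor.variables.items():
                if name in seen:
                    continue
                seen.add(name)
                if is_compatible(var, sizes):
                    self._inherited_variables[name] = var
            ancestor = ancestor.parent

    def __getitem__(self, name: str) -> Variable:
        if name in self._node.variables:
            return self._node.variables[name]
        return self._inherited_variables[name]


root = Node({"time": Variable({"time": 4}), "lat": Variable({"x": 5})})
high = Node({"x": Variable({"x": 9}), "a": Variable({"x": 9, "time": 4})}, parent=root)

view = high.inherit  # high itself is left untouched
```

Here `view["time"]` resolves to the root's variable, while `lat` is not inherited because its `x` dimension (size 5) conflicts with the size-9 `x` on `high`.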
Q: Does this design handle coordinate names?
EDIT: Actually there's an even simpler idea: `dt.inherit` -> `DataTree` which has a shallow copy of all compatible variables inherited onto that node. Then `.ds`, `.__getitem__` etc. will automatically behave as expected, as you will just have a new `DataTree` object with more valid keys.
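The simpler variant could be sketched like this, again with a toy `Node` class and a hypothetical `with_inherited` helper standing in for `dt.inherit`. The compatibility check is omitted for brevity, so this version inherits every ancestor variable that is not shadowed by a nearer definition.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Node:
    """Hypothetical stand-in for a DataTree node."""

    variables: dict[str, Any] = field(default_factory=dict)
    parent: Optional[Node] = None


def with_inherited(node: Node) -> Node:
    """Return a new node whose variables include those of all ancestors."""
    merged = dict(node.variables)  # new dict, same variable objects (shallow)
    ancestor = node.parent
    while ancestor is not None:
        for name, var in ancestor.variables.items():
            merged.setdefault(name, var)  # nearer definitions take precedence
        ancestor = ancestor.parent
    return Node(merged, parent=node.parent)


time = [0, 1, 2, 3]
root = Node({"time": time})
high = Node({"x": [1.0, 2.0]}, parent=root)

flat = with_inherited(high)  # ordinary node, just with more valid keys
```

Because the result is a plain node, no special-casing of lookups is needed, and the original `high` node keeps only its local variables.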
Describe alternatives you've considered
- Not add any support for inheriting variables
That's what we currently have, and with this proposal we could eventually remove it if it turned out no-one liked it.
- Integrate support into the existing API (i.e. change `dt.__getitem__` to access inherited variables)
It's not possible to do this without breaking changes. It's also not clear that there is a general one-size-fits-all answer to when variables should or shouldn't be inherited. This proposal provides both behaviours.
- Allow users to change behaviour of objects
Some kind of switch (on the specific object instances, globally, or with a context manager) could be used to switch between the two behaviours. But this seems extremely error-prone, and means that user code becomes ambiguous without knowing the state of the switch.
cc @shoyer @keewis @flamingbear @owenlittlejohns @eni-awowale
also @alexamici @benbovy I would love to hear your thoughts too.