Description
I am using xarray for processing geospatial data and have encountered two major challenges with existing data structures in xarray:
- Data arrays stored in an xarray Dataset cannot be grouped into hierarchical levels/logical subsets to reflect the internal organisation of the data. This makes it difficult to identify and process a subset of the data variables that pertain to a specific problem.
- When two data arrays that share a dimension but have different coordinate values along it are merged into a Dataset, the union of the coordinate values from the two arrays becomes the new coordinate set for that dimension. Consequently, wherever a variable in the dataset has no value at a coordinate, `nan` is used as a substitute, which wastes memory.
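To make the second point concrete, here is a minimal sketch of the padding behaviour. The array names `a` and `b` and their values are made up for illustration, and `join="outer"` is passed explicitly so the union behaviour does not depend on the installed xarray version's defaults.

```python
import xarray as xr

# Two arrays sharing the "time" dimension but with disjoint coordinate values
# (names and values are illustrative only).
a = xr.DataArray([1.0, 2.0, 3.0], dims="time", coords={"time": [0, 1, 2]}, name="a")
b = xr.DataArray([10.0, 20.0], dims="time", coords={"time": [5, 6]}, name="b")

# An outer join takes the union of the "time" coordinates.
ds = xr.merge([a, b], join="outer")

# Each variable is now stored on all 5 time steps, with NaN where it had no data.
n_time = ds.sizes["time"]
n_missing_a = int(ds["a"].isnull().sum())
n_missing_b = int(ds["b"].isnull().sum())
print(n_time, n_missing_a, n_missing_b)
```

In this toy case 5 of the 10 stored values are padding; with many arrays on mostly disjoint coordinates the padded share grows accordingly.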
I would like to suggest a tree-based data structure for xarray in which the leaves store individual data arrays and the other nodes store the hierarchical information. Since data arrays are stored independently, each dimension only needs to be associated with coordinate values that are valid for that data array.
To meet these requirements, I have implemented a data structure that also supports the following capabilities:
- Standard xarray methods can be applied to the tree at all hierarchical levels, i.e., when a function is called at a hierarchical level, it is mapped over all data arrays that occur at the leaves under the corresponding node. For example, say I have a tree object (let's call it `dt`) with child nodes `weather`, `satellite image` and `population`. Each of these nodes has data arrays/subtrees under it. The mean over time of all data variables associated with weather can be obtained using `dt.weather.mean('time')`, which applies the function to `sea_surface_temperature`, `dew_point_temperature`, `wind_speed` and `pressure`.
- It can be encoded into the netCDF format, like xarray Datasets.
- It supports item assignment at all hierarchical levels.
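The mapping and item-assignment behaviour described above could be sketched roughly as follows. This is not the implementation referred to in this issue: the class `TreeNode` and its method `map_over_leaves` are hypothetical names chosen for illustration, and netCDF encoding is left out entirely.

```python
import xarray as xr


class TreeNode:
    """Hypothetical node whose children are TreeNode subtrees or DataArray leaves."""

    def __init__(self, **children):
        self.children = dict(children)

    def __getattr__(self, name):
        # Only reached when normal attribute lookup fails, so methods still work.
        children = self.__dict__.get("children", {})
        if name in children:
            return children[name]
        raise AttributeError(name)

    def __getitem__(self, key):
        return self.children[key]

    def __setitem__(self, key, value):
        # Item assignment works the same way at every level of the tree.
        self.children[key] = value

    def map_over_leaves(self, func):
        """Apply func to every DataArray below this node, keeping the hierarchy."""
        result = TreeNode()
        for name, child in self.children.items():
            if isinstance(child, TreeNode):
                result[name] = child.map_over_leaves(func)
            else:
                result[name] = func(child)
        return result

    def mean(self, dim):
        return self.map_over_leaves(lambda da: da.mean(dim))


# Leaves keep their own "time" coordinates, so nothing is padded with NaN.
sst = xr.DataArray([280.0, 282.0, 284.0], dims="time", coords={"time": [0, 1, 2]})
wind = xr.DataArray([4.0, 6.0], dims="time", coords={"time": [10, 11]})

dt = TreeNode(weather=TreeNode(sea_surface_temperature=sst, wind_speed=wind))
dt["population"] = TreeNode()  # assign a new (empty) subtree at the root

means = dt.weather.mean("time")
print(float(means.sea_surface_temperature), float(means.wind_speed))
```

A real version would need to forward the full set of xarray methods rather than only `mean`, and would have to decide what happens when a leaf lacks the requested dimension.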
I would like to know whether such a data structure could be introduced in xarray, and what challenges that would involve.