Docs page on interoperability #7992

Merged Oct 26, 2023 (32 commits). Showing changes from 23 commits.

Commits
198f67b
add page on internal design
TomNicholas Jul 17, 2023
9fe6635
add xarray-datatree to intersphinx mapping
TomNicholas Jul 17, 2023
c231f58
typo
TomNicholas Jul 17, 2023
957675d
add subheadings to the accessors page
TomNicholas Jul 17, 2023
9fd2fc5
Revert "add page on internal design"
TomNicholas Jul 17, 2023
04fcab2
rename page on variables
TomNicholas Jul 17, 2023
40fc8c5
whatsnew
TomNicholas Jul 17, 2023
011ab25
page on interoperability
TomNicholas Jul 17, 2023
869363a
add interoperability page to index
TomNicholas Jul 17, 2023
3abe029
fix whatsnew
TomNicholas Jul 17, 2023
39437fb
Merge branch 'main' into docs_internal_design
TomNicholas Jul 17, 2023
2ab30d6
Merge branch 'main' into docs_interoperability
TomNicholas Jul 17, 2023
e688526
Merge branch 'main' into docs_interoperability
TomNicholas Jul 17, 2023
3809d50
sel->isel
TomNicholas Jul 17, 2023
1e98361
Merge branch 'docs_internal_design' of https://github.com/TomNicholas…
TomNicholas Jul 17, 2023
3ad0722
Merge branch 'docs_internal_design' into docs_interoperability
TomNicholas Jul 17, 2023
4afea37
add section on lazy indexing
TomNicholas Jul 18, 2023
84e9aa2
actually show lazy indexing example
TomNicholas Jul 18, 2023
0e0a240
Merge branch 'main' into docs_internal_design
TomNicholas Jul 18, 2023
4e98b58
Merge branch 'docs_internal_design' into docs_interoperability
TomNicholas Jul 18, 2023
1c2a5b7
link to custom indexes page
TomNicholas Jul 18, 2023
c8b2653
fix some formatting
TomNicholas Jul 18, 2023
2964c60
put encoding last
TomNicholas Jul 18, 2023
4e41c42
Merge branch 'main' into docs_interoperability
TomNicholas Jul 25, 2023
6e1240f
attrs and encoding are not ordered dicts
TomNicholas Jul 25, 2023
8bc561a
Merge branch 'main' into docs_interoperability
TomNicholas Sep 13, 2023
40e799a
Merge branch 'main' into docs_interoperability
TomNicholas Oct 4, 2023
b2b8338
reword lack of support for subclassing
TomNicholas Oct 4, 2023
1f9c17c
Merge branch 'main' into docs_interoperability
TomNicholas Oct 4, 2023
a4a72cd
remove duplicate word
TomNicholas Oct 4, 2023
bc0d55d
encourage contributions to supporting subclassing
TomNicholas Oct 4, 2023
ae4619b
Merge branch 'main' into docs_interoperability
TomNicholas Oct 26, 2023
1 change: 1 addition & 0 deletions doc/conf.py
@@ -324,6 +324,7 @@
"cftime": ("https://unidata.github.io/cftime", None),
"sparse": ("https://sparse.pydata.org/en/latest/", None),
"cubed": ("https://tom-e-white.com/cubed/", None),
"datatree": ("https://xarray-datatree.readthedocs.io/en/latest/", None),
}


2 changes: 1 addition & 1 deletion doc/internals/duck-arrays-integration.rst
@@ -35,7 +35,7 @@ Python Array API standard support
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

As an integration library xarray benefits greatly from the standardization of duck-array libraries' APIs, and so is a
big supporter of the `Python Array API Standard <https://data-apis.org/array-api/latest/>`_. .
big supporter of the `Python Array API Standard <https://data-apis.org/array-api/latest/>`_.

We aim to support any array libraries that follow the Array API standard out-of-the-box. However, xarray does occasionally
call some numpy functions which are not (yet) part of the standard (e.g. :py:meth:`xarray.DataArray.pad` calls :py:func:`numpy.pad`).
10 changes: 10 additions & 0 deletions doc/internals/extending-xarray.rst
@@ -14,6 +14,11 @@ Xarray is designed as a general purpose library and hence tries to avoid
including overly domain specific functionality. But inevitably, the need for more
domain specific logic arises.

.. _internals.accessors.composition:

Composition over Inheritance
----------------------------

One potential solution to this problem is to subclass Dataset and/or DataArray to
add domain specific functionality. However, inheritance is not very robust. It's
easy to inadvertently use internal APIs when subclassing, which means that your
@@ -28,6 +33,11 @@ If you simply want the ability to call a function with the syntax of a
method call, then the builtin :py:meth:`~xarray.DataArray.pipe` method (copied
from pandas) may suffice.
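
As a small illustration of the ``pipe`` approach, the sketch below applies an ordinary function with method-call syntax. The ``rescale`` helper and its ``factor`` argument are hypothetical names invented for this example, not part of xarray.

```python
import numpy as np
import xarray as xr


def rescale(da, factor):
    """Hypothetical domain-specific helper: multiply the data by a constant."""
    return da * factor


da = xr.DataArray(np.arange(3.0), dims="x")

# ``pipe`` calls ``rescale(da, factor=10)`` but reads like a method call,
# so it chains naturally with other DataArray methods.
result = da.pipe(rescale, factor=10).sum()
print(float(result))
```

This keeps the domain logic in a plain function, so nothing depends on xarray's internal APIs.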

.. _internals.accessors.writing accessors:

Writing Custom Accessors
------------------------

To resolve this issue for more complex cases, xarray has the
:py:func:`~xarray.register_dataset_accessor` and
:py:func:`~xarray.register_dataarray_accessor` decorators for adding custom
2 changes: 2 additions & 0 deletions doc/internals/how-to-create-custom-index.rst
@@ -1,5 +1,7 @@
.. currentmodule:: xarray

.. _internals.custom indexes:

How to create a custom index
============================

12 changes: 6 additions & 6 deletions doc/internals/index.rst
@@ -1,6 +1,6 @@
.. _internals:

xarray Internals
Xarray Internals
================

Xarray builds upon two of the foundational libraries of the scientific Python
@@ -11,18 +11,18 @@ compiled code to :ref:`optional dependencies<installing>`.
The pages in this section are intended for:

* Contributors to xarray who wish to better understand some of the internals,
* Developers who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
* Developers who wish to interface xarray with their existing tooling, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.

* Developers from other fields who wish to extend xarray with domain-specific logic, perhaps to support a new scientific community of users,
* Developers of other packages who wish to interface xarray with their existing tools, e.g. by creating a plugin for reading a new file format, or wrapping a custom array type.

.. toctree::
:maxdepth: 2
:hidden:

variable-objects
internal-design
interoperability
duck-arrays-integration
chunked-arrays
extending-xarray
zarr-encoding-spec
how-to-add-new-backend
how-to-create-custom-index
zarr-encoding-spec
224 changes: 224 additions & 0 deletions doc/internals/internal-design.rst
@@ -0,0 +1,224 @@
.. ipython:: python
:suppress:

import numpy as np
import pandas as pd
import xarray as xr

np.random.seed(123456)
np.set_printoptions(threshold=20)

.. _internal design:

Internal Design
===============

This page gives an overview of the internal design of xarray.

In total, the Xarray project defines four key data structures.
In order of increasing complexity, they are:

- :py:class:`xarray.Variable`,
- :py:class:`xarray.DataArray`,
- :py:class:`xarray.Dataset`,
- :py:class:`datatree.DataTree`.

The user guide lists only :py:class:`xarray.DataArray` and :py:class:`xarray.Dataset`,
but :py:class:`~xarray.Variable` is the fundamental object internally,
and :py:class:`~datatree.DataTree` is a natural generalisation of :py:class:`xarray.Dataset`.

.. note::

Our :ref:`roadmap` includes plans both to document :py:class:`~xarray.Variable` as fully public API,
and to merge the `xarray-datatree <https://github.com/xarray-contrib/datatree>`_ package into xarray's main repository.

Internally, private :ref:`lazy indexing classes <internal design.lazy indexing>` are used to avoid loading more data than necessary,
and flexible index classes (derived from :py:class:`~xarray.indexes.Index`) provide performant label-based lookups.


.. _internal design.data structures:

Data Structures
---------------

The :ref:`data structures` page in the user guide explains the basics and concentrates on user-facing behavior,
whereas this section explains how xarray's data structure classes actually work internally.


.. _internal design.data structures.variable:

Variable Objects
~~~~~~~~~~~~~~~~

The core internal data structure in xarray is the :py:class:`~xarray.Variable`,
which is used as the basic building block behind xarray's
:py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` types. A
:py:class:`~xarray.Variable` consists of:

- ``dims``: A tuple of dimension names.
- ``data``: The N-dimensional array (typically a NumPy or Dask array) storing
the Variable's data. It must have the same number of dimensions as the length
of ``dims``.
- ``attrs``: A dictionary of metadata associated with this array. By
convention, xarray's built-in operations never use this metadata.
- ``encoding``: Another dictionary used to store information about how
this variable's data is represented on disk. See :ref:`io.encoding` for more
details.

:py:class:`~xarray.Variable` has an interface similar to NumPy arrays, but extended to make use
of named dimensions. For example, it uses ``dim`` in preference to an ``axis``
argument for methods like ``mean``, and supports :ref:`compute.broadcasting`.

However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not
include coordinate labels along each axis.
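
The description above can be made concrete with a short sketch that builds a ``Variable`` by hand. The dimension names, array values and ``units`` attribute are made up for illustration.

```python
import numpy as np
import xarray as xr

# A Variable is just dims + data + attrs + encoding, with no coordinate labels.
var = xr.Variable(
    dims=("x", "y"),
    data=np.arange(6.0).reshape(2, 3),
    attrs={"units": "K"},
)

# Reductions take a dimension name rather than a positional ``axis``.
mean_over_y = var.mean(dim="y")

# Arithmetic broadcasts by dimension name, so a new dimension "z" is appended.
other = xr.Variable(dims=("z",), data=np.array([100.0, 200.0]))
total = var + other

print(mean_over_y.dims, mean_over_y.values)
print(total.dims, total.shape)
```

Because broadcasting is driven by names, ``var`` and ``other`` never need to be reshaped manually to line up their axes.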

:py:class:`~xarray.Variable` is public API, but because of its incomplete support for labeled
data, it is mostly intended for advanced uses, such as in xarray itself, for
writing new backends, or when creating custom indexes.
You can access the variable objects that correspond to xarray objects via the (read-only)
:py:attr:`Dataset.variables <xarray.Dataset.variables>` and
:py:attr:`DataArray.variable <xarray.DataArray.variable>` attributes.
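
A minimal sketch of reaching the underlying variables through these two attributes follows. The dataset contents are invented for the example.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={"temperature": (("x",), np.array([280.0, 285.0, 290.0]))},
    coords={"x": [10, 20, 30]},
)

# Dataset.variables maps names to Variable objects, for data *and* coordinates.
print(sorted(ds.variables))

# DataArray.variable is the single Variable wrapped by the DataArray.
var = ds["temperature"].variable
print(type(var).__name__, var.dims)

# Coordinates backed by a default index are stored as IndexVariable,
# which is a subclass of Variable.
print(type(ds.variables["x"]).__name__)
```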


.. _internal design.dataarray:

DataArray Objects
~~~~~~~~~~~~~~~~~

The simplest data structure used by most users is :py:class:`~xarray.DataArray`.
A :py:class:`~xarray.DataArray` is a composite object consisting of multiple
:py:class:`~xarray.core.variable.Variable` objects which store related data.

A single :py:class:`~xarray.Variable` is referred to as the "data variable", and stored under the :py:attr:`~xarray.DataArray.variable` attribute.
A :py:class:`~xarray.DataArray` inherits all of the properties of this data variable, i.e. ``dims``, ``data``, ``attrs`` and ``encoding``,
all of which are implemented by forwarding on to the underlying ``Variable`` object.

In addition, a :py:class:`~xarray.DataArray` holds further ``Variable`` objects in a dict under the private ``_coords`` attribute,
each of which is referred to as a "Coordinate Variable". These coordinate variable objects are only allowed to have ``dims`` that are a subset of the data variable's ``dims``,
and each dim has a specific length. This means that the full shape of the dataarray can be represented by a dictionary mapping dimension names to integer sizes (exposed as :py:attr:`~xarray.DataArray.sizes`).
The underlying data variable has exactly these sizes, and each attached coordinate variable has sizes that are a subset of them.
Another way of saying this is that all coordinate variables must be "alignable" with the data variable.

When a coordinate is accessed by the user (e.g. via the dict-like :py:meth:`~xarray.DataArray.__getitem__` syntax),
then a new ``DataArray`` is constructed by finding all coordinate variables that have compatible dimensions and re-attaching them before the result is returned.
This is why most users never see the ``Variable`` class underlying each coordinate variable - it is always promoted to a ``DataArray`` before returning.
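
The composition described above can be seen in a small sketch. The coordinate names (including the non-index coordinate ``label``) and values are invented for illustration.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={
        "x": [10, 20],
        "y": ["a", "b", "c"],
        "label": ("x", ["left", "right"]),  # non-index coordinate along "x"
    },
    name="field",
)

# Properties such as ``dims`` are forwarded from the wrapped data variable.
print(da.dims == da.variable.dims)

# Mapping of dimension name to length for the whole object.
print(dict(da.sizes))

# Accessing a coordinate returns a DataArray, not a bare Variable...
x = da["x"]
print(type(x).__name__)

# ...and coordinates whose dims are compatible with it are re-attached.
print(sorted(x.coords))
```

Here ``label`` travels with ``x`` because it only uses the ``x`` dimension, while ``y`` is dropped.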

Lookups are performed by special :py:class:`~xarray.indexes.Index` objects, which are stored in a dict under the private ``_indexes`` attribute.
Indexes must be associated with one or more coordinates, and essentially act by translating a query given in physical coordinate space
(typically via the :py:meth:`~xarray.DataArray.sel` method) into a set of integer indices in array index space that can be used to index the underlying n-dimensional array-like ``data``.
Indexing in array index space (typically performed via the :py:meth:`~xarray.DataArray.isel` method) does not require consulting an ``Index`` object.
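
The difference between the two kinds of lookup can be sketched as follows, assuming the default pandas-backed index that xarray creates for a one-dimensional coordinate. The labels and values are arbitrary.

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.array([1.0, 2.0, 3.0]),
    dims="x",
    coords={"x": [10, 20, 30]},
)

# Label-based lookup: the index attached to "x" translates the label 20
# into an integer position before the underlying array is indexed.
by_label = da.sel(x=20)

# Position-based lookup: indexes the array directly, no Index is consulted.
by_position = da.isel(x=1)

# The default index wraps a pandas.Index, which performs the translation.
position = da.indexes["x"].get_loc(20)
print(position)
print(float(by_label), float(by_position))
```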

Finally a :py:class:`~xarray.DataArray` defines a :py:attr:`~xarray.DataArray.name` attribute, which refers to its data
variable but is stored on the wrapping ``DataArray`` class.
The ``name`` attribute is primarily used when one or more :py:class:`~xarray.DataArray` objects are promoted into a :py:class:`~xarray.Dataset`
(e.g. via :py:meth:`~xarray.DataArray.to_dataset`).
Note that the underlying :py:class:`~xarray.Variable` objects are all unnamed, so they can always be referred to uniquely via a
dict-like mapping.
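
As a brief sketch of how ``name`` is used during promotion to a ``Dataset`` (the name ``counts`` is arbitrary):

```python
import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(3), dims="x", name="counts")

# The name becomes the key of the data variable in the new Dataset.
ds = da.to_dataset()
print(list(ds.data_vars))

# Without a name there is no key to use, so one must be supplied explicitly.
unnamed = xr.DataArray(np.arange(3), dims="x")
message = None
try:
    unnamed.to_dataset()
except ValueError as err:
    message = str(err)
print(message)

named = unnamed.to_dataset(name="counts")
print(list(named.data_vars))
```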

.. _internal design.dataset:

Dataset Objects
~~~~~~~~~~~~~~~

The :py:class:`~xarray.Dataset` class is a generalization of the :py:class:`~xarray.DataArray` class that can hold multiple data variables.
Internally, all data variables and coordinate variables are stored under a single ``variables`` dict, and coordinates are
specified by storing their names in a private ``_coord_names`` set.

The dataset's ``dims`` are the set of all dims present across any variable, but (as with dataarrays) coordinate
variables cannot have a dimension that is not present on any data variable.

When a data variable or coordinate variable is accessed, a new ``DataArray`` is again constructed from all compatible
coordinates before returning.
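
A short sketch of this layout, using only public attributes, is shown below. The variable names and shapes are made up for the example.

```python
import numpy as np
import xarray as xr

ds = xr.Dataset(
    data_vars={
        "temperature": (("time", "x"), np.zeros((2, 3))),
        "pressure": (("x",), np.ones(3)),
    },
    coords={"time": [0, 1], "x": [10, 20, 30]},
)

# Data variables and coordinates are two views onto one mapping of variables.
print(sorted(ds.data_vars))
print(sorted(ds.coords))
print(set(ds.variables) == set(ds.data_vars) | set(ds.coords))

# Selecting a variable builds a DataArray carrying only compatible coordinates:
# "time" is dropped because "pressure" has no "time" dimension.
pressure = ds["pressure"]
print(type(pressure).__name__, sorted(pressure.coords))
```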

.. _internal design.subclassing:

.. note::

The way that selecting a variable from a ``DataArray`` or ``Dataset`` actually involves internally wrapping the
``Variable`` object back up into a ``DataArray``/``Dataset`` is the primary reason :ref:`we recommend against subclassing <internals.accessors.composition>`
Xarray objects. The main problem it creates is that we currently cannot easily guarantee that for example selecting
a coordinate variable from your ``SubclassedDataArray`` would return an instance of ``SubclassedDataArray`` instead
of just an :py:class:`xarray.DataArray`. See `GH issue #3980 <https://github.com/pydata/xarray/issues/3980>`_ for more details.

.. _internal design.lazy indexing:

Lazy Indexing Classes
---------------------

Lazy Loading
~~~~~~~~~~~~

If we open a ``Variable`` object from disk using :py:func:`~xarray.open_dataset` we can see that the actual values of
the array wrapped by the data variable are not displayed.

.. ipython:: python

da = xr.tutorial.open_dataset("air_temperature")["air"]
var = da.variable
var

We can see the size, and the dtype of the underlying array, but not the actual values.
This is because the values have not yet been loaded.

If we look at the private attribute :py:attr:`~xarray.Variable._data` containing the underlying array object, we see
something interesting:

.. ipython:: python

var._data

You're looking at one of xarray's internal `Lazy Indexing Classes`. These powerful classes are hidden from the user,
but provide important functionality.

Calling the public :py:attr:`~xarray.Variable.data` property loads the underlying array into memory.

.. ipython:: python

var.data

This array is now cached, which we can see by accessing the private attribute again:

.. ipython:: python

var._data

Lazy Indexing
~~~~~~~~~~~~~

The purpose of these lazy indexing classes is to prevent more data being loaded into memory than is necessary for the
subsequent analysis, by deferring loading data until after indexing is performed.

Let's open the data from disk again.

.. ipython:: python

da = xr.tutorial.open_dataset("air_temperature")["air"]
var = da.variable

Now, notice how even after subsetting the data does not get loaded:

.. ipython:: python

var.isel(time=0)

The shape has changed, but the values are still not shown.

Looking at the private attribute again shows how this indexing information was propagated via the hidden lazy indexing classes:

.. ipython:: python

var.isel(time=0)._data

.. note::

Currently only certain indexing operations are lazy, not all array operations. For discussion of making all array
operations lazy see `GH issue #5081 <https://github.com/pydata/xarray/issues/5081>`_.


Lazy Dask Arrays
~~~~~~~~~~~~~~~~

Note that xarray's implementation of Lazy Indexing classes is completely separate from how :py:class:`dask.array.Array`
objects evaluate lazily. Dask-backed xarray objects delay almost all operations until :py:meth:`~xarray.DataArray.compute`
is called (either explicitly or implicitly via :py:meth:`~xarray.DataArray.plot` for example). The exceptions to this
laziness are operations whose output shape is data-dependent, such as when calling :py:meth:`~xarray.DataArray.where`.
45 changes: 45 additions & 0 deletions doc/internals/interoperability.rst
Review comment (Collaborator): Really cool page!

@@ -0,0 +1,45 @@
.. _interoperability:

Interoperability of Xarray
==========================

Xarray is designed to be extremely interoperable, in many orthogonal ways.
Making xarray as flexible as possible is the common theme of most of the goals on our development :ref:`roadmap`.

This interoperability comes via a set of flexible abstractions that users can plug into. The current full list is:

- :ref:`Custom file backends <add_a_backend>` via the :py:class:`~xarray.backends.BackendEntrypoint` system,
- NumPy-like :ref:`"duck" array wrapping <internals.duckarrays>`, which supports the `Python Array API Standard <https://data-apis.org/array-api/latest/>`_,
- :ref:`Chunked distributed array computation <internals.chunkedarrays>` via the :py:class:`~xarray.core.parallelcompat.ChunkManagerEntrypoint` system,
- Custom :py:class:`~xarray.Index` objects for :ref:`flexible label-based lookups <internals.custom indexes>`,
- Extending xarray objects with domain-specific methods via :ref:`custom accessors <internals.accessors>`.
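
As a rough illustration of the first item in the list above, the sketch below defines a toy backend and passes the class directly as ``engine``. The ``PlainTextBackend`` class, the ``values`` variable and the ``row``/``col`` dimension names are hypothetical, and the file is parsed eagerly, whereas a real backend would usually wrap the data lazily.

```python
import os
import tempfile

import numpy as np
import xarray as xr
from xarray.backends import BackendEntrypoint


class PlainTextBackend(BackendEntrypoint):
    """Hypothetical backend reading a whitespace-delimited table of numbers."""

    def open_dataset(self, filename_or_obj, *, drop_variables=None):
        # Eagerly parse the whole file into a 2D NumPy array.
        values = np.loadtxt(filename_or_obj, ndmin=2)
        return xr.Dataset({"values": (("row", "col"), values)})


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "table.txt")
    np.savetxt(path, np.arange(6.0).reshape(2, 3))
    # Passing the class itself avoids having to register an entry point.
    ds = xr.open_dataset(path, engine=PlainTextBackend)

print(ds["values"].shape)
```

Registering the class under a package entry point would instead let users select it with a string such as ``engine='my_custom_format'``.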

.. warning::

One notable gap in this flexibility is subclassing: whilst subclassing xarray objects is possible, we
generally advise against it, instead recommending composition over inheritance. See the
:ref:`internal design page <internal design.subclassing>` and `GH issue #3980 <https://github.com/pydata/xarray/issues/3980>`_
for more details.

.. note::

If you think there is another way in which xarray could become more generically flexible then please
tell us your ideas by `raising an issue to request the feature <https://github.com/pydata/xarray/issues/new/choose>`_!


Whilst xarray was originally designed specifically to open ``netCDF4`` files as :py:class:`numpy.ndarray` objects labelled by :py:class:`pandas.Index` objects,
it is entirely possible today to:

- lazily open an xarray object directly from a custom binary file format (e.g. using ``xarray.open_dataset(path, engine='my_custom_format')``),
- handle the data as any API-compliant numpy-like array type (e.g. sparse or GPU-backed),
- distribute out-of-core computation across that array type in parallel (e.g. via :ref:`dask`),
- track the physical units of the data through computations (e.g. via `pint-xarray <https://pint-xarray.readthedocs.io/en/stable/>`_),
- query the data via custom index logic optimized for specific applications (e.g. an :py:class:`~xarray.Index` object backed by a KDTree structure),
- attach domain-specific logic via accessor methods (e.g. to understand geographic Coordinate Reference System metadata),
- organize hierarchical groups of xarray data in a :py:class:`~datatree.DataTree` (e.g. to treat heterogeneous simulation and observational data together during analysis).
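
To make the accessor item in the list above more concrete, here is a minimal sketch of attaching domain-specific logic by composition. The accessor name ``demo_stats`` and its ``anomaly`` method are hypothetical and chosen only for this example.

```python
import numpy as np
import xarray as xr


@xr.register_dataarray_accessor("demo_stats")  # hypothetical accessor name
class DemoStatsAccessor:
    def __init__(self, xarray_obj):
        # The accessor holds a reference to the DataArray it is attached to.
        self._obj = xarray_obj

    def anomaly(self, dim):
        """Return the data minus its mean along ``dim``."""
        return self._obj - self._obj.mean(dim=dim)


da = xr.DataArray(np.array([1.0, 2.0, 3.0]), dims="time")
anomaly = da.demo_stats.anomaly("time")
print(anomaly.values)
```

The extra behaviour lives in a separate namespace on every ``DataArray``, so no subclass is involved.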

All of these features can be provided simultaneously, using libraries compatible with the rest of the scientific Python ecosystem.
In this situation xarray would be essentially a thin wrapper acting as a pure-Python framework, providing a common interface and
separation of concerns via various domain-agnostic abstractions.

Most of the remaining pages in the documentation of xarray's internals describe these various types of interoperability in more detail.
31 changes: 0 additions & 31 deletions doc/internals/variable-objects.rst

This file was deleted.

4 changes: 4 additions & 0 deletions doc/whats-new.rst
@@ -38,6 +38,10 @@ Bug fixes
Documentation
~~~~~~~~~~~~~

- Added page on the internal design of xarray objects.
(:pull:`7991`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Added page on the interoperability of xarray objects.
(:pull:`7992`) By `Tom Nicholas <https://github.com/TomNicholas>`_.
- Add docstrings for the :py:class:`Index` base class and add some documentation on how to
create custom, Xarray-compatible indexes (:pull:`6975`)
By `Benoît Bovy <https://github.com/benbovy>`_.