Decorators for registering custom accessors in xarray #806

Merged
merged 2 commits into from
May 13, 2016
4 changes: 2 additions & 2 deletions README.rst
Expand Up @@ -115,5 +115,5 @@ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.

xarray includes portions of pandas, NumPy and Seaborn. Their licenses are
included in the licenses directory.
xarray includes portions of pandas, NumPy, Seaborn and Python itself. These
licenses are included in the licenses directory.
39 changes: 23 additions & 16 deletions doc/api.rst
Expand Up @@ -411,22 +411,6 @@ DataArray methods
DataArray.load
DataArray.chunk

Backends (experimental)
-----------------------

These backends provide a low-level interface for lazily loading data from
external file-formats or protocols, and can be manually invoked to create
arguments for the ``from_store`` and ``dump_to_store`` Dataset methods.

.. autosummary::
   :toctree: generated/

   backends.NetCDF4DataStore
   backends.H5NetCDFStore
   backends.PydapDataStore
   backends.ScipyDataStore


Plotting
========

Expand All @@ -441,3 +425,26 @@ Plotting
plot.line
plot.pcolormesh
plot.FacetGrid

Advanced API
============

.. autosummary::
   :toctree: generated/

   Variable
   Coordinate
   register_dataset_accessor
   register_dataarray_accessor

These backends provide a low-level interface for lazily loading data from
external file-formats or protocols, and can be manually invoked to create
arguments for the ``from_store`` and ``dump_to_store`` Dataset methods:

.. autosummary::
   :toctree: generated/

   backends.NetCDF4DataStore
   backends.H5NetCDFStore
   backends.PydapDataStore
   backends.ScipyDataStore
2 changes: 2 additions & 0 deletions doc/computation.rst
Expand Up @@ -150,6 +150,8 @@ Finally, we can manually iterate through ``Rolling`` objects:
for label, arr_window in r:
    # arr_window is a view of x

.. _compute.broadcasting:

Broadcasting by dimension name
==============================

22 changes: 22 additions & 0 deletions doc/examples/_code/accessor_example.py
@@ -0,0 +1,22 @@
import xarray as xr


@xr.register_dataset_accessor('geo')
class GeoAccessor(object):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj
        self._center = None

    @property
    def center(self):
        """Return the geographic center point of this dataset."""
        if self._center is None:
            # we can use a cache on our accessor objects, because accessors
            # themselves are cached on instances that access them.
            lon = self._obj.longitude
            lat = self._obj.latitude
            self._center = (float(lon.mean()), float(lat.mean()))
        return self._center

    def plot(self):
        """Plot data on a map."""
        return 'plotting!'
3 changes: 2 additions & 1 deletion doc/index.rst
Expand Up @@ -35,6 +35,7 @@ Documentation
.. toctree::
   :maxdepth: 1

   whats-new
   why-xarray
   examples
   installing
Expand All @@ -51,7 +52,7 @@ Documentation
   plotting
   api
   faq
   whats-new
   internals

See also
--------
2 changes: 2 additions & 0 deletions doc/installing.rst
@@ -1,3 +1,5 @@
.. _installing:

Installation
============

138 changes: 138 additions & 0 deletions doc/internals.rst
@@ -0,0 +1,138 @@
.. _internals:

xarray Internals
================

.. currentmodule:: xarray

xarray builds upon two of the foundational libraries of the scientific Python
stack, NumPy and pandas. It is written in pure Python (no C or Cython
extensions), which makes it easy to develop and extend. Instead, we push
compiled code to :ref:`optional dependencies<installing>`.

Variable objects
----------------

The core internal data structure in xarray is the :py:class:`~xarray.Variable`,
which is used as the basic building block behind xarray's
:py:class:`~xarray.Dataset` and :py:class:`~xarray.DataArray` types. A
``Variable`` consists of:

- ``dims``: A tuple of dimension names.
- ``data``: The N-dimensional array (typically, a NumPy or Dask array) storing
  the Variable's data. It must have the same number of dimensions as the length
  of ``dims``.
- ``attrs``: An ordered dictionary of metadata associated with this array. By
  convention, xarray's built-in operations never use this metadata.
- ``encoding``: Another ordered dictionary used to store information about how
  this variable's data is represented on disk. See :ref:`io.encoding` for more
  details.

``Variable`` has an interface similar to NumPy arrays, but extended to make use
of named dimensions. For example, it uses ``dim`` in preference to an ``axis``
argument for methods like ``mean``, and supports :ref:`compute.broadcasting`.

However, unlike ``Dataset`` and ``DataArray``, the basic ``Variable`` does not
include coordinate labels along each axis.
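
As a minimal sketch of the points above, the following builds a ``Variable``
directly and exercises named dimensions. The dimension names, values and
``units`` attribute are illustrative assumptions, not taken from xarray's own
documentation:

```python
import numpy as np
import xarray as xr

# A Variable pairs an N-dimensional array with one name per dimension.
temp = xr.Variable(
    ("time", "site"),
    np.arange(6.0).reshape(2, 3),
    attrs={"units": "degC"},  # free-form metadata (illustrative value)
)

# Reductions take a dimension name rather than a positional axis.
site_mean = temp.mean(dim="time")
print(site_mean.dims)

# Arithmetic broadcasts by dimension name, not by position.
offset = xr.Variable(("site",), np.array([10.0, 20.0, 30.0]))
weights = xr.Variable(("level",), np.array([1.0, 2.0]))
combined = (temp + offset) * weights
print(combined.dims, combined.shape)
```

Note that none of these objects carries coordinate labels; only the dimension
names link the arrays together.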

``Variable`` is public API, but because of its incomplete support for labeled
data, it is mostly intended for advanced uses, such as in xarray itself or for
writing new backends. You can access the variable objects that correspond to
xarray objects via the (readonly) :py:attr:`Dataset.variables
<xarray.Dataset.variables>` and
:py:attr:`DataArray.variable <xarray.DataArray.variable>` attributes.
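
For illustration, here is one way to reach the underlying ``Variable`` objects
from a ``DataArray`` and a ``Dataset``. The array contents and the name
``"example"`` are placeholders chosen for this sketch:

```python
import numpy as np
import xarray as xr

da = xr.DataArray(
    np.zeros((2, 3)),
    dims=("x", "y"),
    coords={"x": [10, 20]},
    name="example",  # hypothetical name, for illustration only
)
ds = da.to_dataset()

# DataArray.variable is the underlying Variable, without coordinate labels.
var = da.variable
print(type(var).__name__, var.dims)

# Dataset.variables maps every name, data variables and coordinates alike,
# to its variable object.
print(sorted(ds.variables))
```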

Extending xarray
----------------

.. ipython:: python
    :suppress:

    import numpy as np
    import pandas as pd
    import xarray as xr
    np.random.seed(123456)

xarray is designed as a general purpose library, and hence tries to avoid
including overly domain specific methods. But inevitably, the need for more
domain specific logic arises.

One standard solution to this problem is to subclass Dataset and/or DataArray to
add domain specific functionality. However, inheritance is not very robust. It's
easy to inadvertently use internal APIs when subclassing, which means that your
code may break when xarray is upgraded. Furthermore, many built-in methods
return only native xarray objects, not instances of your subclass.

The standard advice is to use `composition over inheritance`__, but
reimplementing an API as large as xarray's on your own objects can be an onerous
task, even if most methods are only forwarding to xarray implementations.

__ https://github.com/pydata/xarray/issues/706

To resolve this dilemma, xarray has the experimental
:py:func:`~xarray.register_dataset_accessor` and
:py:func:`~xarray.register_dataarray_accessor` decorators for adding custom
"accessors" on xarray objects. Here's how you might use these decorators to
write a custom "geo" accessor implementing a geography specific extension to
xarray:

.. literalinclude:: examples/_code/accessor_example.py

This achieves the same result as if the ``Dataset`` class had a cached property
defined that returns an instance of your class:

.. code-block:: python

    class Dataset:
        ...

        @property
        def geo(self):
            return GeoAccessor(self)

However, using the register accessor decorators is preferable to simply adding
your own ad-hoc property (i.e., ``Dataset.geo = property(...)``), for two
reasons:

1. It ensures that the name of your property does not conflict with any other
   attributes or methods.
2. Instances of the accessor object will be cached on the xarray object that
   creates them. This means you can save state on them (e.g., to cache computed
   properties).
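
To make these two guarantees concrete, here is a pure-Python sketch of how a
registration decorator with per-instance caching could be built from a
descriptor. It is not xarray's actual implementation: ``CachedAccessor``,
``register_accessor``, ``Record`` and ``StatsAccessor`` are hypothetical names,
and raising on a name clash is an assumption of this sketch.

```python
class CachedAccessor:
    """Descriptor that builds an accessor once per host instance."""

    def __init__(self, name, accessor_cls):
        self._name = name
        self._accessor_cls = accessor_cls

    def __get__(self, obj, owner=None):
        if obj is None:
            # Accessed on the class itself: return the accessor class.
            return self._accessor_cls
        cache = obj.__dict__.setdefault("_accessor_cache", {})
        if self._name not in cache:
            cache[self._name] = self._accessor_cls(obj)
        return cache[self._name]


def register_accessor(name, host_cls):
    """Hypothetical helper: attach ``name`` to ``host_cls``, refusing clashes."""

    def decorator(accessor_cls):
        if hasattr(host_cls, name):
            raise AttributeError(
                "%s already has an attribute named %r" % (host_cls.__name__, name)
            )
        setattr(host_cls, name, CachedAccessor(name, accessor_cls))
        return accessor_cls

    return decorator


class Record:
    """Stand-in for a Dataset-like host class."""

    def __init__(self, values):
        self.values = values


@register_accessor("stats", Record)
class StatsAccessor:
    def __init__(self, record):
        self._record = record
        self.calls = 0  # state survives because the accessor is cached

    def total(self):
        self.calls += 1
        return sum(self._record.values)


rec = Record([1, 2, 3])
print(rec.stats.total(), rec.stats.total(), rec.stats.calls)
print(rec.stats is rec.stats)
```

Because the same accessor instance is returned on every access, the ``calls``
counter accumulates across calls instead of resetting each time.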

Back in an interactive IPython session, we can use these properties:

.. ipython:: python
    :suppress:

    exec(open("examples/_code/accessor_example.py").read())

.. ipython:: python

    ds = xr.Dataset({'longitude': np.linspace(0, 10),
                     'latitude': np.linspace(0, 20)})
    ds.geo.center
    ds.geo.plot()

The intent here is that libraries that extend xarray could add such an accessor
to implement subclass specific functionality rather than using actual subclasses
or patching in a large number of domain specific methods.
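
The same pattern applies to ``DataArray`` objects through
:py:func:`~xarray.register_dataarray_accessor`. The sketch below is only an
illustration: the accessor name ``span`` and its ``width`` method are
hypothetical and not part of xarray.

```python
import numpy as np
import xarray as xr


# "span" is a hypothetical accessor name chosen for this sketch.
@xr.register_dataarray_accessor("span")
class SpanAccessor(object):
    def __init__(self, xarray_obj):
        self._obj = xarray_obj

    def width(self, dim):
        """Return max minus min of the array along ``dim``."""
        return self._obj.max(dim=dim) - self._obj.min(dim=dim)


da = xr.DataArray(np.array([[1.0, 4.0], [2.0, 8.0]]), dims=("x", "y"))
print(da.span.width("y").values)
```

The domain specific logic lives entirely in ``SpanAccessor``, so the
``DataArray`` class itself is neither subclassed nor patched by hand.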

To help users keep things straight, please `let us know
<https://github.com/pydata/xarray/issues>`_ if you plan to write a new accessor
for an open source library. In the future, we will maintain a list of accessors
and the libraries that implement them on this page.

Here are several existing libraries that build functionality upon xarray.
They may be useful points of reference for your work:

- `xgcm <http://xgcm.readthedocs.org/>`_: General Circulation Model
  Postprocessing. Uses subclassing and custom xarray backends.
- `PyGDX <http://pygdx.readthedocs.org/en/latest/>`_: Python 3 package for
  accessing data stored in GAMS Data eXchange (GDX) files. Also uses a custom
  subclass.
- `windspharm <http://ajdawson.github.io/windspharm/index.html>`_: Spherical
  harmonic wind analysis in Python.
- `eofs <http://ajdawson.github.io/eofs/>`_: EOF analysis in Python.

.. TODO: consider adding references to these projects somewhere more prominent
.. in the documentation? maybe the FAQ page?
16 changes: 8 additions & 8 deletions doc/io.rst
Expand Up @@ -97,13 +97,11 @@ string, e.g., to access subgroup 'bar' within group 'foo' pass
pass ``mode='a'`` to ``to_netcdf`` to ensure that each call does not delete the
file.

Data is loaded lazily from netCDF files. You can manipulate, slice and subset
Data is always loaded lazily from netCDF files. You can manipulate, slice and subset
Dataset and DataArray objects, and no array values are loaded into memory until
you try to perform some sort of actual computation. For an example of how these
lazy arrays work, see the OPeNDAP section below.

.. todo: clarify this WRT dask.array

It is important to note that when you modify values of a Dataset, even one
linked to files on disk, only the in-memory copy you are manipulating in xarray
is modified: the original file on disk is never touched.
Expand All @@ -124,11 +122,13 @@ netCDF file. However, it's often cleaner to use a ``with`` statement:
with xr.open_dataset('saved_on_disk.nc') as ds:
    print(ds.keys())

.. Although xarray provides reasonable support for incremental reads of files on
disk, it does not yet support incremental writes, which is important for
dealing with datasets that do not fit into memory. This is a significant
shortcoming that we hope to resolve (:issue:`199`) by adding the ability to
create ``Dataset`` objects directly linked to a netCDF file on disk.
Although xarray provides reasonable support for incremental reads of files on
disk, it does not support incremental writes, which can be a useful strategy
for dealing with datasets too big to fit into memory. Instead, xarray integrates
with dask.array (see :ref:`dask`), which provides a fully featured engine for
streaming computation.

.. _io.encoding:

Reading encoded data
~~~~~~~~~~~~~~~~~~~~
5 changes: 5 additions & 0 deletions doc/whats-new.rst
Expand Up @@ -32,6 +32,11 @@ Enhancements
attributes are retained in the resampled object. By
`Jeremy McGibbon <https://github.com/mcgibbon>`_.

- New (experimental) decorators :py:func:`~xarray.register_dataset_accessor` and
  :py:func:`~xarray.register_dataarray_accessor` for registering custom xarray
  extensions without subclassing. They are described in the new documentation
  page on :ref:`internals`. By `Stephan Hoyer <https://github.com/shoyer>`_.

Bug fixes
~~~~~~~~~

2 changes: 2 additions & 0 deletions xarray/__init__.py
@@ -1,5 +1,7 @@
from .core.alignment import align, broadcast, broadcast_arrays
from .core.combine import concat, auto_combine
from .core.extensions import (register_dataarray_accessor,
register_dataset_accessor)
from .core.variable import Variable, Coordinate
from .core.dataset import Dataset
from .core.dataarray import DataArray
10 changes: 5 additions & 5 deletions xarray/core/dataarray.py
Expand Up @@ -1376,19 +1376,19 @@ def imag(self):

def dot(self, other):
"""Perform dot product of two DataArrays along their shared dims.

Equivalent to taking tensordot over all shared dims.

Parameters
----------
other : DataArray
The other array with which the dot product is performed.

Returns
-------
result : DataArray
Array resulting from the dot product over all shared dimensions.

See also
--------
np.tensordot(a, b, axes)
Expand All @@ -1397,10 +1397,10 @@ def dot(self, other):
--------

>>> da_vals = np.arange(6 * 5 * 4).reshape((6, 5, 4))
>>> da = DataArray(da_vals, dims=['x', 'y', 'z'])
>>> da = DataArray(da_vals, dims=['x', 'y', 'z'])
>>> dm_vals = np.arange(4)
>>> dm = DataArray(dm_vals, dims=['z'])

>>> dm.dims
('z',)
>>> da.dims