xarray.backends refactor (pydata#2261)
* WIP: xarray.backends.file_manager for managing file objects.

This is intended to replace both PickleByReconstructionWrapper and
DataStorePickleMixin with something more compartmentalized.

xref GH2121

* Switch rasterio to use FileManager

* lint fixes

* WIP: rewrite FileManager to always use an LRUCache (see the sketch after this list)

* Test coverage

* Don't use move_to_end

* minor clarification

* Switch FileManager.acquire() to a method

* Python 2 compat

* Update xarray.set_options() to add file_cache_maxsize and validation

* Add assert for FILE_CACHE.maxsize

* More docstring for FileManager

* Add accidentally omitted tests for LRUCache

* Adapt scipy backend to use FileManager

* Stickler fix

* Fix failure on Python 2.7

* Finish adjusting backends to use FileManager

* Fix bad import

* WIP on distributed

* More WIP

* Fix distributed write tests

* Fixes

* Minor fixup

* whats new

* More refactoring: remove state from backends entirely

* Cleanup

* Fix failing in-memory datastore tests

* Fix inaccessible datastore

* fix autoclose warnings

* Fix PyNIO failures

* No longer disable HDF5 file locking

We no longer need to explicitly set HDF5_USE_FILE_LOCKING='FALSE' because we
now properly close open files.

* whats new and default file cache size

* Whats new tweak

* Refactor default lock logic to backend classes

* Rename get_resource_lock -> get_write_lock

* Don't acquire unnecessary locks in __getitem__

* Fix bad merge

* Fix import

* Remove unreachable code
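
To make the design described in these commits concrete, here is a minimal, hedged sketch of the idea: a global least-recently-used cache of open file handles, where evicted handles are closed and transparently reopened on the next acquire(). This is an illustrative reconstruction, not xarray's actual implementation; all names below are hypothetical:

import collections
import threading


class LRUCache:
    """A minimal thread-safe least-recently-used mapping.

    When the cache grows past ``maxsize``, the least-recently-used entry
    is evicted and ``on_evict(key, value)`` is called on it.
    """

    def __init__(self, maxsize, on_evict=None):
        self._cache = collections.OrderedDict()
        self._maxsize = maxsize
        self._on_evict = on_evict
        self._lock = threading.Lock()

    def __getitem__(self, key):
        with self._lock:
            # Pop and reinsert instead of using move_to_end(), which keeps
            # the logic portable to older OrderedDict implementations.
            value = self._cache.pop(key)
            self._cache[key] = value
            return value

    def __setitem__(self, key, value):
        with self._lock:
            self._cache[key] = value
            while len(self._cache) > self._maxsize:
                k, v = self._cache.popitem(last=False)
                if self._on_evict is not None:
                    self._on_evict(k, v)


# Evicted handles are closed; a later acquire() simply reopens the file.
FILE_CACHE = LRUCache(maxsize=128, on_evict=lambda key, f: f.close())


class CachingFileManager:
    """Opens a file lazily and reopens it transparently after eviction."""

    def __init__(self, opener, *args, cache=FILE_CACHE, **kwargs):
        self._opener = opener  # e.g. netCDF4.Dataset, h5py.File or open
        self._args = args
        self._kwargs = kwargs
        self._cache = cache
        self._key = (opener, args, tuple(sorted(kwargs.items())))

    def acquire(self):
        try:
            return self._cache[self._key]
        except KeyError:
            file = self._opener(*self._args, **self._kwargs)
            self._cache[self._key] = file
            return file

A manager such as CachingFileManager(open, 'data.txt', mode='r') hands back the same cached handle from every acquire() until the cache evicts it, at which point the file is silently reopened on demand.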
shoyer authored Oct 9, 2018
1 parent 5b4d160 commit 289b377
Showing 28 changed files with 1,496 additions and 983 deletions.
1 change: 1 addition & 0 deletions asv_bench/asv.conf.json
@@ -64,6 +64,7 @@
"scipy": [""],
"bottleneck": ["", null],
"dask": [""],
"distributed": [""],
},


41 changes: 41 additions & 0 deletions asv_bench/benchmarks/dataset_io.py
@@ -1,5 +1,7 @@
from __future__ import absolute_import, division, print_function

import os

import numpy as np
import pandas as pd

@@ -14,6 +16,9 @@
pass


os.environ['HDF5_USE_FILE_LOCKING'] = 'FALSE'


class IOSingleNetCDF(object):
    """
    A few examples that benchmark reading/writing a single netCDF file with
@@ -405,3 +410,39 @@ def time_open_dataset_scipy_with_time_chunks(self):
        with dask.set_options(get=dask.multiprocessing.get):
            xr.open_mfdataset(self.filenames_list, engine='scipy',
                              chunks=self.time_chunks)


def create_delayed_write():
    import dask.array as da
    vals = da.random.random(300, chunks=(1,))
    ds = xr.Dataset({'vals': (['a'], vals)})
    return ds.to_netcdf('file.nc', engine='netcdf4', compute=False)


class IOWriteNetCDFDask(object):
    timeout = 60
    repeat = 1
    number = 5

    def setup(self):
        requires_dask()
        self.write = create_delayed_write()

    def time_write(self):
        self.write.compute()


class IOWriteNetCDFDaskDistributed(object):
    def setup(self):
        try:
            import distributed
        except ImportError:
            raise NotImplementedError
        self.client = distributed.Client()
        self.write = create_delayed_write()

    def cleanup(self):
        self.client.shutdown()

    def time_write(self):
        self.write.compute()
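
The two benchmark classes above rely on the fact that ``to_netcdf(..., compute=False)`` returns a Dask-delayed object rather than writing immediately; calling ``.compute()`` triggers the actual write. A minimal standalone illustration (the file name is arbitrary, and the local ``distributed.Client`` is only one possible deployment):

import dask.array as da
import distributed
import xarray as xr

client = distributed.Client()  # local cluster; workers share the write

ds = xr.Dataset({'vals': (['a'], da.random.random(300, chunks=(1,)))})
delayed = ds.to_netcdf('example.nc', engine='netcdf4', compute=False)
delayed.compute()  # the actual write happens here, not at to_netcdf()

client.shutdown()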
3 changes: 3 additions & 0 deletions doc/api.rst
@@ -624,3 +624,6 @@ arguments for the ``from_store`` and ``dump_to_store`` Dataset methods:
backends.H5NetCDFStore
backends.PydapDataStore
backends.ScipyDataStore
backends.FileManager
backends.CachingFileManager
backends.DummyFileManager
19 changes: 16 additions & 3 deletions doc/whats-new.rst
@@ -33,14 +33,27 @@ v0.11.0 (unreleased)
Breaking changes
~~~~~~~~~~~~~~~~

- Xarray's storage backends now automatically open and close files when
  necessary, rather than requiring opening a file with ``autoclose=True``. A
  global least-recently-used cache is used to store open files; the default
  limit of 128 open files should suffice in most cases, but can be adjusted
  if necessary with ``xarray.set_options(file_cache_maxsize=...)``. The
  ``autoclose`` argument to ``open_dataset`` and related functions has been
  deprecated and is now a no-op.

  This change, along with an internal refactor of xarray's storage backends,
  should significantly improve performance when reading and writing netCDF
  files with Dask, especially when working with many files or using
  Dask Distributed. By `Stephan Hoyer <https://github.com/shoyer>`_.

- Reduction of :py:meth:`DataArray.groupby` and :py:meth:`DataArray.resample`
  without dimension argument will change in the next release.
  For now, a ``FutureWarning`` is issued.
  By `Keisuke Fujii <https://github.com/fujiisoup>`_.

Documentation
~~~~~~~~~~~~~

Enhancements
~~~~~~~~~~~~

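
As a usage note for the Breaking changes entry above, adjusting the file cache is a one-liner; the value 512 below is an arbitrary illustration:

import xarray as xr

# Allow up to 512 simultaneously open files instead of the default 128.
xr.set_options(file_cache_maxsize=512)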
4 changes: 4 additions & 0 deletions xarray/backends/__init__.py
@@ -4,6 +4,7 @@
formats. They should not be used directly, but rather through Dataset objects.
"""
from .common import AbstractDataStore
from .file_manager import FileManager, CachingFileManager, DummyFileManager
from .memory import InMemoryDataStore
from .netCDF4_ import NetCDF4DataStore
from .pydap_ import PydapDataStore
@@ -15,6 +16,9 @@

__all__ = [
    'AbstractDataStore',
    'FileManager',
    'CachingFileManager',
    'DummyFileManager',
    'InMemoryDataStore',
    'NetCDF4DataStore',
    'PydapDataStore',
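Because ``CachingFileManager`` is now part of the public ``xarray.backends`` namespace, external code can reuse it to manage its own file handles. A hedged sketch (the file name is hypothetical, and the exact constructor signature may differ between xarray versions):

import netCDF4

from xarray.backends import CachingFileManager

# The manager records how to open the file but defers the actual open.
manager = CachingFileManager(netCDF4.Dataset, 'observations.nc', mode='r')

nc = manager.acquire()  # opens (or fetches a cached) netCDF4.Dataset
print(sorted(nc.variables))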
