cache rasterio example files #4102

Merged
merged 58 commits into from
Mar 24, 2021
Changes from 57 commits
a8b0022
add a open_rasterio function to tutorial
keewis May 27, 2020
3997c27
put the cache directory into .cache if that exists
keewis May 27, 2020
b45e28d
raise an error if the status code is not 200
keewis May 27, 2020
e972d84
use the cached file if possible
keewis May 27, 2020
7b5e6d5
add a test to check that the caching does not affect the result
keewis May 27, 2020
e435103
use the new tutorial function in the visualization gallery
keewis May 27, 2020
75dac85
Merge branch 'master' into refactor-tutorial
keewis Jun 24, 2020
8fe0c92
fix the temporary directory creation
keewis Jun 27, 2020
783baeb
rewrite open_dataset to use the same functions as open_rasterio
keewis Jun 27, 2020
d6623ff
make sure the context manager on a pathlib object always works
keewis Jun 27, 2020
f834883
require requests
keewis Jun 28, 2020
5ddadab
add requests to most CI
keewis Jun 28, 2020
21f090d
split into two context managers
keewis Jun 29, 2020
068b660
use is_dir instead of exists to check for .cache
keewis Jun 29, 2020
c5d1490
reword a few comments
keewis Jun 29, 2020
b457c19
properly credit the SO answer
keewis Jun 29, 2020
c9faac1
add a pseudo-atomic open function
keewis Jun 29, 2020
f16a15f
add a random part to the file so concurrent calls are not an issue
keewis Jun 29, 2020
f9abf31
vendor appdirs.user_cache_dir and use it to determine the default cache
keewis Jun 30, 2020
c276876
properly vendor appdirs
keewis Jun 30, 2020
fdb9b11
suppress FileNotFoundErrors while removing a file
keewis Jun 30, 2020
427b4cf
silence mypy
keewis Jun 30, 2020
bd2cafc
make sure to convert string paths to pathlib.Path
keewis Jun 30, 2020
9d4bce3
convert the result of appdirs.user_cache_dir to pathlib.Path
keewis Jun 30, 2020
4297920
add the comment about switching to unlink(missing_ok=True)
keewis Jun 30, 2020
9fde5d6
Merge branch 'master' into refactor-tutorial
keewis Jun 30, 2020
adfce27
use requests.codes.ok instead of the numeric value
keewis Jul 16, 2020
ea9d4dc
remove the md5 checking code
keewis Jul 16, 2020
d828720
Merge branch 'master' into refactor-tutorial
keewis Jul 16, 2020
745de23
try to make the comment clearer
keewis Jul 16, 2020
c1229ae
Merge branch 'master' into refactor-tutorial
keewis Jul 26, 2020
29b78ed
typo
keewis Jul 26, 2020
8d04bba
isort
keewis Jul 26, 2020
dd08972
Merge branch 'master' into refactor-tutorial
keewis Nov 6, 2020
77e2487
remove all code related to the detection of the application directory
keewis Dec 9, 2020
e8b4a00
Merge branch 'master' into refactor-tutorial
keewis Dec 9, 2020
f730a85
Merge branch 'master' into refactor-tutorial
keewis Jan 12, 2021
39e7d77
use pooch for caching and fetching the files
keewis Jan 15, 2021
9134d7a
remove requests from the CI environments
keewis Jan 15, 2021
fa89822
add pooch to the environment used by the py38-flaky CI
keewis Jan 15, 2021
56d7e49
remove the install_requires on requests and the vendor mypy ignore [s…
keewis Jan 15, 2021
b848c66
add pooch to the doc environment [skip-ci]
keewis Jan 15, 2021
dab96de
Merge branch 'master' into refactor-tutorial
keewis Mar 20, 2021
405a90e
ignore missing type hints for pooch
keewis Mar 20, 2021
cc6cade
Merge branch 'master' into refactor-tutorial
keewis Mar 22, 2021
817a6ba
add a mapping of external urls
keewis Mar 23, 2021
0b60eb0
remove tutorial.open_rasterio
keewis Mar 23, 2021
d82b816
remove the github_url and branch PRs
keewis Mar 23, 2021
719765c
allow opening rasterio files using open_dataset
keewis Mar 23, 2021
0a489c4
remove the reference to xarray.tutorial.open_dataset
keewis Mar 23, 2021
0335473
rename engine_overrides to overrides
keewis Mar 23, 2021
cc3eb3c
update the docstring
keewis Mar 23, 2021
04e15fb
update the rasterio test
keewis Mar 23, 2021
6f02db4
use explicitly passed values for engine
keewis Mar 23, 2021
653e5fd
use open_dataset instead of open_rasterio
keewis Mar 23, 2021
b250b38
convert back to a data array [skip-ci]
keewis Mar 23, 2021
78c11f5
write the files to a temporary cache directory
keewis Mar 23, 2021
b532879
typo
keewis Mar 23, 2021
1 change: 1 addition & 0 deletions ci/requirements/doc.yml
@@ -20,6 +20,7 @@ dependencies:
- numba
- numpy>=1.17
- pandas>=1.0
- pooch
- pip
- pydata-sphinx-theme>=0.4.3
- rasterio>=1.1
1 change: 1 addition & 0 deletions ci/requirements/environment.yml
@@ -27,6 +27,7 @@ dependencies:
- pandas
- pint
- pip=20.2
- pooch
- pre-commit
- pseudonetcdf
- pydap
6 changes: 2 additions & 4 deletions doc/examples/visualization_gallery.ipynb
@@ -209,8 +209,7 @@
"metadata": {},
"outputs": [],
"source": [
"url = 'https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif'\n",
"da = xr.open_rasterio(url)\n",
"da = xr.tutorial.open_dataset(\"RGB.byte\").data\n",
"\n",
"# The data is in UTM projection. We have to set it manually until\n",
"# https://github.com/SciTools/cartopy/issues/813 is implemented\n",
@@ -246,8 +245,7 @@
"from rasterio.warp import transform\n",
"import numpy as np\n",
"\n",
"url = 'https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif'\n",
"da = xr.open_rasterio(url)\n",
"da = xr.tutorial.open_dataset(\"RGB.byte\").data\n",
"\n",
"# Compute the lon/lat coordinates with rasterio.warp.transform\n",
"ny, nx = len(da['y']), len(da['x'])\n",
3 changes: 3 additions & 0 deletions setup.cfg
@@ -208,6 +208,8 @@ ignore_missing_imports = True
ignore_missing_imports = True
[mypy-pint.*]
ignore_missing_imports = True
[mypy-pooch.*]
ignore_missing_imports = True
[mypy-PseudoNetCDF.*]
ignore_missing_imports = True
[mypy-pydap.*]
@@ -233,6 +235,7 @@ ignore_missing_imports = True
[mypy-xarray.core.pycompat]
ignore_errors = True


[aliases]
test = pytest

27 changes: 16 additions & 11 deletions xarray/tests/test_tutorial.py
@@ -1,5 +1,4 @@
import os
from contextlib import suppress

import pytest

@@ -13,20 +12,26 @@ class TestLoadDataset:
@pytest.fixture(autouse=True)
def setUp(self):
self.testfile = "tiny"
self.testfilepath = os.path.expanduser(
os.sep.join(("~", ".xarray_tutorial_data", self.testfile))
)
with suppress(OSError):
os.remove(f"{self.testfilepath}.nc")
with suppress(OSError):
os.remove(f"{self.testfilepath}.md5")

def test_download_from_github(self):

def test_download_from_github(self, tmp_path, monkeypatch):
monkeypatch.setenv("XDG_CACHE_DIR", os.fspath(tmp_path))

ds = tutorial.open_dataset(self.testfile).load()
tiny = DataArray(range(5), name="tiny").to_dataset()
assert_identical(ds, tiny)

def test_download_from_github_load_without_cache(self):
def test_download_from_github_load_without_cache(self, tmp_path, monkeypatch):
monkeypatch.setenv("XDG_CACHE_DIR", os.fspath(tmp_path))

ds_nocache = tutorial.open_dataset(self.testfile, cache=False).load()
ds_cache = tutorial.open_dataset(self.testfile).load()
assert_identical(ds_cache, ds_nocache)

def test_download_rasterio_from_github_load_without_cache(
self, tmp_path, monkeypatch
):
monkeypatch.setenv("XDG_CACHE_DIR", os.fspath(tmp_path))

ds_nocache = tutorial.open_dataset("RGB.byte", cache=False).load()
ds_cache = tutorial.open_dataset("RGB.byte", cache=True).load()
assert_identical(ds_cache, ds_nocache)
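
The tests above keep downloads out of the user's real cache by pointing an environment variable at pytest's `tmp_path`. The same isolation pattern can be sketched with the standard library alone. This is only an illustration: the variable name `EXAMPLE_CACHE_DIR` and the helper `cache_dir` are hypothetical and are not part of xarray or pooch.

```python
import os
import pathlib
import tempfile
from unittest import mock

ENV_VAR = "EXAMPLE_CACHE_DIR"  # hypothetical variable name, for illustration only


def cache_dir():
    """Hypothetical helper: read the cache location from the environment."""
    fallback = os.path.join(tempfile.gettempdir(), "example_cache")
    return pathlib.Path(os.environ.get(ENV_VAR, fallback))


with tempfile.TemporaryDirectory() as tmp:
    # patch.dict restores os.environ when the block exits,
    # much like monkeypatch.setenv is undone after a test
    with mock.patch.dict(os.environ, {ENV_VAR: tmp}):
        redirected = cache_dir()
        # the file lands in the temporary directory, not the real cache
        (redirected / "tiny.nc").write_bytes(b"")
        assert redirected == pathlib.Path(tmp)
```

Because the temporary directory is removed afterwards, each run starts from an empty cache, which is what lets the cached and uncached code paths be compared reliably.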
121 changes: 67 additions & 54 deletions xarray/tutorial.py
@@ -5,33 +5,45 @@
* building tutorials in the documentation.
"""
import hashlib
import os as _os
from urllib.request import urlretrieve
import os
import pathlib

import numpy as np

from .backends.api import open_dataset as _open_dataset
from .backends.rasterio_ import open_rasterio
from .core.dataarray import DataArray
from .core.dataset import Dataset

_default_cache_dir = _os.sep.join(("~", ".xarray_tutorial_data"))

def _open_rasterio(path, engine=None, **kwargs):
data = open_rasterio(path, **kwargs)
name = data.name if data.name is not None else "data"
return data.to_dataset(name=name)

def file_md5_checksum(fname):
hash_md5 = hashlib.md5()
with open(fname, "rb") as f:
hash_md5.update(f.read())
return hash_md5.hexdigest()

_default_cache_dir_name = "xarray_tutorial_data"
base_url = "https://github.com/pydata/xarray-data"
version = "master"


external_urls = {
"RGB.byte": (
"rasterio",
"https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif",
Collaborator Author commented:

Once we switch to versioned data (i.e. set version to a tag), we might want to do that for this URL, too. Then we should also be able to specify hash values and have pooch check them automatically.

),
}
overrides = {
"rasterio": _open_rasterio,
}
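
The review comment on the external URL raises the option of specifying hash values so that downloads are checked automatically. As a rough, stdlib-only sketch of what such a check involves (this is not pooch's implementation; the helper names `sha256_of` and `verify` and the "algorithm:hexdigest" string form are assumptions made for this example):

```python
import hashlib
import pathlib
import tempfile


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, known_hash):
    """Raise ValueError if the file at `path` does not match `known_hash`.

    Only "sha256:<hexdigest>" is handled in this sketch.
    """
    algorithm, _, expected = known_hash.partition(":")
    if algorithm != "sha256":
        raise ValueError(f"unsupported hash algorithm: {algorithm!r}")
    actual = sha256_of(path)
    if actual != expected:
        raise ValueError(f"hash mismatch for {path}: expected {expected}, got {actual}")


# Usage with a throwaway file (placeholder content, not a real tutorial dataset)
with tempfile.TemporaryDirectory() as tmp:
    example = pathlib.Path(tmp) / "example.bin"
    example.write_bytes(b"tutorial data")
    good = "sha256:" + hashlib.sha256(b"tutorial data").hexdigest()
    verify(example, good)  # passes silently when the digest matches
```

Unlike the removed `file_md5_checksum`, which read the whole file into memory, this version hashes in chunks. A fixed hash is only meaningful once the URL points at immutable content such as a tag, which is the precondition the comment names.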


# idea borrowed from Seaborn
def open_dataset(
name,
engine=None,
cache=True,
cache_dir=_default_cache_dir,
github_url="https://github.com/pydata/xarray-data",
branch="master",
cache_dir=None,
**kws,
):
"""
@@ -42,61 +54,62 @@ def open_dataset(
Parameters
----------
name : str
Name of the file containing the dataset. If no suffix is given, assumed
to be netCDF ('.nc' is appended)
Name of the file containing the dataset.
e.g. 'air_temperature'
cache_dir : str, optional
engine : str, optional
The engine to use.
cache_dir : path-like, optional
The directory in which to search for and write cached data.
cache : bool, optional
If True, then cache data locally for use on subsequent calls
github_url : str
Github repository where the data is stored
branch : str
The git branch to download from
kws : dict, optional
Passed to xarray.open_dataset
Notes
-----
Available datasets:
* ``"air_temperature"``
* ``"rasm"``
* ``"ROMS_example"``
* ``"tiny"``
* ``"era5-2mt-2019-03-uk.grib"``
* ``"RGB.byte"``: example rasterio file from https://github.com/mapbox/rasterio
Comment on lines +72 to +77
Collaborator Author commented:

Not sure if Notes is the right section; this could also be part of the function's description.

It would also be good to add a small description of the data.

See Also
--------
xarray.open_dataset
"""
root, ext = _os.path.splitext(name)
if not ext:
ext = ".nc"
fullname = root + ext
longdir = _os.path.expanduser(cache_dir)
localfile = _os.sep.join((longdir, fullname))
md5name = fullname + ".md5"
md5file = _os.sep.join((longdir, md5name))

if not _os.path.exists(localfile):

# This will always leave this directory on disk.
# May want to add an option to remove it.
if not _os.path.isdir(longdir):
_os.mkdir(longdir)

url = "/".join((github_url, "raw", branch, fullname))
urlretrieve(url, localfile)
url = "/".join((github_url, "raw", branch, md5name))
urlretrieve(url, md5file)

localmd5 = file_md5_checksum(localfile)
with open(md5file) as f:
remotemd5 = f.read()
if localmd5 != remotemd5:
_os.remove(localfile)
msg = """
MD5 checksum does not match, try downloading dataset again.
"""
raise OSError(msg)

ds = _open_dataset(localfile, **kws)

try:
import pooch
except ImportError:
raise ImportError("using the tutorial data requires pooch")

if isinstance(cache_dir, pathlib.Path):
cache_dir = os.fspath(cache_dir)
elif cache_dir is None:
cache_dir = pooch.os_cache(_default_cache_dir_name)

if name in external_urls:
engine_, url = external_urls[name]
if engine is None:
engine = engine_
else:
# process the name
default_extension = ".nc"
path = pathlib.Path(name)
if not path.suffix:
path = path.with_suffix(default_extension)

url = f"{base_url}/raw/{version}/{path.name}"

_open = overrides.get(engine, _open_dataset)
# retrieve the file
filepath = pooch.retrieve(url=url, known_hash=None, path=cache_dir)
ds = _open(filepath, engine=engine, **kws)
if not cache:
ds = ds.load()
_os.remove(localfile)
pathlib.Path(filepath).unlink()

return ds
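
The name handling in the new `open_dataset` can be tried in isolation, without pooch or network access. The sketch below copies `base_url`, `version`, and `external_urls` from the diff; the wrapper function `resolve` is hypothetical and exists only to make the logic testable.

```python
import pathlib

# Values copied from the diff above
base_url = "https://github.com/pydata/xarray-data"
version = "master"
external_urls = {
    "RGB.byte": (
        "rasterio",
        "https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif",
    ),
}


def resolve(name, engine=None):
    """Return (engine, url) for a tutorial dataset name (hypothetical helper)."""
    if name in external_urls:
        default_engine, url = external_urls[name]
        # an explicitly passed engine wins over the registered default
        return (engine if engine is not None else default_engine), url

    # names without a suffix are assumed to be netCDF files
    path = pathlib.Path(name)
    if not path.suffix:
        path = path.with_suffix(".nc")
    return engine, f"{base_url}/raw/{version}/{path.name}"


print(resolve("tiny"))
print(resolve("RGB.byte"))
```

Note that the external mapping is consulted first. Without that entry, `pathlib` would treat `.byte` in `"RGB.byte"` as a suffix, so no `.nc` would be appended and the URL would point into the xarray-data repository instead.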
