cache rasterio example files #4102

Merged Mar 24, 2021 · 58 commits
Changes from 10 commits
a8b0022
add a open_rasterio function to tutorial
keewis May 27, 2020
3997c27
put the cache directory into .cache if that exists
keewis May 27, 2020
b45e28d
raise an error if the status code is not 200
keewis May 27, 2020
e972d84
use the cached file if possible
keewis May 27, 2020
7b5e6d5
add a test to check that the caching does not affect the result
keewis May 27, 2020
e435103
use the new tutorial function in the visualization gallery
keewis May 27, 2020
75dac85
Merge branch 'master' into refactor-tutorial
keewis Jun 24, 2020
8fe0c92
fix the temporary directory creation
keewis Jun 27, 2020
783baeb
rewrite open_dataset to use the same functions as open_rasterio
keewis Jun 27, 2020
d6623ff
make sure the context manager on a pathlib object always works
keewis Jun 27, 2020
f834883
require requests
keewis Jun 28, 2020
5ddadab
add requests to most CI
keewis Jun 28, 2020
21f090d
split into two context managers
keewis Jun 29, 2020
068b660
use is_dir instead of exists to check for .cache
keewis Jun 29, 2020
c5d1490
reword a few comments
keewis Jun 29, 2020
b457c19
properly credit the SO answer
keewis Jun 29, 2020
c9faac1
add a pseudo-atomic open function
keewis Jun 29, 2020
f16a15f
add a random part to the file so concurrent calls are not an issue
keewis Jun 29, 2020
f9abf31
vendor appdirs.user_cache_dir and use it to determine the default cache
keewis Jun 30, 2020
c276876
properly vendor appdirs
keewis Jun 30, 2020
fdb9b11
suppress FileNotFoundErrors while removing a file
keewis Jun 30, 2020
427b4cf
silence mypy
keewis Jun 30, 2020
bd2cafc
make sure to convert string paths to pathlib.Path
keewis Jun 30, 2020
9d4bce3
convert the result of appdirs.user_cache_dir to pathlib.Path
keewis Jun 30, 2020
4297920
add the comment about switching to unlink(missing_ok=True)
keewis Jun 30, 2020
9fde5d6
Merge branch 'master' into refactor-tutorial
keewis Jun 30, 2020
adfce27
use requests.codes.ok instead of the numeric value
keewis Jul 16, 2020
ea9d4dc
remove the md5 checking code
keewis Jul 16, 2020
d828720
Merge branch 'master' into refactor-tutorial
keewis Jul 16, 2020
745de23
try to make the comment clearer
keewis Jul 16, 2020
c1229ae
Merge branch 'master' into refactor-tutorial
keewis Jul 26, 2020
29b78ed
typo
keewis Jul 26, 2020
8d04bba
isort
keewis Jul 26, 2020
dd08972
Merge branch 'master' into refactor-tutorial
keewis Nov 6, 2020
77e2487
remove all code related to the detection of the application directory
keewis Dec 9, 2020
e8b4a00
Merge branch 'master' into refactor-tutorial
keewis Dec 9, 2020
f730a85
Merge branch 'master' into refactor-tutorial
keewis Jan 12, 2021
39e7d77
use pooch for caching and fetching the files
keewis Jan 15, 2021
9134d7a
remove requests from the CI environments
keewis Jan 15, 2021
fa89822
add pooch to the environment used by the py38-flaky CI
keewis Jan 15, 2021
56d7e49
remove the install_requires on requests and the vendor mypy ignore [s…
keewis Jan 15, 2021
b848c66
add pooch to the doc environment [skip-ci]
keewis Jan 15, 2021
dab96de
Merge branch 'master' into refactor-tutorial
keewis Mar 20, 2021
405a90e
ignore missing type hints for pooch
keewis Mar 20, 2021
cc6cade
Merge branch 'master' into refactor-tutorial
keewis Mar 22, 2021
817a6ba
add a mapping of external urls
keewis Mar 23, 2021
0b60eb0
remove tutorial.open_rasterio
keewis Mar 23, 2021
d82b816
remove the github_url and branch PRs
keewis Mar 23, 2021
719765c
allow opening rasterio files using open_dataset
keewis Mar 23, 2021
0a489c4
remove the reference to xarray.tutorial.open_dataset
keewis Mar 23, 2021
0335473
rename engine_overrides to overrides
keewis Mar 23, 2021
cc3eb3c
update the docstring
keewis Mar 23, 2021
04e15fb
update the rasterio test
keewis Mar 23, 2021
6f02db4
use explicitly passed values for engine
keewis Mar 23, 2021
653e5fd
use open_dataset instead of open_rasterio
keewis Mar 23, 2021
b250b38
convert back to a data array [skip-ci]
keewis Mar 23, 2021
78c11f5
write the files to a temporary cache directory
keewis Mar 23, 2021
b532879
typo
keewis Mar 23, 2021
6 changes: 2 additions & 4 deletions doc/examples/visualization_gallery.ipynb
@@ -209,8 +209,7 @@
"metadata": {},
"outputs": [],
"source": [
"url = 'https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif'\n",
"da = xr.open_rasterio(url)\n",
"da = xr.tutorial.open_rasterio(\"RGB.byte\")\n",
"\n",
"# The data is in UTM projection. We have to set it manually until\n",
"# https://github.com/SciTools/cartopy/issues/813 is implemented\n",
@@ -246,8 +245,7 @@
"from rasterio.warp import transform\n",
"import numpy as np\n",
"\n",
"url = 'https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif'\n",
"da = xr.open_rasterio(url)\n",
"da = xr.tutorial.open_rasterio(\"RGB.byte\")\n",
"\n",
"# Compute the lon/lat coordinates with rasterio.warp.transform\n",
"ny, nx = len(da['y']), len(da['x'])\n",
9 changes: 6 additions & 3 deletions xarray/tests/test_tutorial.py
@@ -13,9 +13,7 @@ class TestLoadDataset:
@pytest.fixture(autouse=True)
def setUp(self):
self.testfile = "tiny"
self.testfilepath = os.path.expanduser(
os.sep.join(("~", ".xarray_tutorial_data", self.testfile))
)
self.testfilepath = tutorial._default_cache_dir / self.testfile
with suppress(OSError):
os.remove(f"{self.testfilepath}.nc")
with suppress(OSError):
@@ -30,3 +28,8 @@ def test_download_from_github_load_without_cache(self):
ds_nocache = tutorial.open_dataset(self.testfile, cache=False).load()
ds_cache = tutorial.open_dataset(self.testfile).load()
assert_identical(ds_cache, ds_nocache)

def test_download_rasterio_from_github_load_without_cache(self):
ds_nocache = tutorial.open_rasterio("RGB.byte", cache=False).load()
ds_cache = tutorial.open_rasterio("RGB.byte", cache=True).load()
assert_identical(ds_cache, ds_nocache)
138 changes: 97 additions & 41 deletions xarray/tutorial.py
@@ -6,23 +6,78 @@

"""
import hashlib
import os as _os
from urllib.request import urlretrieve
import pathlib
import shutil
import tempfile

import numpy as np
import requests

from .backends.api import open_dataset as _open_dataset
from .backends.rasterio_ import open_rasterio as _open_rasterio
from .core.dataarray import DataArray
from .core.dataset import Dataset

_default_cache_dir = _os.sep.join(("~", ".xarray_tutorial_data"))
_cache_name = "xarray_tutorial_data"
_cache_dir = pathlib.Path.home() / ".cache"
# TODO: I/O on import. Might not be a good idea.
if _cache_dir.exists():
_default_cache_dir = _cache_dir / _cache_name
else:
_default_cache_dir = pathlib.Path.home() / f".{_cache_name}"


def file_md5_checksum(fname):
hash_md5 = hashlib.md5()
with open(fname, "rb") as f:
hash_md5.update(f.read())
return hash_md5.hexdigest()
def check_md5sum(content, checksum):
md5 = hashlib.md5()
md5.update(content)
md5sum = md5.hexdigest()

return md5sum == checksum


def download_to(url, path):
with requests.get(url, stream=True) as r, path.open("wb") as f:

Member:
nit: I usually find it a little easier to read when you nest context managers:

Suggested change
-    with requests.get(url, stream=True) as r, path.open("wb") as f:
+    with requests.get(url, stream=True) as r:
+        with path.open("wb") as f:
+            ...


keewis (Collaborator, Author):
done, although that has the disadvantage of an additional indentation level and it suggests that the order in which the context managers are entered is important.
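As an aside on the two spellings: the comma-separated form is equivalent to literal nesting — managers are entered left to right and exited in reverse order. A minimal, self-contained sketch (the `tag` helper is illustrative, not from the PR) demonstrating that the order is identical:

```python
import contextlib


@contextlib.contextmanager
def tag(name, log):
    # record entry and exit so the ordering is observable
    log.append(f"enter {name}")
    try:
        yield
    finally:
        log.append(f"exit {name}")


log = []
# comma form: entered left to right, exited in reverse
with tag("a", log), tag("b", log):
    pass
assert log == ["enter a", "enter b", "exit b", "exit a"]

log.clear()
# literal nesting produces exactly the same sequence
with tag("a", log):
    with tag("b", log):
        pass
assert log == ["enter a", "enter b", "exit b", "exit a"]
```

So the choice really is purely about readability, plus the (cosmetic) hint about ordering mentioned above.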

if r.status_code != 200:
raise OSError(f"download failed: {r.reason}")

r.raw.decode_content = True
shutil.copyfileobj(r.raw, f)


def open_rasterio(
name,
cache=True,
cache_dir=_default_cache_dir,
github_url="https://github.com/mapbox/rasterio",
branch="master",
**kws,
):
if not cache_dir.is_dir():
cache_dir.mkdir()

default_extension = ".tif"

if cache:
path = cache_dir / name
# need to always do that, otherwise the context manager might fail
cache_dir = pathlib.Path(cache_dir)
else:
cache_dir = tempfile.TemporaryDirectory()
path = pathlib.Path(cache_dir.name) / name

if not path.suffix:
path = path.with_suffix(default_extension)
elif path.suffix == ".byte":
path = path.with_name(name + default_extension)

if cache and path.is_file():
return _open_rasterio(path, **kws)

url = f"{github_url}/raw/{branch}/tests/data/{path.name}"
# make sure the directory is deleted afterwards

Member:
I don't think this is currently done? should this comment be removed then?


keewis (Collaborator, Author), Jun 29, 2020:
it is, if cache_dir was created using TemporaryDirectory, using it as a context manager deletes the directory at the end of the block (if it is a pathlib object, we can't do I/O with it after the block). So I guess the comment is just not accurate?

with cache_dir:

Member:
I feel like I'm missing something -- what does using a pathlib.Path object as a context manager do?

I don't think it automatically deletes the files after using it?


keewis (Collaborator, Author), Jul 1, 2020:
no, of course not. cache_dir can be either a pathlib.Path instance or a TemporaryDirectory instance. Using a Path object as a context manager will "close" it so you can't use it for I/O anymore (unlink, read_text, etc. won't work anymore) while a TemporaryDirectory instance will delete the temporary directory it created.

Not sure if there's a way to make that easier to read.
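The `TemporaryDirectory` half of this behaviour is easy to verify: using the instance as a context manager (as the PR does with `with cache_dir:`) removes the directory and everything in it when the block exits. A standalone sketch, not PR code:

```python
import pathlib
import tempfile

tmpdir = tempfile.TemporaryDirectory()
with tmpdir:
    # the directory name lives on the instance, mirroring
    # `pathlib.Path(cache_dir.name) / name` in the PR
    path = pathlib.Path(tmpdir.name) / "RGB.byte.tif"
    path.write_bytes(b"fake tif payload")
    assert path.is_file()

# leaving the block deleted the directory and its contents
assert not pathlib.Path(tmpdir.name).exists()
```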

download_to(url, path)

Member:
The reason why the MD5 hash is checked below is to guard against incomplete / failed downloads. This is actually a pretty common scenario, e.g., if a user hits Ctrl+C because downloading is taking too long (e.g., if they have a poor internet connection). So I think this would be important to include for rasterio as well.

That said, I'm not sure we really need to use MD5 verification, which is rather annoying to setup and requires downloading extra files. Instead, I would suggest ensuring that download_to always deletes the downloaded file if an exception is raised, e.g., based on https://stackoverflow.com/a/27045091/809705:

def download_to(url, path):
    try:
        ...  # current contents of `download_to`
    except Exception:
        # switch to path.unlink(missing_ok=True) for Python 3.8+
        with contextlib.suppress(FileNotFoundError):
            path.unlink()
        raise

This would probably work in about 90% of cases.

There may be a few others where a computer is shut down suddenly in the middle to writing a file. To handle this, it's probably a good idea to try to do an atomic write:
https://stackoverflow.com/a/2333979/809705


keewis (Collaborator, Author), Jun 29, 2020:

rasterio doesn't provide precomputed checksums so we'd have to go with the atomic write.

The idea from that answer (which is actually pretty common, rsync uses that, too) is to write to a temporary file in the same directory and to rename once the write was successful.

The disadvantage is that if the program was forcibly interrupted (e.g. using SIGKILL) this will leave the temporary file behind which will have to be cleaned up manually. Not sure if it's worth it, but we could warn if there are any stale files that don't have open file descriptors.

Edit: we could also make this deliberately not suitable for concurrency (and thus much less complex) by using a deterministic name. That way we would download to a different file (.<name>, maybe?) while making sure the download itself doesn't fail and afterwards rename to the final name. If we try again after a failed attempt (i.e. the rename did not complete) we will redownload, overwriting .<name>.


keewis (Collaborator, Author):

I implemented the "atomic" write with a deterministic temporary name, which (as long as we properly detect something like disconnected sockets / broken pipes) should guard against incomplete writes and also avoid stale files. The only disadvantage I know of is that concurrent calls will interfere with each other.


Member:

My main concern here would that if two invocations of tutorial.open_dataset start at the same time (e.g., in different processes). Then it's possible that both could partially write to the same temporary file, which could get moved into place with an inconsistent state.

I think it's probably a better idea to make the temporary file with a random name (like the atomicwrites package). In the worst case you end up with an extra file, but that's better than an bad state that requires manual intervention to resolve.

Another option is to vendor a copy of atomicwrites (e.g., as well as appdirs). It's a single file project so this would be relatively straightforward.


keewis (Collaborator, Author):

sure. I changed it to use NamedTemporaryFile to generate a unique file within the same directory and rename once it's done downloading. If the downloading succeeded, the rename will always work (not sure if that's the case on windows, though)
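The resulting pattern — write to a uniquely named temporary file in the destination directory, rename into place only on success, and clean up on failure — can be sketched independently of the PR. This sketch uses `urllib` instead of `requests` so it stays dependency-free; the names are illustrative, not the PR's actual code:

```python
import contextlib
import os
import pathlib
import shutil
import tempfile
import urllib.request


def download_to(url, path):
    """Fetch `url` into `path` via a uniquely named temporary file,
    renaming it into place only after the download completed."""
    path = pathlib.Path(path)
    tmp = None
    try:
        # NamedTemporaryFile picks a random name in the same directory,
        # so concurrent calls cannot partially write to the same file
        with tempfile.NamedTemporaryFile(
            dir=path.parent, prefix=f".{path.name}.", delete=False
        ) as f:
            tmp = pathlib.Path(f.name)
            with urllib.request.urlopen(url) as r:
                shutil.copyfileobj(r, f)
        # rename after a complete write; os.replace is atomic on POSIX,
        # so readers never observe a half-written file
        os.replace(tmp, path)
    except BaseException:
        # remove the partial download (also on Ctrl+C); switch to
        # tmp.unlink(missing_ok=True) on Python 3.8+
        if tmp is not None:
            with contextlib.suppress(FileNotFoundError):
                tmp.unlink()
        raise
```

Interrupted or failed downloads leave the destination untouched, at the cost of a possible stray dot-file only if the process is killed uncleanly (e.g. SIGKILL).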

return _open_rasterio(path, **kws)


# idea borrowed from Seaborn
@@ -61,44 +116,45 @@ def open_dataset(
xarray.open_dataset

"""
root, ext = _os.path.splitext(name)
if not ext:
ext = ".nc"
fullname = root + ext
longdir = _os.path.expanduser(cache_dir)
localfile = _os.sep.join((longdir, fullname))
md5name = fullname + ".md5"
md5file = _os.sep.join((longdir, md5name))

if not _os.path.exists(localfile):

# This will always leave this directory on disk.
# May want to add an option to remove it.
if not _os.path.isdir(longdir):
_os.mkdir(longdir)

url = "/".join((github_url, "raw", branch, fullname))
urlretrieve(url, localfile)
url = "/".join((github_url, "raw", branch, md5name))
urlretrieve(url, md5file)

localmd5 = file_md5_checksum(localfile)
with open(md5file, "r") as f:
remotemd5 = f.read()
if localmd5 != remotemd5:
_os.remove(localfile)

def construct_url(full_name):
return f"{github_url}/raw/{branch}/{full_name}"

if not cache_dir.is_dir():
cache_dir.mkdir()

default_extension = ".nc"

if cache:
path = cache_dir / name
# need to always do that, otherwise the context manager might fail
cache_dir = pathlib.Path(cache_dir)
else:
cache_dir = tempfile.TemporaryDirectory()
path = pathlib.Path(cache_dir.name) / name

if not path.suffix:
path = path.with_suffix(default_extension)

if cache and path.is_file():
return _open_dataset(path, **kws)

# make sure the directory is deleted afterwards if it was temporary
with cache_dir:
download_to(construct_url(path.name), path)

# verify the checksum (md5 guards only against transport corruption)
md5_path = path.with_name(path.name + ".md5")
download_to(construct_url(md5_path.name), md5_path)
if not check_md5sum(path.read_bytes(), md5_path.read_text()):
path.unlink()
md5_path.unlink()
msg = """
MD5 checksum does not match, try downloading dataset again.
"""
raise OSError(msg)

ds = _open_dataset(localfile, **kws)

if not cache:
ds = ds.load()
_os.remove(localfile)

return ds
return _open_dataset(path, **kws)


def load_dataset(*args, **kwargs):