cache rasterio example files #4102
Conversation
the exception class might not be the best choice, though.
For these kinds of cached files, would it be worth using something like Pooch for xarray?
Using an existing library for that instead of writing our own caching / downloading functions does seem useful (the code there seems to be much more sophisticated than anything in the …). We could also make it an optional dependency, but that would mean that the …
Thanks @keewis. Just one minor question,
though the tests are only run on py38
I would suggest just creating the full directory tree, e.g., with …
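For reference, a minimal sketch of that suggestion (the cache_dir name and location here are assumptions, not the PR's actual code):

    from pathlib import Path

    # hypothetical cache location; parents=True creates the whole tree,
    # exist_ok=True makes repeated calls a no-op
    cache_dir = Path("~/.cache/xarray_tutorial_data").expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)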
xarray/tutorial.py (outdated):

    def download_to(url, path):
        with requests.get(url, stream=True) as r, path.open("wb") as f:
nit: I usually find it a little easier to read when you nest context managers:

    with requests.get(url, stream=True) as r:
        with path.open("wb") as f:
            ...

instead of the single line:

    with requests.get(url, stream=True) as r, path.open("wb") as f:
        ...
done, although that has the disadvantage of an additional indentation level and it suggests that the order in which the context managers are entered is important.
xarray/tutorial.py (outdated):

        return _open_rasterio(path, **kws)

    url = f"{github_url}/raw/{branch}/tests/data/{path.name}"
    # make sure the directory is deleted afterwards
I don't think this is currently done? Should this comment be removed, then?
It is: if cache_dir was created using TemporaryDirectory, using it as a context manager deletes the directory at the end of the block (if it is a pathlib object, we can't do I/O with it after the block). So I guess the comment is just not accurate?
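A small sketch of that behavior, assuming cache_dir is a tempfile.TemporaryDirectory (illustrative only, not the PR's code):

    import tempfile
    from pathlib import Path

    cache_dir = tempfile.TemporaryDirectory()
    with cache_dir:
        # I/O is fine inside the block
        path = Path(cache_dir.name) / "RGB.byte.tif"
        path.write_bytes(b"...")
    # leaving the block deletes the directory and everything in it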
xarray/tutorial.py (outdated):

    url = f"{github_url}/raw/{branch}/tests/data/{path.name}"
    # make sure the directory is deleted afterwards
    with cache_dir:
        download_to(url, path)
The reason why the MD5 hash is checked below is to guard against incomplete / failed downloads. This is actually a pretty common scenario, e.g., if a user hits Ctrl+C because downloading is taking too long (e.g., if they have a poor internet connection). So I think this would be important to include for rasterio as well.
That said, I'm not sure we really need to use MD5 verification, which is rather annoying to set up and requires downloading extra files. Instead, I would suggest ensuring that download_to always deletes the downloaded file if an exception is raised, e.g., based on https://stackoverflow.com/a/27045091/809705:
    import contextlib

    def download_to(url, path):
        try:
            ...  # current contents of `download_to`
        except Exception:
            # switch to path.unlink(missing_ok=True) for Python 3.8+
            with contextlib.suppress(FileNotFoundError):
                path.unlink()
            raise
This would probably work in about 90% of cases.
There may be a few other cases where a computer is shut down suddenly in the middle of writing a file. To handle these, it's probably a good idea to try to do an atomic write:
https://stackoverflow.com/a/2333979/809705
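Sketched, the pattern from that answer looks roughly like this (atomic_write is a hypothetical helper, not code from this PR):

    import os
    import tempfile

    def atomic_write(path, data):
        # write to a temporary file in the same directory, then rename;
        # os.replace is atomic on POSIX when both paths are on the same
        # filesystem, so readers never see a half-written file
        fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(path))
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp_name, path)
        except Exception:
            os.unlink(tmp_name)
            raise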
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rasterio doesn't provide precomputed checksums, so we'd have to go with the atomic write.

The idea from that answer (which is actually pretty common; rsync uses it, too) is to write to a temporary file in the same directory and to rename once the write was successful. The disadvantage is that if the program is forcibly interrupted (e.g. using SIGKILL), this will leave the temporary file behind, which will have to be cleaned up manually. Not sure if it's worth it, but we could warn if there are any stale files that don't have open file descriptors.

Edit: we could also make this deliberately not suitable for concurrency (and thus much less complex) by using a deterministic name. That way we would download to a different file (.<name>, maybe?) while making sure the download itself doesn't fail, and afterwards rename to the final name. If we try again after a failed attempt (i.e. the rename did not complete) we will redownload, overwriting .<name>.
I implemented the "atomic" write with a deterministic temporary name, which (as long as we properly detect something like disconnected sockets / broken pipes) should guard against incomplete writes and also avoid stale files. The only disadvantage I know of is that concurrent calls will interfere with each other.
My main concern here would be two invocations of tutorial.open_dataset starting at the same time (e.g., in different processes). Then it's possible that both could partially write to the same temporary file, which could get moved into place in an inconsistent state.

I think it's probably a better idea to create the temporary file with a random name (like the atomicwrites package does). In the worst case you end up with an extra file, but that's better than a bad state that requires manual intervention to resolve.

Another option is to vendor a copy of atomicwrites (e.g., as well as appdirs). It's a single-file project, so this would be relatively straightforward.
Sure. I changed it to use NamedTemporaryFile to generate a unique file within the same directory and to rename once it's done downloading. If the downloading succeeded, the rename will always work (not sure if that's the case on Windows, though).
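A rough sketch of that combination (download_to here is inferred from the description, not the PR's actual implementation; the chunk size is an assumption):

    import os
    import tempfile

    import requests

    def download_to(url, path):
        # stream into a uniquely named temporary file next to the target,
        # then atomically move it into place once the download succeeded;
        # on failure this can leave a stray temp file behind, which is the
        # trade-off discussed above
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with tempfile.NamedTemporaryFile(dir=path.parent, delete=False) as f:
                for chunk in r.iter_content(chunk_size=64 * 1024):
                    f.write(chunk)
        os.replace(f.name, path)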
When writing this I wanted to make sure we didn't force a … After looking into this a bit more, it seems we would probably need to have different cache folders for different platforms. So a currently released version of … How closely should we follow those specifications?

Edit: …
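To illustrate what "different cache folders for different platforms" means in practice, here is a simplified sketch of the usual conventions (roughly what appdirs computes; the Windows branch omits the vendor subdirectory appdirs would add):

    import os
    import sys
    from pathlib import Path

    def user_cache_dir(appname):
        if sys.platform == "win32":
            base = Path(os.environ["LOCALAPPDATA"])
        elif sys.platform == "darwin":
            base = Path("~/Library/Caches").expanduser()
        else:
            # POSIX: follow the XDG base directory specification
            base = Path(os.environ.get("XDG_CACHE_HOME", "~/.cache")).expanduser()
        return base / appname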
(force-pushed from 60a52be to b457c19)
See #2815 :)

thanks for pointing me to that issue, @dcherian

As mentioned in that issue, we could definitely vendor a copy of appdirs if desired. Or we could pick some simple heuristic. In practice I think …

I vendored …

Ping @andersy005 who has experience with pooch

I don't think it's important to preserve the GitHub URL and branch options. I suspect those options exist strictly for testing new datasets. I'm surprised that Pooch seems to be focused on data that is bundled in code repositories. It seems like that could easily lead to bloated repositories (though I guess xarray data is bigger than image data, for example).

I suspect we could make Pooch work OK by tagging releases in the xarray-data repository. Then we can manually update the version number inside xarray whenever we add or update a dataset.
I decided to combine tutorial.open_dataset and tutorial.open_rasterio because that results in less duplicated code, but right now open_dataset does not support engine="rasterio", so this might be confusing.
    external_urls = {
        "RGB.byte": (
            "rasterio",
            "https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif",
once we switch to versioned data (i.e. set version to a tag) we might want to do that for this URL, too. Then we should also be able to specify hash values and have pooch check them automatically.
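For illustration, a hypothetical pooch setup with a versioned base URL and a checksum registry (the tag, filename, and hash below are placeholders, not real xarray-data values):

    import pooch

    odie = pooch.create(
        path=pooch.os_cache("xarray_tutorial_data"),
        # "{version}" is substituted with the tag given below
        base_url="https://github.com/pydata/xarray-data/raw/{version}/",
        version="v0.0.1",  # placeholder tag
        registry={
            # pooch re-hashes every download and raises on a mismatch
            "RGB.byte.tif": "sha256:<placeholder-hash>",
        },
    )
    local_file = odie.fetch("RGB.byte.tif")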
* ``"air_temperature"`` | ||
* ``"rasm"`` | ||
* ``"ROMS_example"`` | ||
* ``"tiny"`` | ||
* ``"era5-2mt-2019-03-uk.grib"`` | ||
* ``"RGB.byte"``: example rasterio file from https://github.com/mapbox/rasterio |
not sure if Notes is the right section; this could also be part of the function's description. It would also be good to add a small description of the data.
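For context, this is how the names listed above are used (open_rasterio as added in this PR; "RGB.byte" additionally requires rasterio to be installed):

    import xarray as xr

    # downloaded on first use, then served from the local cache
    ds = xr.tutorial.open_dataset("air_temperature")
    rgb = xr.tutorial.open_rasterio("RGB.byte")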
(force-pushed from 87ce8b3 to b250b38)
LGTM. I don't think we have a strong need for data versioning here, as long as we don't remove existing datasets.
The main use-case for hashing was verifying that a file downloaded successfully even if the network connection was interrupted, but pooch writes to a temp file and then moves it into place, which is probably a sufficiently robust solution for that.
this avoids having to modify the global cache dir *and* we don't need _default_cache_dir. This only works on Linux, though; for Windows and macOS we would need something else.
data versioning might be useful if we ever decide to change datasets: once we do, the examples would change as well, so people might not be able to exactly reproduce them. Not sure how much of an issue that is, though. I fixed the failing tests, but the trick I used (point …
I concur. 👍🏽

Thank you for this refactor, @keewis!
To make building the documentation a bit faster (by 5-10 minutes for me), this adds an xr.tutorial.open_rasterio function with a signature and behavior that is almost identical to tutorial.open_dataset (when we replace the open_* functions with open_dataset(..., format="...") we should do that for the tutorial functions, too).

It uses requests.get instead of urllib.request.urlretrieve, so that would be a new dependency. I'm not sure if that's an issue since it's installed in the bare-minimum CI's environment.

The tutorial.open_dataset code could be rewritten to use the same structure, but I wanted to get feedback on open_rasterio first.

Edit: I also changed the default cache directory to ~/.cache/xarray_tutorial_data with a fallback to the old default if ~/.cache does not exist.

* isort -rc . && black . && mypy . && flake8
* whats-new.rst for all changes and api.rst for new API