
cache rasterio example files #4102

Merged: 58 commits into pydata:master on Mar 24, 2021
Conversation

@keewis (Collaborator) commented May 27, 2020

To make building the documentation a bit faster (by 5-10 minutes for me), this adds an xr.tutorial.open_rasterio function whose signature and behavior are almost identical to tutorial.open_dataset (when we replace the open_* functions with open_dataset(..., format="..."), we should do the same for the tutorial functions).

It uses requests.get instead of urllib.request.urlretrieve, so requests would become a new dependency. I'm not sure if that's an issue, since it's installed in the bare-minimum CI environment.

The tutorial.open_dataset code could be rewritten to use the same structure, but I wanted to get feedback on open_rasterio first.

Edit: I also changed the default cache directory to ~/.cache/xarray_tutorial_data, with a fallback to the old default if ~/.cache does not exist.
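For illustration, a minimal usage sketch of the proposed helper (a hypothetical call based on the description above; the final signature may differ):

import xarray as xr

# the first call downloads RGB.byte.tif from the rasterio repository and caches
# it in the cache directory; later calls reuse the cached copy
da = xr.tutorial.open_rasterio("RGB.byte")
print(da.sizes)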

  • Closes #3986 (building the visualization gallery is slow)
  • Tests added (but that's just a comparison between cached and uncached)
  • Passes isort -rc . && black . && mypy . && flake8
  • Fully documented, including whats-new.rst for all changes and api.rst for new API

@jthielen (Contributor) commented May 27, 2020

For these kinds of cached files, would it be worth using something like Pooch for xarray?

@keewis (Collaborator, Author) commented Jun 1, 2020

Using an existing library instead of writing our own caching / downloading functions does seem useful (the code there is much more sophisticated than anything in the tutorial module), but we'd be introducing a new dependency; I'm not sure how much of an issue that would be.

We could also make it an optional dependency, but that would mean that the tutorial module is not always available.

@dcherian (Contributor) left a comment:

Thanks @keewis. Just one minor question.

xarray/tutorial.py (outdated)
though the tests are only run on py38
@shoyer (Member) commented Jun 29, 2020

Edit: I also changed the default cache directory to ~/.cache/xarray_tutorial_data with a fallback to the old default if ~/.cache does not exist

I would suggest just creating the full directory tree, e.g., with os.makedirs(path, exist_ok=True). I don't think we want to let this be inconsistent.

xarray/tutorial.py (outdated)


def download_to(url, path):
    with requests.get(url, stream=True) as r, path.open("wb") as f:
Member:

nit: I usually find it a little easier to read when you nest context managers:

Suggested change
-with requests.get(url, stream=True) as r, path.open("wb") as f:
+with requests.get(url, stream=True) as r:
+    with path.open("wb") as f:
+        ...

Collaborator (Author):

Done, although that has the disadvantage of an additional indentation level, and it suggests that the order in which the context managers are entered is important.

return _open_rasterio(path, **kws)

url = f"{github_url}/raw/{branch}/tests/data/{path.name}"
# make sure the directory is deleted afterwards
Member:

I don't think this is currently done? Should this comment be removed, then?

@keewis (Collaborator, Author), Jun 29, 2020:

It is: if cache_dir was created using TemporaryDirectory, using it as a context manager deletes the directory at the end of the block (if it is a pathlib object, we can't do I/O with it after the block). So I guess the comment is just not accurate?

url = f"{github_url}/raw/{branch}/tests/data/{path.name}"
# make sure the directory is deleted afterwards
with cache_dir:
    download_to(url, path)
Member:

The reason why the MD5 hash is checked below is to guard against incomplete / failed downloads. This is actually a pretty common scenario, e.g., if a user hits Ctrl+C because downloading is taking too long (e.g., if they have a poor internet connection). So I think this would be important to include for rasterio as well.

That said, I'm not sure we really need to use MD5 verification, which is rather annoying to set up and requires downloading extra files. Instead, I would suggest ensuring that download_to always deletes the downloaded file if an exception is raised, e.g., based on https://stackoverflow.com/a/27045091/809705:

import contextlib


def download_to(url, path):
    try:
        ...  # current contents of `download_to`
    except Exception:
        # switch to path.unlink(missing_ok=True) for Python 3.8+
        with contextlib.suppress(FileNotFoundError):
            path.unlink()
        raise

This would probably work in about 90% of cases.

There may be a few other cases where a computer is shut down suddenly in the middle of writing a file. To handle this, it's probably a good idea to try to do an atomic write:
https://stackoverflow.com/a/2333979/809705
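
For illustration, a rough sketch of the cleanup-plus-atomic-rename approach being discussed (not the code from this PR; the helper name, chunk size, and error handling are assumptions):

import contextlib
import os
import tempfile

import requests


def download_to(url, path):
    # download into a sibling temporary file so the final rename stays on the
    # same filesystem and os.replace can move it into place atomically
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=f".{path.name}.")
    try:
        with os.fdopen(fd, "wb") as f, requests.get(url, stream=True) as r:
            r.raise_for_status()
            for chunk in r.iter_content(chunk_size=64 * 1024):
                f.write(chunk)
        os.replace(tmp_name, path)
    except BaseException:
        # also covers Ctrl+C: remove the partial download, then re-raise
        with contextlib.suppress(FileNotFoundError):
            os.remove(tmp_name)
        raise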

@keewis (Collaborator, Author), Jun 29, 2020:

rasterio doesn't provide precomputed checksums, so we'd have to go with the atomic write.

The idea from that answer (which is actually pretty common; rsync uses it, too) is to write to a temporary file in the same directory and to rename it once the write has succeeded.

The disadvantage is that if the program is forcibly interrupted (e.g. with SIGKILL), this will leave the temporary file behind, which will have to be cleaned up manually. Not sure if it's worth it, but we could warn if there are any stale files that don't have open file descriptors.

Edit: we could also make this deliberately not suitable for concurrency (and thus much less complex) by using a deterministic name. That way we would download to a different file (.<name>, maybe?) while making sure the download itself doesn't fail, and afterwards rename it to the final name. If we try again after a failed attempt (i.e. the rename did not complete), we redownload, overwriting .<name>.

Collaborator (Author):

I implemented the "atomic" write with a deterministic temporary name, which (as long as we properly detect something like disconnected sockets / broken pipes) should guard against incomplete writes and also avoid stale files. The only disadvantage I know of is that concurrent calls will interfere with each other.

Member:

My main concern here is two invocations of tutorial.open_dataset starting at the same time (e.g., in different processes). Then it's possible that both partially write to the same temporary file, which could get moved into place in an inconsistent state.

I think it's probably a better idea to create the temporary file with a random name (like the atomicwrites package does). In the worst case you end up with an extra file, but that's better than a bad state that requires manual intervention to resolve.

Another option is to vendor a copy of atomicwrites (and, e.g., appdirs as well). It's a single-file project, so this would be relatively straightforward.

Collaborator (Author):

Sure. I changed it to use NamedTemporaryFile to generate a unique file within the same directory and rename it once downloading is done. If the download succeeded, the rename will always work (not sure if that's the case on Windows, though).

@keewis (Collaborator, Author) commented Jun 29, 2020

I would suggest just creating the full directory tree, e.g., with os.makedirs(path, exist_ok=True). I don't think we want to let this be inconsistent.

When writing this I wanted to make sure we didn't force a ~/.cache folder on everyone, regardless of the platform.

After looking into this a bit more, it seems we would probably need to have different cache folders for different platforms:

  • on Windows, the path is given by APPDATA (or LOCALAPPDATA, I'm not sure; I can't find an official resource for this)
  • on macOS it is either ~/Library/Caches (or ~/Library/Application Support?)
  • on Unix (following the freedesktop basedir spec) it is indicated by XDG_CACHE_HOME, with a default of ~/.cache

So a currently released version of xarray will always put it into ~/.xarray_tutorial_data, and my changes will partially align with the freedesktop spec.

How closely should we follow those specifications?

Edit: appdirs solves that (but: it's a new dependency)
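
For illustration, a rough sketch of the per-platform heuristic described in the list above, in the spirit of appdirs.user_cache_dir (the fallbacks here are assumptions, not necessarily what appdirs or this PR does):

import os
import sys
from pathlib import Path


def user_cache_dir(appname):
    if sys.platform == "win32":
        base = os.environ.get("LOCALAPPDATA", os.path.expanduser("~"))
    elif sys.platform == "darwin":
        base = os.path.expanduser("~/Library/Caches")
    else:
        # freedesktop basedir spec: XDG_CACHE_HOME, defaulting to ~/.cache
        base = os.environ.get("XDG_CACHE_HOME", os.path.expanduser("~/.cache"))
    return Path(base) / appname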

@dcherian (Contributor):

Edit: appdirs solves that (but: it's a new dependency)

See #2815 :)

@keewis (Collaborator, Author) commented Jun 29, 2020

thanks for pointing me to that issue, @dcherian

@shoyer (Member) commented Jun 29, 2020

Edit: appdirs solves that (but: it's a new dependency)

See #2815 :)

As mentioned in that issue, we could definitely vendor a copy of appdirs if desired. Or we could pick some simple heuristic. In practice I think ~/.cache is fine. It's only standard on Linux, but I have a handful of directories there on my Mac laptop already.

@keewis (Collaborator, Author) commented Jun 30, 2020

I vendored appdirs.user_cache_dir, but I need help with mypy: the Windows code uses ctypes.windll, which is only available on Windows.
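
The usual pattern for this is a sys.platform guard, which mypy understands and uses to skip the Windows-only branch when checking on other platforms (a generic sketch, not necessarily how this PR resolved it; 28 is CSIDL_LOCAL_APPDATA):

import sys

if sys.platform == "win32":
    import ctypes

    # ctypes.windll only exists on Windows; because mypy understands the
    # sys.platform check above, it won't flag this branch on other platforms
    buf = ctypes.create_unicode_buffer(1024)
    ctypes.windll.shell32.SHGetFolderPathW(None, 28, None, 0, buf)
    local_appdata = buf.value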

@dcherian (Contributor):

Ping @andersy005, who has experience with pooch.

@shoyer (Member) commented Mar 20, 2021

I don't think it's important to preserve the GitHub URL and branch options. I suspect those options exist strictly for testing new datasets.

I'm surprised that Pooch seems to be focused on data that is bundled in code repositories. It seems like that could easily lead to bloated repositories (though I guess xarray data is bigger than image data, for example).

@shoyer (Member) commented Mar 20, 2021

I suspect we could make Pooch work OK by tagging releases in the xarray-data repository. Then we can manually update the version number inside xarray whenever we add or update a dataset.
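
A hypothetical sketch of what that could look like with pooch (the version tag, registry contents, and helper below are made up for illustration):

import pooch

TUTORIAL_DATA = pooch.create(
    path=pooch.os_cache("xarray_tutorial_data"),
    base_url="https://github.com/pydata/xarray-data/raw/{version}/",
    # bump this manually whenever a dataset is added or updated
    version="v2021.03.0",
    registry={
        # None skips verification; a "sha256:..." string would make pooch check it
        "air_temperature.nc": None,
    },
)

# fetch() downloads on first use, caches, and returns the local path
local_file = TUTORIAL_DATA.fetch("air_temperature.nc")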

@keewis (Collaborator, Author) left a comment:

I decided to combine tutorial.open_dataset and tutorial.open_rasterio because that results in less duplicated code, but right now open_dataset does not support engine="rasterio", so this might be confusing.

external_urls = {
    "RGB.byte": (
        "rasterio",
        "https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif",
Collaborator (Author):

Once we switch to versioned data (i.e. set version to a tag), we might want to do that for this URL, too. Then we should also be able to specify hash values and have pooch check them automatically.

Comment on lines +69 to +74
* ``"air_temperature"``
* ``"rasm"``
* ``"ROMS_example"``
* ``"tiny"``
* ``"era5-2mt-2019-03-uk.grib"``
* ``"RGB.byte"``: example rasterio file from https://github.com/mapbox/rasterio
Collaborator (Author):

Not sure if Notes is the right section; this could also be part of the function's description.

It would also be good to add a small description of the data.

@shoyer (Member) left a comment:

LGTM. I don't think we have a strong need for data versioning here, as long as we don't remove existing datasets.

The main use-case for hashing was verifying that a file downloaded successfully even if the network connection was interrupted, but pooch writes to a temp file and then moves it into place, which is probably a sufficiently robust solution for that.

this avoids having to modify the global cache dir *and* we don't need _default_cache_dir. This only works on Linux, though; for Windows and macOS we would need something else.
@keewis (Collaborator, Author) commented Mar 23, 2021

data versioning might be useful if we ever decide to change datasets: once we do, the examples would change as well, so people might not be able to exactly reproduce them. Not sure how much of an issue that is, though.

I fixed the failing tests, but the trick I used (pointing $XDG_CACHE_HOME at a temporary directory) only works on Linux.
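
For reference, a sketch of that kind of test setup (illustrative, not necessarily the fixture from this PR), using pytest's monkeypatch and tmp_path fixtures:

import pytest


@pytest.fixture(autouse=True)
def tmp_cache_dir(tmp_path, monkeypatch):
    # only honoured on Linux, where the cache lookup follows XDG_CACHE_HOME
    monkeypatch.setenv("XDG_CACHE_HOME", str(tmp_path))
    yield tmp_path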

@andersy005 (Member):

pooch writes to a temp file and then moves it into place, which is probably a sufficiently robust solution for that.

I concur. 👍🏽

@andersy005 (Member):

Thank you for this refactor, @keewis!

@andersy005 merged commit 8452120 into pydata:master on Mar 24, 2021
@keewis deleted the refactor-tutorial branch on March 24, 2021, 17:58
@keewis mentioned this pull request on Mar 24, 2021
Successfully merging this pull request may close these issues: building the visualization gallery is slow (#3986)

5 participants