cache rasterio example files #4102
Conversation
the exception class might not be the best choice, though.
For these kinds of cached files, would it be worth using something like Pooch for xarray?
Using an existing library for that instead of writing our own caching / downloading functions does seem useful (the code there seems to be much more sophisticated than anything in the …). We could also make it an optional dependency, but that would mean that the …
Thanks @keewis. Just one minor question,
though the tests are only run on py38
I would suggest just creating the full directory tree, e.g., with …
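For reference, a minimal sketch of that suggestion (the cache_dir name and location here are assumptions, not the PR's actual code):

    from pathlib import Path

    # hypothetical cache location; parents=True creates the whole tree,
    # exist_ok=True makes repeated calls a no-op
    cache_dir = Path("~/.cache/xarray_tutorial_data").expanduser()
    cache_dir.mkdir(parents=True, exist_ok=True)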
xarray/tutorial.py (outdated):

    def download_to(url, path):
        with requests.get(url, stream=True) as r, path.open("wb") as f:
nit: I usually find it a little easier to read when you nest context managers:

    with requests.get(url, stream=True) as r:
        with path.open("wb") as f:
            ...

instead of the single line:

    with requests.get(url, stream=True) as r, path.open("wb") as f:
        ...
done, although that has the disadvantage of an additional indentation level and it suggests that the order in which the context managers are entered is important.
xarray/tutorial.py (outdated):

        return _open_rasterio(path, **kws)

    url = f"{github_url}/raw/{branch}/tests/data/{path.name}"
    # make sure the directory is deleted afterwards
I don't think this is currently done? Should this comment be removed, then?
It is: if cache_dir was created using TemporaryDirectory, using it as a context manager deletes the directory at the end of the block (if it is a pathlib object, we can't do I/O with it after the block). So I guess the comment is just not accurate?
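A small sketch of that behavior, assuming cache_dir is a tempfile.TemporaryDirectory (illustrative only, not the PR's code):

    import tempfile
    from pathlib import Path

    cache_dir = tempfile.TemporaryDirectory()
    with cache_dir:
        # I/O is fine inside the block
        path = Path(cache_dir.name) / "RGB.byte.tif"
        path.write_bytes(b"...")
    # leaving the block deletes the directory and everything in it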
xarray/tutorial.py (outdated):

    url = f"{github_url}/raw/{branch}/tests/data/{path.name}"
    # make sure the directory is deleted afterwards
    with cache_dir:
        download_to(url, path)
The reason why the MD5 hash is checked below is to guard against incomplete / failed downloads. This is actually a pretty common scenario, e.g., if a user hits Ctrl+C because downloading is taking too long (e.g., if they have a poor internet connection). So I think this would be important to include for rasterio as well.
That said, I'm not sure we really need to use MD5 verification, which is rather annoying to set up and requires downloading extra files. Instead, I would suggest ensuring that download_to always deletes the downloaded file if an exception is raised, e.g., based on https://stackoverflow.com/a/27045091/809705:
    import contextlib

    def download_to(url, path):
        try:
            ...  # current contents of `download_to`
        except Exception:
            # switch to path.unlink(missing_ok=True) for Python 3.8+
            with contextlib.suppress(FileNotFoundError):
                path.unlink()
            raise
This would probably work in about 90% of cases.
There may be a few other cases where a computer is shut down suddenly in the middle of writing a file. To handle these, it's probably a good idea to try to do an atomic write:
https://stackoverflow.com/a/2333979/809705
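Sketched, the pattern from that answer looks roughly like this (atomic_write is a hypothetical helper, not code from this PR):

    import os
    import tempfile

    def atomic_write(path, data):
        # write to a temporary file in the same directory, then rename;
        # os.replace is atomic on POSIX when both paths are on the same
        # filesystem, so readers never see a half-written file
        fd, tmp_name = tempfile.mkstemp(dir=os.path.dirname(path))
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(data)
            os.replace(tmp_name, path)
        except Exception:
            os.unlink(tmp_name)
            raise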
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rasterio doesn't provide precomputed checksums, so we'd have to go with the atomic write.

The idea from that answer (which is actually pretty common; rsync uses it, too) is to write to a temporary file in the same directory and to rename once the write was successful. The disadvantage is that if the program is forcibly interrupted (e.g. using SIGKILL), this will leave the temporary file behind, which will have to be cleaned up manually. Not sure if it's worth it, but we could warn if there are any stale files that don't have open file descriptors.

Edit: we could also make this deliberately not suitable for concurrency (and thus much less complex) by using a deterministic name. That way we would download to a different file (.<name>, maybe?) while making sure the download itself doesn't fail, and afterwards rename to the final name. If we try again after a failed attempt (i.e. the rename did not complete) we will redownload, overwriting .<name>.
I implemented the "atomic" write with a deterministic temporary name, which (as long as we properly detect something like disconnected sockets / broken pipes) should guard against incomplete writes and also avoid stale files. The only disadvantage I know of is that concurrent calls will interfere with each other.
My main concern here would be two invocations of tutorial.open_dataset starting at the same time (e.g., in different processes). Then it's possible that both could partially write to the same temporary file, which could get moved into place in an inconsistent state.

I think it's probably a better idea to create the temporary file with a random name (like the atomicwrites package does). In the worst case you end up with an extra file, but that's better than a bad state that requires manual intervention to resolve.

Another option is to vendor a copy of atomicwrites (e.g., as well as appdirs). It's a single-file project, so this would be relatively straightforward.
Sure. I changed it to use NamedTemporaryFile to generate a unique file within the same directory and to rename once it's done downloading. If the downloading succeeded, the rename will always work (not sure if that's the case on Windows, though).
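A rough sketch of that combination (download_to here is inferred from the description, not the PR's actual implementation; the chunk size is an assumption):

    import os
    import tempfile

    import requests

    def download_to(url, path):
        # stream into a uniquely named temporary file next to the target,
        # then atomically move it into place once the download succeeded;
        # on failure this can leave a stray temp file behind, which is the
        # trade-off discussed above
        with requests.get(url, stream=True) as r:
            r.raise_for_status()
            with tempfile.NamedTemporaryFile(dir=path.parent, delete=False) as f:
                for chunk in r.iter_content(chunk_size=64 * 1024):
                    f.write(chunk)
        os.replace(f.name, path)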
When writing this I wanted to make sure we didn't force a … After looking into this a bit more, it seems we would probably need to have different cache folders for different platforms. So a currently released version of … How closely should we follow those specifications?

Edit: …
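To illustrate what "different cache folders for different platforms" means in practice, here is a simplified sketch of the usual conventions (roughly what appdirs computes; the Windows branch omits the vendor subdirectory appdirs would add):

    import os
    import sys
    from pathlib import Path

    def user_cache_dir(appname):
        if sys.platform == "win32":
            base = Path(os.environ["LOCALAPPDATA"])
        elif sys.platform == "darwin":
            base = Path("~/Library/Caches").expanduser()
        else:
            # POSIX: follow the XDG base directory specification
            base = Path(os.environ.get("XDG_CACHE_HOME", "~/.cache")).expanduser()
        return base / appname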
(force-pushed from 60a52be to b457c19)
See #2815 :)

thanks for pointing me to that issue, @dcherian

As mentioned in that issue, we could definitely vendor a copy of appdirs if desired. Or we could pick some simple heuristic. In practice I think …

I vendored …

Ping @andersy005 who has experience with pooch

I don't think it's important to preserve the GitHub URL and branch options. I suspect those options exist strictly for testing new datasets. I'm surprised that Pooch seems to be focused on data that is bundled in code repositories. It seems like that could easily lead to bloated repositories (though I guess xarray data is bigger than image data, for example).

I suspect we could make Pooch work OK by tagging releases in the xarray-data repository. Then we can manually update the version number inside xarray whenever we add or update a dataset.
I decided to combine tutorial.open_dataset and tutorial.open_rasterio because that results in less duplicated code, but right now open_dataset does not support engine="rasterio", so this might be confusing.
    external_urls = {
        "RGB.byte": (
            "rasterio",
            "https://github.com/mapbox/rasterio/raw/master/tests/data/RGB.byte.tif",
once we switch to versioned data (i.e. set version to a tag) we might want to do that for this URL, too. Then we should also be able to specify hash values and have pooch check them automatically.
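For illustration, a hypothetical pooch setup with a versioned base URL and a checksum registry (the tag, filename, and hash below are placeholders, not real xarray-data values):

    import pooch

    odie = pooch.create(
        path=pooch.os_cache("xarray_tutorial_data"),
        # "{version}" is substituted with the tag given below
        base_url="https://github.com/pydata/xarray-data/raw/{version}/",
        version="v0.0.1",  # placeholder tag
        registry={
            # pooch re-hashes every download and raises on a mismatch
            "RGB.byte.tif": "sha256:<placeholder-hash>",
        },
    )
    local_file = odie.fetch("RGB.byte.tif")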
* ``"air_temperature"`` | ||
* ``"rasm"`` | ||
* ``"ROMS_example"`` | ||
* ``"tiny"`` | ||
* ``"era5-2mt-2019-03-uk.grib"`` | ||
* ``"RGB.byte"``: example rasterio file from https://github.com/mapbox/rasterio |
not sure if Notes is the right section; this could also be part of the function's description. It would also be good to add a small description of the data.
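For context, this is how the names listed above are used (open_rasterio as added in this PR; "RGB.byte" additionally requires rasterio to be installed):

    import xarray as xr

    # downloaded on first use, then served from the local cache
    ds = xr.tutorial.open_dataset("air_temperature")
    rgb = xr.tutorial.open_rasterio("RGB.byte")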
(force-pushed from 87ce8b3 to b250b38)
LGTM. I don't think we have a strong need for data versioning here, as long as we don't remove existing datasets.
The main use-case for hashing was verifying that a file downloaded successfully even if the network connection was interrupted, but pooch writes to a temp file and then moves it into place, which is probably a sufficiently robust solution for that.
this avoids having to modify the global cache dir *and* we don't need _default_cache_dir. This only works on Linux, though; for Windows and macOS we would need something else.
data versioning might be useful if we ever decide to change datasets: once we do, the examples would change as well, so people might not be able to exactly reproduce them. Not sure how much of an issue that is, though. I fixed the failing tests, but the trick I used (point …
I concur. 👍🏽

Thank you for this refactor, @keewis!
To make building the documentation a bit faster (by 5-10 minutes for me), this adds an xr.tutorial.open_rasterio function with a signature and behavior that is almost identical to tutorial.open_dataset (when we replace the open_* functions with open_dataset(..., format="...") we should do that for the tutorial functions, too).

It uses requests.get instead of urllib.request.urlretrieve, so that would be a new dependency. I'm not sure if that's an issue since it's installed in the bare-minimum CI's environment.

The tutorial.open_dataset code could be rewritten to use the same structure, but I wanted to get feedback on open_rasterio first.

Edit: I also changed the default cache directory to ~/.cache/xarray_tutorial_data with a fallback to the old default if ~/.cache does not exist.

* isort -rc . && black . && mypy . && flake8
* whats-new.rst for all changes and api.rst for new API