Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
cache rasterio example files #4102
cache rasterio example files #4102
Changes from 10 commits
a8b0022
3997c27
b45e28d
e972d84
7b5e6d5
e435103
75dac85
8fe0c92
783baeb
d6623ff
f834883
5ddadab
21f090d
068b660
c5d1490
b457c19
c9faac1
f16a15f
f9abf31
c276876
fdb9b11
427b4cf
bd2cafc
9d4bce3
4297920
9fde5d6
adfce27
ea9d4dc
d828720
745de23
c1229ae
29b78ed
8d04bba
dd08972
77e2487
e8b4a00
f730a85
39e7d77
9134d7a
fa89822
56d7e49
b848c66
dab96de
405a90e
cc6cade
817a6ba
0b60eb0
d82b816
719765c
0a489c4
0335473
cc3eb3c
04e15fb
6f02db4
653e5fd
b250b38
78c11f5
b532879
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I usually find it a little easier to read when you nest context managers:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done, although that has the disadvantage of an additional indentation level and it suggests that the order in which the context managers are entered is important.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is currently done? should this comment be removed then?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is, if
cache_dir
was created usingTemporaryDirectory
, using it as a context manager deletes the directory at the end of the block (if it is apathlib
object, we can't do I/O with it after the block). So I guess the comment is just not accurate?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like I'm missing something -- what does using a pathlib.Path object as a context manager do?
I don't think it automatically deletes the files after using it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no, of course not.
cache_dir
can be either apathlib.Path
instance or aTemporaryDirectory
instance. Using aPath
object as a context manager will "close" it so you can't use it for I/O anymore (unlink
,read_text
, etc. won't work anymore) while aTemporaryDirectory
instance will delete the temporary directory it created.Not sure if there's a way to make that easier to read.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason why the MD5 hash is checked below is to guard against incomplete / failed downloads. This is actually a pretty common scenario, e.g., if a user hits Ctrl+C because downloading is taking too long (e.g., if they have a poor internet connection). So I think this would be important to include for rasterio as well.
That said, I'm not sure we really need to use MD5 verification, which is rather annoying to setup and requires downloading extra files. Instead, I would suggest ensuring that
download_to
always deletes the downloaded file if an exception is raised, e.g., based on https://stackoverflow.com/a/27045091/809705:This would probably work in about 90% of cases.
There may be a few others where a computer is shut down suddenly in the middle to writing a file. To handle this, it's probably a good idea to try to do an atomic write:
https://stackoverflow.com/a/2333979/809705
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rasterio
doesn't provide precomputed checksums so we'd have to go with the atomic write.The idea from that answer (which is actually pretty common,
rsync
uses that, too) is to write to a temporary file in the same directory and to rename once the write was successful.The disadvantage is that if the program was forcibly interrupted (e.g. using
SIGKILL
) this will leave the temporary file behind which will have to be cleaned up manually. Not sure if it's worth it, but we could warn if there are any stale files that don't have open file descriptors.Edit: we could also make this deliberately not suitable for concurrency (and thus much less complex) by using a deterministic name. That way we would download to a different file (
.<name>
, maybe?) while making sure the download itself doesn't fail and afterwards rename to the final name. If we try again after a failed attempt (i.e. the rename did not complete) we will redownload, overwriting.<name>
.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I implemented the "atomic" write with a deterministic temporary name, which (as long as we properly detect something like disconnected sockets / broken pipes) should guard against incomplete writes and also avoid stale files. The only disadvantage I know of is that concurrent calls will interfere with each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main concern here would that if two invocations of
tutorial.open_dataset
start at the same time (e.g., in different processes). Then it's possible that both could partially write to the same temporary file, which could get moved into place with an inconsistent state.I think it's probably a better idea to make the temporary file with a random name (like the atomicwrites package). In the worst case you end up with an extra file, but that's better than an bad state that requires manual intervention to resolve.
Another option is to vendor a copy of atomicwrites (e.g., as well as appdirs). It's a single file project so this would be relatively straightforward.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sure. I changed it to use
NamedTemporaryFile
to generate a unique file within the same directory and rename once it's done downloading. If the downloading succeeded, the rename will always work (not sure if that's the case on windows, though)