Reuse previously downloaded files in download-data.sh #301
If the previous release folder exists, check the MD5 of files in that folder against the new md5 manifest. If the old files match, hard link them into the new folder. Then proceed to download any remaining files not present. Also, don't redownload reference files if present. (They should probably be checked as well, but are not currently in the manifest.) Final md5 check is still performed for all files.
Also, check symlinks at the end.
Thanks for doing this @jashapiro! I ran this on Ubuntu just now, both with and without the previous release present, and it behaved as expected/described.
I've got this:
For what it's worth, the only downside here is that it does download every file again, which is already the current behavior. Edit: OS X Catalina
@jashapiro testing on my macbook now and will let you know
The md5sum that I have installed does not support |
I am on the vardict MAF hump right now (I last had v8 data)... still going.
It doesn't seem to be necessary, as we will get a non-zero return code anyway if any files changed.
I updated the download script using your latest edits, @jashapiro, but I still have mismatches. Should I perform the steps @cgreene recommends?
MAC OSX MOJAVE
Those seem likely to be download failures unrelated to the changes in the script. Delete those files and try running it again?
That worked, all good now!
Multiple folks have tested this and I have no additional concerns at the moment 👍
Purpose/implementation Section
What scientific question is your analysis addressing?
None. But downloads of new releases can be slow, so let's make that quicker when the underlying data hasn't changed.
What was your approach?
After downloading the new file list/md5 hashes, check those against the previous release files (if the immediately previous release exists). If the files are found and are unchanged as determined by md5, then hard link them into the new release folder. Then continue to download any missing files.
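The linking step described above could be sketched roughly as follows. This is a minimal illustration, not the code in download-data.sh: the function name `link_unchanged_files`, its three arguments, and the assumption that the manifest uses the usual `md5sum` layout (`<hash>  <filename>` per line) are all hypothetical, and it assumes an `md5sum` command is available.

```shell
# Sketch only: the function name, arguments, and manifest layout are assumptions.
# Usage: link_unchanged_files MANIFEST PREVIOUS_DIR NEW_DIR
link_unchanged_files() {
  manifest="$1"
  previous_dir="$2"
  new_dir="$3"
  mkdir -p "$new_dir"
  # Each manifest line is assumed to look like "<md5>  <filename>".
  while read -r expected file; do
    # Skip blank or malformed lines.
    if [ -z "$file" ]; then
      continue
    fi
    # Drop the "*" that md5sum prepends to filenames in binary mode.
    file="${file#\*}"
    # Only files present in the previous release can be reused.
    if [ ! -f "$previous_dir/$file" ]; then
      continue
    fi
    # Leave anything already present in the new release folder alone.
    if [ -e "$new_dir/$file" ]; then
      continue
    fi
    actual="$(md5sum "$previous_dir/$file" | cut -d ' ' -f 1)"
    # Hard link only when the old file matches the new manifest.
    if [ "$actual" = "$expected" ]; then
      ln "$previous_dir/$file" "$new_dir/$file"
    fi
  done < "$manifest"
}
```

Files that are missing from the previous release, or whose hashes differ, are simply left out of the new folder, so a later download step can fetch them. Hard links require both folders to be on the same filesystem.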
What GitHub issue does your pull request address?
Closes #233
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
This is a significant change to the download script, so it should be tested pretty extensively. It seems to work for me on macOS, but it would be good to test other systems too, along with combinations of present and absent files that I might not have seen.
Note that I removed `-z` time checks in favor of simple presence checks, as I couldn't think of an instance where a file might exist and be incorrect if the md5 had just been checked. If there is a reason to keep the `-z` check, I suppose an option would be to `touch` all the hard linked files to set their modification dates as if they had just been downloaded, rather than skipping them outright.

I also added skipping redownload of reference files. These are currently unchecked by md5, but downloading them repeatedly seemed wasteful, as they should never change.
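If the `touch` alternative mentioned above were ever needed, it might look something like the sketch below. The function name `touch_linked_files` and its single directory argument are hypothetical, and nothing like this is part of the current script. Note that a hard link shares its inode with the original, so touching the linked file also updates the modification time seen in the previous release folder.

```shell
# Sketch only: touch_linked_files and its argument are hypothetical.
# Usage: touch_linked_files RELEASE_DIR
touch_linked_files() {
  # Regular files with more than one link are the ones that were hard linked;
  # set their modification time to now, as if they had just been downloaded.
  find "$1" -type f -links +1 -exec touch {} +
}
```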