This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Reuse previously downloaded files in download-data.sh #301

Merged


jashapiro
Member

Purpose/implementation Section

What scientific question is your analysis addressing?

None. But downloads of new releases can be slow, so let's make that quicker when the underlying data hasn't changed.

What was your approach?

After downloading the new file list/md5 hashes, check those against the previous release files (if the immediately previous release exists). If the files are found and are unchanged as determined by md5, then hard link them into the new release folder. Then continue to download any missing files.
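A minimal sketch of the linking step described above, not the code from this PR. The function name `link_unchanged` and its arguments are hypothetical; it assumes a GNU-style `md5sum`, a flat manifest named `md5sum.txt` with `<hash>  <filename>` lines, and that both release folders live on the same filesystem (a requirement for hard links).

```shell
# Hypothetical helper: hard link files from a previous release into the new
# release folder when their md5 matches the new manifest.
# Assumes GNU-style `md5sum` and "<hash>  <filename>" manifest lines.
link_unchanged() {
  prev_dir=$1   # previous release folder (assumed layout)
  new_dir=$2    # new release folder, already containing md5sum.txt

  # Nothing to reuse if the previous release is absent.
  [ -d "$prev_dir" ] || return 0

  while read -r hash name; do
    name=${name#\*}   # drop the "*" that md5sum writes in binary mode
    old="$prev_dir/$name"
    if [ -n "$name" ] && [ -f "$old" ] && [ ! -e "$new_dir/$name" ]; then
      actual=$(md5sum "$old" | cut -d ' ' -f 1)
      if [ "$actual" = "$hash" ]; then
        # Unchanged file: link it instead of downloading it again.
        ln "$old" "$new_dir/$name"
      fi
    fi
  done < "$new_dir/md5sum.txt"
}
```

Called as `link_unchanged data/release-v10-20191115 data/release-v11-20191126` (paths illustrative), this leaves changed or missing files absent from the new folder so the regular download step can fetch them.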

What GitHub issue does your pull request address?

Closes #233

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

This is a significant change to the download script, so it should be tested pretty extensively. It seems to work for me on macOS, but it would be good to test on other systems too, along with combinations of present and absent files that I might not have tried.

Note that I removed -z time checks in favor of simple presence checks, as I couldn't think of an instance where a file might exist and be incorrect if the md5 had just been checked. If there is a reason to keep the -z check, I suppose an option would be to touch all the hard linked files to set their modification dates as if they had just been downloaded, rather than skipping them outright.

I also added skipping redownload of reference files. These are currently unchecked by md5, but downloading them repeatedly seemed wasteful, as they should never change.
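To illustrate the presence check, here is a rough sketch rather than the script's actual code. The function name `fetch_if_missing` is hypothetical, and using `curl` as the downloader is an assumption.

```shell
# Hypothetical helper: download a file only if it is not already present.
# Assumes `curl` is the downloader; the URL and destination are supplied
# by the caller.
fetch_if_missing() {
  url=$1
  dest=$2
  if [ -e "$dest" ]; then
    # Present already (for example, hard linked from the previous release).
    echo "Skipping $dest (already present)"
    return 0
  fi
  curl --fail --location --output "$dest" "$url"
}
```

A plain existence test is only safe here because every file in the manifest still goes through the final md5 check; reference files that are not in the manifest get no such protection.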

yuankunzhu and others added 8 commits November 26, 2019 13:35
If the previous release folder exists, check the MD5 of files in that folder against the new md5 manifest. If the old files match, hard link them into the new folder. Then proceed to download any remaining files not present.

Also, don't redownload reference files if present. (They should probably be checked as well, but are not currently in the manifest.)

Final md5 check is still performed for all files.
Also, check symlinks at the end.
@jaclyn-taroni
Member

Thanks for doing this @jashapiro! I ran this on Ubuntu just now, both with and without the previous release present, and it behaved as expected/described.

@cgreene
Collaborator

cgreene commented Dec 2, 2019

I've got this:

Caseys-MacBook-Pro-2:release-v10-20191115 cgreene$ md5sum -c ../$RELEASE/md5sum.txt --ignore-missing
Error: --check <filename> cannot be used with additional files
Caseys-MacBook-Pro-2:release-v10-20191115 cgreene$ echo $PREVIOUS
release-v10-20191115
Caseys-MacBook-Pro-2:release-v10-20191115 cgreene$ echo $RELEASE
release-v11-20191126
Caseys-MacBook-Pro-2:release-v10-20191115 cgreene$

For what it's worth, the only downside here is that it does download every file again, which is already the current behavior.

Edit: OS X Catalina

@jharenza
Collaborator

jharenza commented Dec 2, 2019

@jashapiro testing on my macbook now and will let you know

@cgreene
Collaborator

cgreene commented Dec 2, 2019

The md5sum that I have installed does not support --ignore-missing, so that's what the error is referring to.

@cgreene
Collaborator

cgreene commented Dec 2, 2019

@jharenza : I suspect you'll need to do the steps from #302 to get things working if you used our previous brew instructions to get md5sum

@jashapiro
Member Author

> @jharenza : I suspect you'll need to do the steps from #302 to get things working if you used our previous brew instructions to get md5sum

I may actually remove the --ignore-missing flag, as I have to catch the non-zero return code anyway. I think it will work the same without it.
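One way this could look without the flag, as a sketch under assumptions rather than the change actually made: run the check from inside the previous release folder, tolerate the non-zero exit status, and keep only the names reported as `OK`. The function name `list_unchanged` is hypothetical, and the `name: OK` line format assumes GNU coreutils `md5sum`.

```shell
# Hypothetical helper: print the files in a previous release folder that
# still match the new manifest, without relying on --ignore-missing.
# Assumes GNU coreutils md5sum, which reports "<name>: OK" per matching file.
list_unchanged() {
  prev_dir=$1
  manifest=$2   # absolute path, or a path relative to prev_dir

  # md5sum exits non-zero when any file is missing or changed, so the
  # status is deliberately ignored; warnings on stderr are discarded.
  ( cd "$prev_dir" && LC_ALL=C md5sum -c "$manifest" 2>/dev/null || true ) |
    sed -n 's/: OK$//p'
}
```

Missing files show up as failures rather than being skipped, but since only the `OK` lines are used, the result is the same list of reusable files.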

@jharenza
Collaborator

jharenza commented Dec 2, 2019

I am on the vardict MAF hump right now (I last had v8 data)... still going.

It doesn't seem to be necessary, as we will get a non-zero return code anyway if any files changed.
@jharenza
Collaborator

jharenza commented Dec 2, 2019

I updated the download script using your latest edits, @jashapiro but still have mismatches. Should I perform the steps @cgreene recommends?

Checking MD5 hashes...
pbta-histologies.tsv: OK
WGS.hg38.lancet.300bp_padded.bed: OK
WGS.hg38.lancet.unpadded.bed: OK
WGS.hg38.mutect2.unpadded.bed: OK
WGS.hg38.strelka2.unpadded.bed: OK
WGS.hg38.vardict.100bp_padded.bed: OK
WXS.hg38.100bp_padded.bed: OK
StrexomeLite_hg38_liftover_100bp_padded.bed: OK
StrexomeLite_Targets_CrossMap_hg38_filtered_chr_prefixed.bed: OK
pbta-snv-mutect2.vep.maf.gz: OK
pbta-snv-strelka2.vep.maf.gz: FAILED
pbta-snv-lancet.vep.maf.gz: OK
pbta-snv-vardict.vep.maf.gz: OK
pbta-cnv-cnvkit.seg.gz: OK
pbta-cnv-controlfreec.tsv.gz: OK
pbta-sv-manta.tsv.gz: OK
pbta-gene-expression-kallisto.polya.rds: OK
pbta-gene-expression-kallisto.stranded.rds: OK
pbta-gene-counts-rsem-expected_count.polya.rds: OK
pbta-gene-counts-rsem-expected_count.stranded.rds: OK
pbta-gene-expression-rsem-fpkm.polya.rds: OK
pbta-gene-expression-rsem-fpkm.stranded.rds: OK
pbta-isoform-counts-rsem-expected_count.polya.rds: OK
pbta-isoform-counts-rsem-expected_count.stranded.rds: OK
pbta-fusion-arriba.tsv.gz: OK
pbta-fusion-starfusion.tsv.gz: OK
independent-specimens.wgs.primary-plus.tsv: OK
independent-specimens.wgs.primary.tsv: OK
independent-specimens.wgswxs.primary-plus.tsv: OK
independent-specimens.wgswxs.primary.tsv: OK
pbta-gene-expression-rsem-tpm.polya.rds: OK
pbta-gene-expression-rsem-tpm.stranded.rds: OK
pbta-isoform-expression-rsem-tpm.polya.rds: OK
pbta-isoform-expression-rsem-tpm.stranded.rds: OK
pbta-fusion-putative-oncogenic.tsv: OK
pbta-gene-expression-rsem-fpkm-collapsed.polya.rds: OK
pbta-gene-expression-rsem-fpkm-collapsed.stranded.rds: FAILED
pbta-snv-consensus-mutation-tmb.tsv: OK
pbta-snv-consensus-mutation.maf.tsv.gz: OK
md5sum: WARNING: 2 of 39 computed checksums did NOT match

MAC OSX MOJAVE

@jashapiro
Member Author

> I updated the download script using your latest edits, @jashapiro but still have mismatches

Those seem like they are likely download failures unrelated to the changes in the script. Delete those files and try running it again?
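If it helps, the cleanup could be scripted along these lines. This is only a sketch: the function name `remove_failed` is hypothetical, and it assumes GNU coreutils `md5sum` output (`<name>: FAILED` for a checksum mismatch) and a manifest named `md5sum.txt` inside the release folder.

```shell
# Hypothetical helper: delete files whose checksum does not match the
# manifest so that the next run of the download script fetches them again.
# Assumes GNU coreutils md5sum and a manifest named md5sum.txt in the folder.
remove_failed() {
  release_dir=$1
  ( cd "$release_dir" && LC_ALL=C md5sum -c md5sum.txt 2>/dev/null || true ) |
    sed -n 's/: FAILED$//p' |
    while IFS= read -r name; do
      echo "Removing $name"
      rm -f "$release_dir/$name"
    done
}
```

The pattern matches only checksum mismatches; files that are simply missing are reported as `FAILED open or read` and need no deletion.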

@jharenza
Collaborator

jharenza commented Dec 2, 2019

that worked, all good now!

@jashapiro jashapiro marked this pull request as ready for review December 2, 2019 23:31
Member

@jaclyn-taroni jaclyn-taroni left a comment


Multiple folks have tested this and I have no additional concerns at the moment 👍

@jaclyn-taroni jaclyn-taroni merged commit 956e569 into AlexsLemonade:master Dec 3, 2019
@jashapiro jashapiro deleted the jashapiro/download-faster branch April 11, 2021 18:53
Development

Successfully merging this pull request may close these issues.

Update download script to skip unchanged files in new releases
5 participants