
Source tarballs are from now on archived #2194

Closed
bgruening opened this issue Aug 21, 2016 · 7 comments
@bgruening (Member)

@bioconda/all from now on, we will mirror all source tarballs via The Cargo Port project: https://depot.galaxyproject.org/software/

The mechanism is largely implemented in galaxyproject/cargo-port#93 and boils down to the following steps (a rough sketch follows the list):

  • do a git checkout of bioconda-recipes
  • look for meta.yaml files that are new within a fixed time window
  • extract the relevant url and git_url entries
  • download and store the tarballs
  • the Galaxy Jenkins server runs this every day
  • all code is part of cargo-port
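
For illustration only, a stripped-down pass over a checkout might look like the sketch below; it assumes meta.yaml files that parse as plain YAML (no Jinja templating) and a simplified depot layout. The real implementation lives in cargo-port.

```python
import glob
import hashlib
import os
import urllib.request

import yaml  # PyYAML

RECIPES_DIR = "bioconda-recipes/recipes"  # local git checkout
DEPOT_DIR = "depot/software"              # mirror target

def mirror_tarballs():
    for meta_path in glob.glob(os.path.join(RECIPES_DIR, "*", "meta.yaml")):
        with open(meta_path) as fh:
            meta = yaml.safe_load(fh) or {}
        source = meta.get("source") or {}
        url = source.get("url")
        if not url:
            # git_url sources need a separate clone-and-tar step
            continue
        dest = os.path.join(DEPOT_DIR, meta["package"]["name"],
                            os.path.basename(url))
        os.makedirs(os.path.dirname(dest), exist_ok=True)
        urllib.request.urlretrieve(url, dest)
        # verify the recipe's checksum, if one is given
        expected = source.get("sha256")
        if expected:
            with open(dest, "rb") as fh:
                actual = hashlib.sha256(fh.read()).hexdigest()
            assert actual == expected, "checksum mismatch for %s" % url

if __name__ == "__main__":
    mirror_tarballs()
```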

So whenever your tarball disappears and you would like to rebuild a recipe, go to Cargo Port, get the new (archived) URL, and update the recipe with it.
@daler this will not make bioarchive obsolete, but we should be much safer from now on, even without using bioarchive and for all of your packages. We could even think about automatically including a second URL in every recipe that points to The Cargo Port as a fallback, as illustrated below.
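
For illustration, such a fallback could look like the following in a recipe. The package name, URLs, and the depot filename pattern are hypothetical, and listing several URLs as mirrors assumes a conda-build version that supports falling back down a URL list:

```yaml
source:
  url:
    # original upstream URL, tried first
    - https://example.org/downloads/mypkg-1.0.tar.gz
    # Cargo Port mirror as fallback (hypothetical depot path)
    - https://depot.galaxyproject.org/software/mypkg/mypkg_1.0_src_all.tar.gz
  # same tarball behind both URLs, so one checksum covers both
  sha256: <sha256 of the tarball>
```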

Remarks:

  • I consider the use of git_url bad practice. It is a resource hog for GitHub and it makes these packages more painful to mirror: we essentially need to create tarballs from the checkout and store them, but cannot give you a checksum to verify what we did. For this reason I would encourage everyone to use url wherever possible, especially for GitHub repositories (see the example after this list).
  • We do check checksums during the mirror step, if they exist, and we should enforce them more strictly, imho.
  • We haven't mirrored patches and such, as we consider them part of the bioconda repo. We also saw a few recipes downloading sources in build.sh; this is not covered either and should be avoided, imho.
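
To make the url-over-git_url point concrete, a GitHub source can usually be pinned to an archive tarball instead of a clone (the names below are made up):

```yaml
source:
  # preferred: a fixed tarball with a verifiable checksum
  url: https://github.com/someorg/sometool/archive/v1.0.tar.gz
  sha256: <sha256 of the tarball>

# discouraged: a clone cannot carry a tarball checksum
# and is more work to mirror
# source:
#   git_url: https://github.com/someorg/sometool.git
#   git_rev: v1.0
```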

Thanks to @jxtx, @nekrut, @jgoecks and the Galaxy team for sponsoring the archive space, and thanks to @erasche for his help getting this working and for the weekend hack.

Sustainable conda packages ftw.

@kyleabeauchamp (Contributor)

Great work!

@johanneskoester (Contributor)

Björn, this is awesome, thank you! I would vote for automatically adding the Cargo Port URL to the package after the source has been backed up.

daler added a commit that referenced this issue Aug 22, 2016
@daler (Member) commented Aug 22, 2016

This is fantastic. @jxtx, @nekrut, @jgoecks, much thanks for the storage space.

Agreed, Cargo Port URLs should definitely be added automatically. We should probably append them after the original URL so we have a record of where Cargo Port got the tarball in the first place.

@bgruening: if a meta.yaml updates source/md5/sha256 and the build number is bumped (but not the version number), is the Cargo Port URL updated? Hopefully this won't happen much, but I'm curious to know how it's handled.

daler added a commit that referenced this issue Aug 22, 2016
@bgruening (Member, Author)

@daler this is currently not covered. It's mainly a question of how the storage is organised. Do you have an idea how to structure the depot? I guess you would need to put the hash in the directory structure or something like that. But then again, what happens with packages without a checksum?

@daler I think this would fit perfectly into bioconda-utils :)

@daler (Member) commented Aug 22, 2016

@bgruening If Cargo Port were a conda-specific depot, then using build numbers in the directory structure would work. Otherwise, updating the Cargo Port tarball upon a checksum change in meta.yaml would delete the originally stored tarball. Bad news for reproducibility. So I agree, including the checksum in the directory (or maybe just in the basename) would help. But maybe we just need to be careful on the bioconda side to enforce what we can and avoid these kinds of issues.
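
One possible scheme, sketched below, keys the depot path on the checksum so that a re-upload for the same name and version never overwrites an earlier tarball (the layout is purely hypothetical):

```python
import os

def depot_path(name, version, sha256, filename):
    # A checksum prefix in the directory lets two different tarballs
    # for the same name/version coexist instead of overwriting each other.
    return os.path.join("software", name,
                        "%s_%s" % (version, sha256[:12]), filename)

# e.g. software/mypkg/1.0_6b86b273ff34/mypkg-1.0.tar.gz
print(depot_path("mypkg", "1.0", "6b86b273ff34" + "0" * 52,
                 "mypkg-1.0.tar.gz"))
```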

For example we should definitely enforce checksums on all recipes. Any cases you know of where that would not be possible?

Sounds like we need a linter module in bioconda-utils . . .
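
As a sketch of what such a lint could look like (the function name is made up, and it assumes meta.yaml files that parse as plain YAML without Jinja templating):

```python
import glob
import os

import yaml  # PyYAML

def lint_missing_checksums(recipes_dir):
    """Return meta.yaml paths whose url source has no md5/sha256."""
    problems = []
    for meta_path in glob.glob(os.path.join(recipes_dir, "*", "meta.yaml")):
        with open(meta_path) as fh:
            meta = yaml.safe_load(fh) or {}
        source = meta.get("source") or {}
        # A git_url source cannot carry a tarball checksum at all,
        # which is one more argument for plain url sources.
        if "url" in source and not ("md5" in source or "sha256" in source):
            problems.append(meta_path)
    return problems

if __name__ == "__main__":
    for path in lint_missing_checksums("bioconda-recipes/recipes"):
        print("missing checksum:", path)
```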

@bgruening (Member, Author)

We should definitely be more strict about using checksums; as mentioned above, I'm all in. Please add it to this list: #1860

> For example we should definitely enforce checksums on all recipes. Any cases you know of where that would not be possible?

All GitHub/Bitbucket archive URLs. Even worse, if you calculate the checksum from one of these, it can change in the future for the same commit. I think they create the archives on the fly, and as soon as they change the underlying zlib library, the checksum will change.

> Sounds like we need a linter module in bioconda-utils . . .

Indeed, please see #1860 and also have a look at the conda-forge smithy tool. They do a lot of this linting already.

Thanks @daler!

@bgruening (Member, Author)

More than 50 tarballs are already mirrored: https://depot.galaxyproject.org/software/

I will close this issue now! Please remember that we have this feature: we should check it after restructuring the repo and similar changes, in case we break it.
