Louwrensth
Contributor

By using gzip --no-name we leave the original filename and timestamp out of the gzip header, so the checksum of the resulting tarball should be constant.
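To illustrate what gzip --no-name changes, here is a minimal Python sketch (not the code in this PR) using the standard gzip module. The helper name gzip_bytes is made up for this example, and passing an empty filename with mtime=0 is assumed to be the closest standard-library equivalent of --no-name.

```python
import gzip
import io


def gzip_bytes(data, filename="", mtime=None):
    """Compress data in memory; filename and mtime are stored in the gzip header."""
    buf = io.BytesIO()
    with gzip.GzipFile(filename=filename, mode="wb", fileobj=buf, mtime=mtime) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"same tar bytes every time"

# A different timestamp in the header gives different bytes, hence a different checksum.
stamped_a = gzip_bytes(payload, filename="src.tar", mtime=1547800000)
stamped_b = gzip_bytes(payload, filename="src.tar", mtime=1547900000)

# With no name and a zero timestamp, the output no longer depends on when it was created.
plain_a = gzip_bytes(payload, mtime=0)
plain_b = gzip_bytes(payload, mtime=0)
```

Bytes 4 to 7 of a gzip stream hold the timestamp and bit 3 of the flag byte signals a stored filename, which is why only the "plain" variants are byte-identical here.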

@boegel boegel added the change label Jan 18, 2019
@boegel boegel added this to the 3.8.1 milestone Jan 18, 2019
@boegel
Member

boegel commented Jan 18, 2019

@Louwrensth Thanks a lot for your contribution!

I guess this change can help with getting the exact same source tarballs, but I'm not sure it's sufficient to ensure that checksums are always the same. I suspect that different versions of tar or gzip are already enough to produce a different checksum. See also https://stackoverflow.com/questions/52668432/tar-package-has-different-checksum-for-exactly-the-same-content for example.

Also, is there a version constraint on the gzip version that supports this option?

@mboisson Thoughts on this?

@Louwrensth
Contributor Author

@boegel You are right, this PR doesn't cover everything. I found this list of known issues with creating reproducible tarballs, which seems useful: https://wiki.debian.org/ReproducibleBuilds/Howto#Identified_problems.2C_and_possible_solutions

I'm willing to make a start and go through these known solutions for tar, unless you are more interested in dropping the use of tar altogether: Perhaps it makes more sense to rely on something like git rev-parse HEAD to produce the checksum in the context of a git repository. But I guess we'll need the tarball for other uses in EB...

Version constraint for gzip --no-name seems to be 1.2:
https://git.savannah.gnu.org/cgit/gzip.git/tree/ChangeLog-2007#n1273

@boegel
Member

boegel commented Jan 18, 2019

Long term the easier/more portable solution would certainly be to stick to using tar cfvz to create the tarball, and compute the checksum on the contents (excl. .git directory?) rather than the tarball itself.

That's harder to implement though, since right now EasyBuild always verifies the checksum on the source tarballs themselves, not the unpacked directory that results from it...

W.r.t. gzip --no-name: supported since gzip 1.2 which was released in 1993, I'm going to assume that's OK... ;)

@boegel boegel modified the milestones: 3.8.1, 3.x Jan 18, 2019
@mboisson
Contributor

@boegel I am mostly oblivious to the checksum problems with tarballs, so I am afraid I can be of no help. I don't have an opinion on the topic.

@boegel boegel added the bug fix label Jan 18, 2019
@Louwrensth Louwrensth force-pushed the 2726_checksum_not_constant_with_git_config branch from aa0cfa5 to 699f683 Compare January 19, 2019 05:22
@Louwrensth
Contributor Author

I've been playing with various modifications of this command (after https://wiki.debian.org/ReproducibleBuilds)

find testrep -print0 | LC_ALL=C sort -z | tar --no-recursion --null -T - --format=ustar --mtime=0 --owner=0 --group=0 --numeric-owner -cf testrepo.tar 

And it is reproducible (also when adding gzip --no-name), but the tar somehow differs a bit on different filesystems :(
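For comparison, the same normalization can be sketched with Python's tarfile module. This is an illustration under assumptions, not what the PR does: deterministic_tar is a hypothetical helper, and the mapping to the tar options (sorted paths for find | sort, recursive=False for --no-recursion, USTAR_FORMAT for --format=ustar, zeroed mtime/uid/gid, empty owner names for --numeric-owner) is only approximate. File modes are left untouched, as in the command above.

```python
import io
import os
import tarfile


def deterministic_tar(root):
    """Return the bytes of a ustar archive of root with normalized metadata."""
    root = os.path.abspath(root)
    parent = os.path.dirname(root)

    # Collect every path first, then sort bytewise (like LC_ALL=C sort).
    paths = [root]
    for dirpath, dirnames, filenames in os.walk(root):
        paths.extend(os.path.join(dirpath, name) for name in dirnames + filenames)
    paths.sort(key=os.fsencode)

    def normalize(info):
        # Drop the metadata that varies between checkouts and users.
        info.mtime = 0
        info.uid = info.gid = 0
        info.uname = info.gname = ""
        return info

    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        for path in paths:
            arcname = os.path.relpath(path, parent)
            tar.add(path, arcname=arcname, recursive=False, filter=normalize)
    return buf.getvalue()
```

Two checkouts with the same file contents and modes should then give identical bytes regardless of creation order or timestamps, at least when the same Python version is used on both sides.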

Then I tried to use git archive on the temporary repository. Seems reproducible on different filesystems! (I tried tmpfs, ext4, nfs4.) :)

But git archive doesn't automatically traverse submodules, so I run git submodule foreach and concatenate the tar files into one. Fingers crossed this works.
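As a rough sketch of that sequence (the function names are hypothetical and this is not the actual implementation in the PR), the steps could be driven from Python as follows. The commands mirror the ones that show up in the --trace output further down in this conversation; $name and $path are variables that git submodule foreach sets for each submodule.

```python
import subprocess


def archive_commands(tarball, prefix):
    """Build the commands: archive the repo, archive each submodule, merge, compress."""
    return [
        ["git", "archive", "-o", tarball, "--prefix=%s/" % prefix, "HEAD"],
        ["git", "submodule", "foreach",
         "git archive -o $name.tar --prefix=%s/$path/ HEAD" % prefix],
        ["git", "submodule", "foreach",
         "tar -f %s --concatenate $name.tar" % tarball],
        ["gzip", "-nf", tarball],
    ]


def create_tarball(repo_dir, tarball, prefix):
    """Run the commands inside the local clone and return the path of the .tar.gz."""
    for cmd in archive_commands(tarball, prefix):
        subprocess.run(cmd, cwd=repo_dir, check=True)
    return tarball + ".gz"
```

Note that the foreach commands are evaluated by a shell, so this sketch assumes the tarball path and prefix contain no spaces or shell metacharacters.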

@boegel
Member

boegel commented Jan 19, 2019

@Louwrensth Even git archive still doesn't give hard guarantees that the exact same tarball will be generated on different systems though, since just a different version of git is sufficient to cause trouble? See also easybuilders/easybuild-easyconfigs#5151 for the fallout that happened after GitHub updated to a newer version of Git...

@Louwrensth
Contributor Author

@boegel : Thanks for the comment.

Regarding git version constraints: no issue afaik. I have tried it with git 2.10.1 and git 1.8.3.1 with identical results. I can look into the git history to see how git archive might have changed over time.

Regarding GitHub's changes: no issue afaik, because the trick is not to run git archive directly against the GitHub remote (that would introduce a dependency on the git server's archive implementation, which GitHub does not seem to support at the moment anyway), but to run git archive in the temporary local clone.

@Louwrensth Louwrensth force-pushed the 2726_checksum_not_constant_with_git_config branch from 7c75bd0 to c6b1aa6 Compare January 21, 2019 04:02
@Louwrensth Louwrensth force-pushed the 2726_checksum_not_constant_with_git_config branch from c6b1aa6 to 171139d Compare January 21, 2019 06:07

Replace 'tar' by 'git archive' as it seems more reproducible.
Use 'gzip --no-name' to omit timestamps.
@Louwrensth Louwrensth force-pushed the 2726_checksum_not_constant_with_git_config branch from 171139d to 123f7bc Compare January 21, 2019 06:12
@Louwrensth
Contributor Author

Okay, now Travis and the Hound are happy :)

@boegel I looked into the history of git archive --format=tar and found that it has been pretty stable for years now. An interesting commit in 2014 (after this I stopped going back in time) reveals that the developers are concerned with keeping it bit-for-bit equivalent with older versions. Combined with my test of git archive with git 1.8.3.1 (2013), I think we're safe.

@Louwrensth
Contributor Author

@boegel Please let me know if there is anything left you want to discuss.

@Louwrensth
Contributor Author

@boegel: I take it you're very busy, but I wanted to let you know that this PR is ready for your final review and merge.

@easybuilders easybuilders deleted a comment from boegelbot Mar 8, 2019
@easybuilders easybuilders deleted a comment from boegelbot Mar 8, 2019
@easybuilders easybuilders deleted a comment from boegelbot Mar 8, 2019
@easybuilders easybuilders deleted a comment from boegelbot Mar 8, 2019
@boegel
Member

boegel commented Mar 9, 2019

@Louwrensth The changes look good and make sense to me.

I tested this with the following easyconfig (which is useless, but fine as a test case here):

easyblock = 'Tarball'

name = 'easybuild-framework'
version = '3.8.0'

homepage = 'http://easybuilders.github.io/easybuild'
description = "EasyBuild framework"

toolchain = {'name': 'dummy', 'version': 'dummy'}

sources = [{
    'filename': SOURCE_TAR_GZ,
    'git_config': {
        'url': 'https://github.com/easybuilders',
        'repo_name': 'easybuild-framework',
        'tag': 'easybuild-framework-v%(version)s',
        'recursive': True,
    },
}]

sanity_check_paths = {
    'files': ['eb'],
    'dirs': ['easybuild/framework'],
}

moduleclass = 'tools'

The changed implementation (still) works fine:

$ eb --inject-checksums --trace test.eb
== temporary log file in case of crash /tmp/eb-tEk8bO/easybuild-xJuNVu.log
== injecting sha256 checksums in /tmp/test.eb
== fetching sources & patches for test.eb...
  >> running command:
        [started at: 2019-03-09 15:09:26]
        [output logged in /tmp/eb-tEk8bO/easybuild-run_cmd-67Drf6.log]
        git clone --branch easybuild-framework-v3.8.0 --recursive https://github.com/easybuilders/easybuild-framework.git
  >> command completed: exit 0, ran in 00h00m06s
  >> running command:
        [started at: 2019-03-09 15:09:33]
        [output logged in /tmp/eb-tEk8bO/easybuild-run_cmd-msBcfG.log]
        git archive -o /tmp/sources/e/easybuild-framework/easybuild-framework-3.8.0.tar --prefix=easybuild-framework/ HEAD
  >> command completed: exit 0, ran in < 1s
  >> running command:
        [started at: 2019-03-09 15:09:33]
        [output logged in /tmp/eb-tEk8bO/easybuild-run_cmd-obDiLb.log]
        git submodule foreach 'git archive -o $name.tar --prefix=easybuild-framework/$path/ HEAD'
  >> command completed: exit 0, ran in < 1s
  >> running command:
        [started at: 2019-03-09 15:09:33]
        [output logged in /tmp/eb-tEk8bO/easybuild-run_cmd-YxJAUV.log]
        git submodule foreach 'tar -f /tmp/sources/e/easybuild-framework/easybuild-framework-3.8.0.tar --concatenate $name.tar'
  >> command completed: exit 0, ran in < 1s
  >> running command:
        [started at: 2019-03-09 15:09:33]
        [output logged in /tmp/eb-tEk8bO/easybuild-run_cmd-4F6HVa.log]
        gzip -nf /tmp/sources/e/easybuild-framework/easybuild-framework-3.8.0.tar
  >> command completed: exit 0, ran in < 1s
  >> sources:
  >> /tmp/sources/e/easybuild-framework/easybuild-framework-3.8.0.tar.gz
== backup of easyconfig file saved to /tmp/test.eb.bak_20190309150933...
== injecting sha256 checksums for sources & patches in test.eb...
== * easybuild-framework-3.8.0.tar.gz: b8e500bd59946249846ec8f4e97fd5005b090982ed89b0108421bf9fdb0727e1
== Temporary log file(s) /tmp/eb-tEk8bO/easybuild-xJuNVu.log* have been removed.
== Temporary directory /tmp/eb-tEk8bO has been removed.

However, I'm still getting a different checksum when testing on two (very) different systems:

  • on macOS 10.14.3 with:

    • gzip: Apple gzip 272.220.1
    • tar: bsdtar 2.8.3 - libarchive 2.8.3
    • git: git version 2.17.2 (Apple Git-113)
    • checksum: f6f77e5d4251d6f02f90dfc378ec307be289e01a712d74c0ff13f5ba04f5f653
  • on CentOS 7.6 with:

    • gzip: gzip 1.5
    • tar: tar (GNU tar) 1.26
    • git: git version 1.8.3.1
    • checksum: b8e500bd59946249846ec8f4e97fd5005b090982ed89b0108421bf9fdb0727e1

Perhaps it's a bit unfair to expect getting the exact same checksum on two systems that are so different, but I just wanted to bring that up...

Thoughts? Would using bzip2 rather than gzip help?

One additional thing we should keep in mind is that changing the implementation will also result in getting different checksums on the same system when using an EasyBuild version that includes these changes.
That's not necessarily a blocker, just something to keep in mind.

Maybe we should emit a clear warning when checksums are being used in easyconfigs that use git_config?
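A minimal sketch of such a check, in plain Python and independent of EasyBuild's own logging functions: both function names are hypothetical, and sources is assumed to be a list of filename strings or dicts shaped like the sources entry in the test easyconfig above.

```python
import warnings


def git_config_sources(sources):
    """Return the filenames of source specs that are created via git_config."""
    return [src.get("filename", "<unnamed>") for src in sources
            if isinstance(src, dict) and "git_config" in src]


def warn_if_checksummed(sources, checksums):
    """Warn when checksums are specified for tarballs that are created locally."""
    flagged = git_config_sources(sources) if checksums else []
    for filename in flagged:
        warnings.warn("checksum for %s may differ across systems: "
                      "the tarball is created locally via git_config" % filename)
    return flagged
```

Returning the flagged filenames keeps the check easy to test and lets the caller decide whether a warning is enough or whether verification should be skipped for those sources.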

@akesandgren
Contributor

@Louwrensth Any progress on this? I.e. regarding the last comment from @boegel

@Louwrensth
Contributor Author

@akesandgren :

> @Louwrensth Any progress on this? I.e. regarding the last comment from @boegel

No progress. I've been busy elsewhere.

@boegel :

> Perhaps it's a bit unfair to expect getting the exact same checksum on two systems that are so different, but I just wanted to bring that up...
>
> Thoughts? Would using bzip2 rather than gzip help?
>
> One additional thing we should keep in mind is that changing the implementation will also result in getting different checksums on the same system when using an EasyBuild version that includes these changes.
> That's not necessarily a blocker, just something to keep in mind.
>
> Maybe we should emit a clear warning when checksums are being used in easyconfigs that use git_config?

I do not know how to get around this... Maybe bzip2 would help, but maybe it's the OS/filesystem that makes the files always slightly different before zipping...
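One way to narrow this down is to check whether the two tarballs already differ before compression. The sketch below (layer_digests is a hypothetical helper, not part of EasyBuild) hashes both the .tar.gz bytes and the decompressed tar stream, so the results from two systems can be compared layer by layer.

```python
import gzip
import hashlib


def layer_digests(tar_gz_bytes):
    """Return (sha256 of the .tar.gz, sha256 of the decompressed tar stream)."""
    outer = hashlib.sha256(tar_gz_bytes).hexdigest()
    inner = hashlib.sha256(gzip.decompress(tar_gz_bytes)).hexdigest()
    return outer, inner
```

Called on the contents of each file (for example open(path, "rb").read()), matching inner digests with differing outer digests would point at the compression step, while differing inner digests would point at git archive or tar.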

Maybe we can live with the warning?

Or maybe we make use of the git hash instead of zipping+checksumming? That would break EB's practice of keeping a source tarball for each installation, though.

@boegel
Member

boegel commented Sep 29, 2019

I think the main cause of getting different checksums is a different version of the tools that come into play (git, tar, gzip or bzip2, etc.). It's clear that we can't assume we'll get the exact same tarball on different systems. See also https://wiki.debian.org/ReproducibleBuilds/Howto

One option would be to compute a collective checksum on the contents of the unpacked sources, without packing it into a tarball at all, since the contents are exactly the same on different systems, and that's what we actually care about...
We could then set something like contents_checksums = True to tell EasyBuild to compute/check the specified checksums using the contents rather than the tarball itself.
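A collective checksum over the unpacked contents could look roughly like the sketch below. The function name and the exact hashing scheme are assumptions made for illustration; nothing like this exists in EasyBuild as far as this thread shows. It skips .git directories, visits files in sorted order, and ignores timestamps, ownership, permissions, and empty directories.

```python
import hashlib
import os


def contents_checksum(root, skip=(".git",)):
    """Return a sha256 hex digest over relative paths and file contents under root."""
    entries = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip]
        for name in filenames:
            full = os.path.join(dirpath, name)
            rel = os.path.relpath(full, root).replace(os.sep, "/")
            entries.append((rel, full))

    digest = hashlib.sha256()
    for rel, full in sorted(entries):
        file_hash = hashlib.sha256()
        with open(full, "rb") as fh:
            for chunk in iter(lambda: fh.read(1 << 20), b""):
                file_hash.update(chunk)
        # NUL-terminated path plus fixed-length file digest keeps entries unambiguous.
        digest.update(rel.encode("utf-8") + b"\0")
        digest.update(file_hash.digest())
    return digest.hexdigest()
```

Because only paths and bytes are hashed, the result does not depend on the versions of git, tar, or gzip involved, which is the property being asked for here. Symlinks are followed when read, so a real implementation would need to decide how to treat them.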
