Description
Tarballs contain more information than we need (e.g. users, groups, fine-grained permissions, timestamps), and also allow the same information to be represented in multiple ways (e.g. order of directory contents, files defined twice). The basic problem this creates is that files cannot be deterministically assembled into an archive. In practice this means:
- Directory registries cannot be verified against lockfiles as well
- Packages may accidentally depend on permissions only supported on some platforms
- Sources besides registries cannot be mirrored (distinct from what sorts of sources can serve as mirrors)
- Users may unintentionally leak information about their current system when publishing packages
None of these is terribly pressing on its own, but hopefully they are worthy of a solution in aggregate.
The solution is to first decide carefully which metadata we wish to support (that is, the information our archives will contain), and then to pick a canonical form for every possible archive containing that information. A thornier question is whether existing uploads should be normalized according to the chosen schema.
For backwards compatibility, it is probably best to stick with some subset of tar. This is what Debian does. Where an extraneous field cannot be elided, it should be constrained to some fixed value. Either the most expressive POSIX tar variant could be used, or the most minimal format that supports the information in question.
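To make the "subset of tar" idea concrete, here is a minimal sketch in Python using the standard `tarfile` module. The function name `canonical_tar`, the choice of supported metadata (path, contents, and an executable bit), and the fixed values (mtime 0, uid/gid 0, empty user and group names) are all assumptions for illustration, not a proposed schema.

```python
import io
import tarfile


def canonical_tar(files: dict[str, bytes], executable: frozenset[str] = frozenset()) -> bytes:
    """Build a tar whose bytes depend only on paths, contents, and an executable bit.

    `files` maps archive paths to contents, so a path cannot be defined twice.
    """
    buf = io.BytesIO()
    # USTAR is the minimal POSIX variant; tarfile raises ValueError for names
    # that do not fit in it rather than emitting extension headers.
    with tarfile.open(fileobj=buf, mode="w", format=tarfile.USTAR_FORMAT) as tar:
        # Sorting fixes the entry order regardless of how the input was built.
        for path in sorted(files):
            data = files[path]
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            # Collapse fine-grained permissions to two allowed modes (assumed).
            info.mode = 0o755 if path in executable else 0o644
            # Fields that cannot be elided are pinned to fixed values.
            info.mtime = 0
            info.uid = info.gid = 0
            info.uname = info.gname = ""
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()
```

With this sketch, two inputs that differ only in insertion order produce byte-identical archives, so a hash of the archive can be checked against a lockfile no matter where the files came from.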
Other options might be git's tree objects or Nix's NAR. The Merkle DAG used by the former can lead to better error messages and free dedup, but SHA-1 is dubiously secure. The latter can be hashed however we like, but still runs into backwards-compatibility problems.
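As a rough illustration of the Merkle DAG idea with a hash other than SHA-1, the sketch below hashes a directory tree bottom-up with SHA-256. It is not git's actual object format or NAR; the `tree_hash` name, the `blob`/`tree` prefixes, and the length-prefixed entry encoding are assumptions made up for this example.

```python
import hashlib


def tree_hash(node) -> str:
    """Hash a tree where a file is `bytes` and a directory is a dict of name -> node."""
    if isinstance(node, bytes):
        # Distinct prefixes keep an empty file and an empty directory apart.
        return hashlib.sha256(b"blob\0" + node).hexdigest()
    h = hashlib.sha256(b"tree\0")
    # Sorted names make the digest independent of directory listing order.
    for name in sorted(node):
        encoded = name.encode("utf-8")
        child = bytes.fromhex(tree_hash(node[name]))
        # Length-prefix the name so entry boundaries are unambiguous.
        h.update(len(encoded).to_bytes(8, "big") + encoded + child)
    return h.hexdigest()
```

Because every subtree has its own digest, a mismatch can be narrowed down to the first differing path, and identical subtrees hash to the same value, which is where the better error messages and the free dedup come from.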
CC @eternaleye