add "unused-files" pruning mode #1614
Description
https://golang.github.io/dep/docs/Gopkg.toml.html#prune currently defines 3 pruning modes:
- non-go: just removes files that are not relevant to Go anyway (e.g. .travis.yml files, READMEs and so on)
- go-tests: also removes test files
- unused-packages: even removes some Go code that is definitely not being used, since it is not referenced by the package that "owns" the vendor folder
I'd like to propose a pruning mode that is even more rigorous than unused-packages:
- unused-files: would remove every file that does not influence the (hash of the) resulting package that "owns" the vendor folder. This means that a package built with a vendor folder that contains only unused-files-pruned packages has to produce the same binary as the same package built with an unpruned vendor folder. It must, however, not be possible to remove any further set of files/directories/symlinks from an unused-files vendored dependency without also influencing the compilation result. Conversely, this means that every file/directory/symlink that does not cause any change in the resulting binary has been removed, even if it is in the same folder as an imported dependency. For example, a Go file that only contains comments, or one that only contains types or functions that are not used and thus stripped away later, would not be vendored in the first place with this strategy.
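The "same binary" criterion above could be checked roughly as follows. This is only a sketch under assumptions: the names hashBytes and buildHash are hypothetical (they are not part of dep), the package in question is assumed to be a main package, and the Go toolchain is assumed to produce byte-identical output for identical inputs in the same location.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"os"
	"os/exec"
	"path/filepath"
)

// hashBytes returns the hex-encoded SHA-256 digest of data.
func hashBytes(data []byte) string {
	sum := sha256.Sum256(data)
	return hex.EncodeToString(sum[:])
}

// buildHash (hypothetical helper) compiles the main package in pkgDir
// into a temporary file and returns the hash of the resulting binary.
func buildHash(pkgDir string) (string, error) {
	tmp, err := os.MkdirTemp("", "prune-build")
	if err != nil {
		return "", err
	}
	defer os.RemoveAll(tmp)

	out := filepath.Join(tmp, "binary")
	cmd := exec.Command("go", "build", "-o", out, ".")
	cmd.Dir = pkgDir
	if msg, err := cmd.CombinedOutput(); err != nil {
		return "", fmt.Errorf("build failed: %v: %s", err, msg)
	}

	data, err := os.ReadFile(out)
	if err != nil {
		return "", err
	}
	return hashBytes(data), nil
}

func main() {
	if len(os.Args) != 2 {
		fmt.Println("usage: buildhash <package-dir>")
		return
	}
	sum, err := buildHash(os.Args[1])
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println(sum)
}
```

Running this once against the unpruned vendor folder gives the reference hash; a candidate pruning is acceptable only if the build still succeeds and the hash is unchanged. A failed build counts as "changed".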
A naive approach would be to do an initial measurement with an unpruned vendor folder, get a list of the files/folders/symlinks in the folder to be pruned, and run a ddmin algorithm (e.g. https://github.com/dgryski/go-ddmin) over that list with the criterion that the binary hash must still be the same as the initial one. Unfortunately, the remaining list of files/folders/symlinks is not guaranteed to be a global minimum, but it would be at least 1-minimal (removing any single entry from that list would change the outcome). This can be sped up by various heuristics (e.g. it is VERY likely that pruning with the existing strategies first would already shrink the list of candidate files/folders/symlinks considerably, while pure ddmin would struggle for a while).
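To make the naive approach more concrete, here is a minimal, self-written sketch of a ddmin-style reduction. It does not use the API of the go-ddmin package linked above; the function names (ddmin, split, complement) and the file paths in main are hypothetical. The keep callback stands in for "rebuild with only these vendored entries and compare the binary hash to the initial one"; in the demo it is simulated by a fixed set of required files.

```go
package main

import "fmt"

// split divides items into n contiguous, non-empty chunks (requires n <= len(items)).
func split(items []string, n int) [][]string {
	chunks := make([][]string, 0, n)
	start := 0
	for i := 0; i < n; i++ {
		end := start + (len(items)-start)/(n-i)
		chunks = append(chunks, items[start:end])
		start = end
	}
	return chunks
}

// complement returns all items except those in chunks[skip].
func complement(chunks [][]string, skip int) []string {
	var out []string
	for i, c := range chunks {
		if i != skip {
			out = append(out, c...)
		}
	}
	return out
}

// ddmin returns a 1-minimal subset of items for which keep still reports true.
// It assumes keep(items) is true for the full input.
func ddmin(items []string, keep func([]string) bool) []string {
	n := 2
	for len(items) >= 2 {
		chunks := split(items, n)
		reduced := false
		// First try to reduce to a single chunk.
		for _, c := range chunks {
			if keep(c) {
				items, n, reduced = c, 2, true
				break
			}
		}
		// Otherwise try dropping one chunk at a time.
		if !reduced {
			for i := range chunks {
				if rest := complement(chunks, i); keep(rest) {
					items, reduced = rest, true
					if n > 2 {
						n--
					}
					break
				}
			}
		}
		// Nothing could be removed: refine the granularity or stop.
		if !reduced {
			if n >= len(items) {
				break
			}
			if n *= 2; n > len(items) {
				n = len(items)
			}
		}
	}
	// A single remaining entry may still be unnecessary.
	if len(items) == 1 && keep(nil) {
		return nil
	}
	return items
}

func main() {
	files := []string{
		"vendor/a/a.go", "vendor/a/a_test.go", "vendor/a/README.md",
		"vendor/b/b.go", "vendor/b/unused.go", "vendor/b/LICENSE",
	}
	// Simulated criterion: the build result is unchanged as long as
	// these two files are present.
	required := map[string]bool{"vendor/a/a.go": true, "vendor/b/b.go": true}
	keep := func(subset []string) bool {
		found := 0
		for _, f := range subset {
			if required[f] {
				found++
			}
		}
		return found == len(required)
	}
	fmt.Println(ddmin(files, keep))
}
```

In this simulated run only the two required files survive. With a real build-and-hash check as the keep function, every call is a full compile, which is why pre-pruning with the existing modes would matter so much in practice.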
The advantages compared to the existing pruning strategies:
- No need to decide if/how/which tests, testdata or symlinks are relevant, there is an objective measurement to decide if they are necessary
- Minimizes the amount of data that needs to be checked in while keeping files unmodified (only rewriting code - e.g. stripping comments - would get any smaller than this)
- Guarantees the same behavior as the unpruned version (the current strategies just assume this property I guess?)
- The same strategy could be used to identify dead code/unused files in the actual code base too.
- Fewer issues with build systems like Bazel that try to build every Go file that they find (and then fail due to optional dependencies of vendored code; see bazel-contrib/bazel-gazelle#93, "In vendor and external repos, only generate rules needed to resolve dependencies", for an example of the kind of pain this causes)
Downsides:
- Computationally expensive if done by (re)compiling and comparing hashes
- If more features of a vendored dependency are being used, it might become necessary to first get an unpruned version from upstream and then re-prune (this is already a potential issue with unused-packages too)
- You are only guaranteed to have the code you currently need available in your vendor folder, not a full "insurance" against vanishing upstreams (this is a general issue with pruning)