Faster no-ops for VCS dependencies #1415
Conversation
Hey there! I wasn't sure if this is a change you'd be interested in. I think it can be pretty helpful in a number of situations (the 10x speedup for the editable case feels pretty compelling), but it does have the downside of mucking around with pip internals, changing packages after they're installed (understandably spooky), and comes with a bit of a performance hit in certain situations. In any case, I figured I'd clean it up and send it along in case it is something you all think would be useful to have.
Sorry for the force pushes, they're to (hopefully 🤞) fix the tests. I should have run the full
@FuegoFro No worries about your revision preference; GitHub got better recently with force pushes. Still, please fix the conflicts and ensure everything reports green, and maybe we can merge it for the next release.
Please address conflicts. Thanks!
This allows pip-sync to no-op faster for VCS dependencies that haven't changed.

Before this change, running `pip-sync` with a requirements file that has VCS packages would always uninstall and reinstall those dependencies. This is because `pip` primarily operates on versions and assumes that different code will have a different version. However, this assumption breaks down with VCS packages, where two different commits may very well have different code but the same version. Therefore, to be safe, we have historically had to reinstall these VCS dependencies.

This change extends `pip-sync` to write some extra metadata to the installed package so it can track which revision the package was installed from, and then skip the uninstall-and-reinstall when the revision we want to install matches the revision that is already installed. This significantly speeds up no-op `pip-sync` invocations for requirements files with multiple VCS packages.

### Tests

The tests added are written as best I could as unit tests. I think a better test would likely spin up an entirely new virtualenv, do a sync in it, and verify that a subsequent sync doesn't reinstall anything, but the existing tests seem less like full integration tests, so I didn't go down that route. As is, there's a somewhat tricky bit of logic that reaches into the workings of `pip` that's mocked out in the tests (namely the `_reload_installed_distributions_by_key` function).
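To make the mechanism above concrete, here is a minimal sketch of the idea, not the PR's actual code: record which VCS revision a package was installed from, and skip the uninstall-and-reinstall when the wanted revision matches. The metadata file name `vcs_revision.json` and all function names here are hypothetical.

```python
# Hypothetical sketch of revision-tracking metadata for VCS installs.
# File and function names are illustrative, not pip-tools internals.
import json
from pathlib import Path
from typing import Optional

REVISION_FILE = "vcs_revision.json"  # assumed metadata file name


def record_installed_revision(dist_info_dir: Path, revision: str) -> None:
    """After installing a VCS package, remember the commit it came from."""
    (dist_info_dir / REVISION_FILE).write_text(json.dumps({"revision": revision}))


def get_installed_revision(dist_info_dir: Path) -> Optional[str]:
    """Return the recorded revision, or None if nothing was recorded."""
    path = dist_info_dir / REVISION_FILE
    if not path.exists():
        return None
    return json.loads(path.read_text()).get("revision")


def needs_reinstall(dist_info_dir: Path, wanted_revision: str) -> bool:
    """Reinstall only when the recorded revision differs from the wanted one."""
    return get_installed_revision(dist_info_dir) != wanted_revision
```

A sync would call `needs_reinstall` before touching a VCS package, and `record_installed_revision` right after installing one; packages with no recorded revision fall back to the safe reinstall path.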
### Performance

Given a `requirements.in` of

```
-e git+https://github.com/python/mypy.git@v0.902#egg=mypy
-e git+https://github.com/psf/black.git@21.5b2#egg=black
-e git+https://github.com/yaml/pyyaml.git@5.4.1.1#egg=pyyaml
```

and a corresponding `requirements.txt` of

```
-e git+https://github.com/psf/black.git@21.5b2#egg=black
-e git+https://github.com/python/mypy.git@v0.902#egg=mypy
-e git+https://github.com/yaml/pyyaml.git@5.4.1.1#egg=pyyaml
appdirs==1.4.4
click==8.0.1
importlib-metadata==4.5.0
mypy-extensions==0.4.3
pathspec==0.8.1
regex==2021.4.4
toml==0.10.2
typed-ast==1.4.3
typing-extensions==3.10.0.0
zipp==3.4.1
```

I ran tests to determine how this affects the fresh install time as well as the no-op install time. I also ran the tests with non-editable VCS dependencies, both with the tag and with an exact commit hash (since pip has optimizations for the latter case):

| Environment | Fresh install time (approx seconds) | No-op install time (approx seconds) |
| ---------------------------------------------------- | ----------------------------------- | ----------------------------------- |
| Editable, without optimization | 20 | 12 |
| Editable, with optimization | 20 🟡 | 0.5 🟢 |
| Non-editable, without optimization | 20 | 20 |
| Non-editable, with optimization | **25** 🔴 | 5 🟢 |
| Non-editable with exact commit, without optimization | 3 | 2 |
| Non-editable with exact commit, with optimization | **7** 🔴 | **4** 🔴 |

As you can see, most of the numbers remain similar (for fresh installs) or get significantly better (for no-op installs). There are some regressions in a few places, due to the extra time it takes to calculate the relevant information, both when detecting no-ops before installation and when writing that information into the environment after installation.
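For readers unfamiliar with the requirement lines above, the revision being compared is the part after `@` in the VCS URL (a tag like `v0.902` or a commit hash), and the project name comes from the `#egg=` fragment. The following is illustrative string handling only, not pip's actual VCS URL parser:

```python
# Illustrative parsing of an editable VCS requirement line such as
# "-e git+https://host/org/repo.git@REV#egg=NAME". Not pip's parser.
from urllib.parse import urlparse


def parse_vcs_requirement(line: str):
    """Return (name, revision) extracted from a VCS requirement line."""
    url = line.strip()
    if url.startswith("-e "):
        url = url[3:].strip()
    parsed = urlparse(url)
    # The revision follows the last "@" in the path, e.g. "mypy.git@v0.902".
    revision = parsed.path.rsplit("@", 1)[1]
    # The project name comes from the "#egg=" fragment.
    name = parsed.fragment[len("egg="):]
    return name, revision
```

Note that a tag like `v0.902` can move in principle, while an exact commit hash cannot, which is why pip can short-circuit more aggressively in the exact-commit case benchmarked above.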
Hey, sorry for the long delay getting back on this! I've rebased, resolved conflicts, and have tests passing. I didn't have a Windows machine to test them on (and the failure seems to be Windows specific), so I apologize for the iterative approach here. The "fix" involved just... ignoring

Also, I wasn't sure whether the code coverage provided by my current tests will be sufficient, but I wanted to get a sense of whether this is even worth investing more time in before trying to add more tests (though I'm also at a bit of a loss for what other meaningful tests could be added; I'd love advice there if we do want more coverage). Thanks!
> I wanted to get a sense of whether this is even worth investing more time in before trying to add more tests
We tend to move away from using pip's private internals, see #1654. So I would vote against bringing more internals into the codebase, even if it's a speed improvement, which does look great.
I agree with @atugushev: this relies on too many pip internals. If we want speed, we need changes to pip itself to address these issues. Any pip internal that we use makes our code much harder to maintain.
Changelog-friendly one-liner: Faster no-ops for VCS dependencies
Contributor checklist