
Fix unnecessary downloads by pipenv lock #3827 #3830

Closed
wants to merge 3 commits

Conversation


@ghost ghost commented Jul 9, 2019

Closes #3827

The issue

pipenv lock retrieves all possible artifacts (even those for other platforms/architectures) in order to calculate their hashes, even though PyPI already includes the sha256 hash in the URL. This means lengthy downloads (in one case 893MB vs. 50MB), which users experience as pipenv hanging.

The fix

If the artifact URL already contains a hash value computed with the same hash function pipenv would have used, we simply extract it and use it directly, without downloading anything. The extracted hash is also stored in the hash cache.
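For illustration, extracting the fragment hash could look like this minimal sketch (hypothetical helper name, not the PR's actual code; it only accepts sha256, the algorithm pipenv computes itself):

```python
from urllib.parse import urlparse

def hash_from_url(url):
    """Return 'sha256:<digest>' if the URL fragment carries a usable hash,
    else None. PyPI's Simple API appends '#sha256=<digest>' to file links."""
    fragment = urlparse(url).fragment
    algo, sep, digest = fragment.partition("=")
    # Only trust the algorithm pipenv would have computed itself (sha256)
    if sep and algo == "sha256" and digest:
        return f"sha256:{digest}"
    return None

url = "https://files.pythonhosted.org/packages/a1/b2/pkg-1.0-py3-none-any.whl#sha256=0f3c"
print(hash_from_url(url))  # sha256:0f3c
```

A miss (no fragment, or a different algorithm) would fall back to the existing download-and-hash path.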

The checklist

  • Associated issue
  • A news fragment in the news/ directory

If this is a patch to the vendor directory…

This fixes an existing vendor patch merged in cb895aa.

@frostming
Contributor

It's a change to an upstream library, but I am OK with this. It just needs another look by @techalchemy @jxltom @uranusjr

@frostming frostming added Category: Dependency Resolution Issue relates to dependency resolution. Category: Performance Issue relates to performance Type: Vendored Dependencies This issue affects vendored dependencies within pipenv. Status: Awaiting Review This item is currently awaiting review. labels Jul 9, 2019
@uranusjr
Member

uranusjr commented Jul 9, 2019

Edit: Oh nvm, this is a Pipenv-only patch.

@uranusjr
Member

uranusjr commented Jul 9, 2019

I think I am OK with the implementation, but I'm not entirely sure whether it is a good idea to trust the URL hash. OTOH, if the URL could be compromised, so could the downloaded package, so in principle neither is better than the other; you have to trust one of them eventually.

@techalchemy
Member

I guess I have mixed feelings about this. I've been saying for some time now that 'hash checking' is what takes a long time, but in reality it's the fact that we require downloading multiple platforms' worth of every artifact in order to lock, due to our patches on pip.

I'm a firm no on this specific PR for security reasons. We discussed the different approaches early in the project (the hash cache is itself patched into pip-tools and not native), and for security reasons we deliberately opted to download the contents and build the hashes that way.

From a reproducibility standpoint, however, it gains you nothing to download and hash the artifacts for other platforms which you yourself will not be installing, since you can't use them and aren't building your environment with them. You will never have any idea whether they actually work properly. On the other hand, simply relying on a hash from a URL as a marker of trust is highly insecure: it doesn't even require an SSL handshake, and there is no certificate verification. That was the reason we never agreed to do it that way. That reasoning, fundamentally, hasn't changed, so I'm not sure our approach on that specific point should either.

I am open to alternative options there, though; clearly downloading the entire internet's worth of packages is not a great way of handling this...

@frostming
Contributor

frostming commented Jul 9, 2019

Yeah, I think that was the purpose of the original commit cb895aa.

It only stores a cache entry when the hash part exists in the URL, but never extracts it, which looks like a mistake.

EDIT

I got some insight into the reasoning behind this from @techalchemy's comment. What do you think about taking advantage of some pypi.org-only methods (the /pypi/{package}/json endpoint) to retrieve the package metadata, falling back to hashing the file if that is not available?
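A sketch of what that fallback chain could query (illustrative only; function names are hypothetical, and real code would need error handling and index configurability):

```python
import json
from urllib.request import urlopen

def hashes_from_release_meta(meta):
    """Map filename -> sha256 digest from a /pypi/{package}/{version}/json payload."""
    return {f["filename"]: f["digests"]["sha256"] for f in meta["urls"]}

def fetch_release_hashes(package, version):
    """Fetch release metadata over TLS from PyPI's JSON API (pypi.org only)."""
    with urlopen(f"https://pypi.org/pypi/{package}/{version}/json") as resp:
        return hashes_from_release_meta(json.load(resp))

# Offline illustration with a trimmed payload:
sample = {"urls": [{"filename": "pkg-1.0-py3-none-any.whl",
                    "digests": {"sha256": "deadbeef"}}]}
print(hashes_from_release_meta(sample))  # {'pkg-1.0-py3-none-any.whl': 'deadbeef'}
```

Non-PyPI indexes don't expose this endpoint, hence the need for the file-hash fallback.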

@techalchemy
Member

Hmmm, using the JSON endpoint is an interesting idea; it's probably equally secure compared with downloading from there, right? The content is still encrypted / we will still be doing the SSL handshake, yeah?

/cc @nateprewitt I know you had concerns about this 2 years ago when we implemented it

@uranusjr
Member

uranusjr commented Jul 9, 2019

Using the JSON endpoint is fundamentally the same as using the URL-provided hash. They are both strings from the index server. I’m firmly -1 on the JSON endpoint. Just use the Simple page; if that is a bad idea, so is JSON.

@ghost
Author

ghost commented Jul 10, 2019

On reflection, this whole discussion is a moot point.

pipenv must download/build artifacts and resolve dependencies once in order to install them. That means by the time lock is run, it should already have all the information it needs to reproduce the build.

pip-tools seems to follow this "fetch once" model. Why doesn't pipenv?

@uranusjr
Member

It’s not entirely moot, IMO. Pipenv (ideally) only needs to download one artifact for resolution; the URL-provided hash can still avoid some downloads.

@techalchemy
Member

pipenv must download/build artifacts and resolve dependencies once in order to install them. That means by the time lock is run, it should already have all the information it needs to reproduce the build.

We obviously would not want to download artifacts more than once. I've been saying, everywhere this issue comes up, that if someone can identify whether this is actually occurring and how we can fix it, we will absolutely merge that change.

@techalchemy
Member

The same argument can be made for retrieving the artifact itself, as pointed out. If you trust the server, you can trust its hash hint. The PR should definitely require a trusted connection, but I don't see how following a URL provided by the server is considered secure while the hash information in that very same URL is considered insecure. If you feel differently, can you explain the threat model or point to the early discussions you mentioned?

The point about hash hints is again not a point about trust specifically; it's a point about reproducibility. When you download an artifact, we want to ensure that we know exactly which artifacts represent the valid set of options at a given point in time. We can only actually make a guarantee about the one you installed, which is why it might make sense to avoid the downloads entirely in the other cases.

@techalchemy
Member

So you're concerned about the case in which the artifact store (due to a bug, not an attack) reports the wrong hash hint for the artifact?

No, I'm concerned about using hash hints at all, which is a fundamental shift in our implementation.

The only reason I am considering it is that pipenv currently downloads and hashes artifacts for every platform and Python version, regardless of whether it is actively performing an installation with those artifacts. However, it also caches the outcome in a local hash cache. If we switched to using the hash hints, would we use them in place of those supplied in the existing Pipfile?

The difference here is that we are only using a single artifact, so actually performing the hash only tells us anything useful about that individual artifact (i.e. from a reproducibility standpoint, we can only be sure that the one specific artifact works the same, since we didn't install any of the others).
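The local hash cache mentioned above can be pictured as a URL-keyed memo that only downloads on a miss. A toy in-memory sketch (hypothetical names; pipenv's real cache persists to disk inside its vendored patches):

```python
import hashlib

class HashCache:
    """Toy URL-keyed hash cache; computes a sha256 digest at most once per URL."""
    def __init__(self):
        self._cache = {}

    def get_hash(self, url, fetch):
        """Return the cached 'sha256:<digest>' for url, calling fetch() only on a miss."""
        if url not in self._cache:
            data = fetch(url)  # the expensive download this cache avoids repeating
            self._cache[url] = "sha256:" + hashlib.sha256(data).hexdigest()
        return self._cache[url]

cache = HashCache()
calls = []
def fake_fetch(url):
    calls.append(url)
    return b"artifact bytes"

h1 = cache.get_hash("https://example.invalid/pkg.whl", fake_fetch)
h2 = cache.get_hash("https://example.invalid/pkg.whl", fake_fetch)
assert h1 == h2 and len(calls) == 1  # second lookup never re-downloads
```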

To address a point @uranusjr made the other day:

Using the JSON endpoint is fundamentally the same as using the URL-provided hash. They are both strings from the index server. I’m firmly -1 on the JSON endpoint. Just use the Simple page; if that is a bad idea, so is JSON.

It's not fundamentally the same, since the proposal is to stop downloading all the artifacts in the first place. We currently only have the option to use the URL hints because we expose ourselves to each InstallRequirement during resolution via various hacks to pip's resolution process. With this proposal we'd basically remove all of that logic and just grab the hashes listed at the JSON endpoint directly.

Using the Simple index would require us to use the XML-RPC client, which is not really a great idea.

Overall this is a pretty major change to our hashing logic. I appreciate your desire to move this forward, and I do think it's possibly a good idea, but I also want to convey that this is a serious modification to both our security model (which I didn't design and probably don't understand well enough to sign off on) and our core resolution logic. We can't be haphazard about this, so it will take some time to consider how to move forward.

@ghost
Author

ghost commented Jul 15, 2019

No, I'm concerned about using hash hints at all, which is a fundamental shift in our implementation.

The current implementation has unreasonable lock times for many users. It needs changing if this is a problem you want to fix.

since the proposal is to stop downloading all the artifacts in the first place.

No, the proposal (which this PR now only partially implements) was to hash, and cache the hash of, every artifact that is downloaded at download time, and to use hash hints for whatever hasn't been installed (or tested) at lock time.

Overall this is a pretty major change to our hashing logic. I appreciate your desire to move this forward and I do think it's possibly a good idea but I also want to convey that this is a serious modification to both our security model (which I didn't design and probably don't understand well enough to sign off on) and to our core resolution logic. We can't be haphazard about this, so it will take some time to consider how to move forward.

You've convinced me that I won't be able to fix this with the level of effort I'm willing to put in. I've only been using pipenv for a few days and at this point the simplest choice for me is to move back to pip.

Thanks.

@ghost ghost closed this Jul 15, 2019
@techalchemy
Member

The current implementation has unreasonable lock times for many users. It needs changing if this is a problem you want to fix.

We agree; the only thing anyone is uncertain about is the approach.

No, the proposal (which this PR now only partially implements) was to hash, and cache the hash of, every artifact that is downloaded at download time, and to use hash hints for whatever hasn't been installed (or tested) at lock time.

Right, but that restates the same point. We currently do download every artifact at lock time; anything that modifies this behavior (i.e. using hash hints) is therefore inherently proposing to change that fact.

@ekhaydarov

With @pilkibun on this. pipenv is dead. It's been half an hour trying to install from a lock file. I agree that sci packages may slow down installs, but think of it in terms of time spent: me spending a couple of hours resolving dependency issues once every 3-6 months, or 30+ minutes wasted on every CI/test/local build? It's a no-brainer to move back to pip.

@andyneff

andyneff commented Aug 6, 2019

install from lock file

@ekhaydarov consider not locking when you are just installing from a lock file. I don't like CI/testing updating the packages on me, especially when those updates aren't going to be saved/tracked, so I find pipenv sync to be a better fit there.

@Jmennius

Jmennius commented Oct 22, 2020

Hello,

(1)
I think this is something the maintainers should reach a consensus on: this issue affects pipenv users (#3827) and something should be done about it (even if that requires changing some of pipenv's behavior/design).
This is likely the single largest issue that makes users perceive pipenv as slow.

For example, when installing tensorflow (for the first time, or a new version, and locking), pipenv will download 12 tensorflow wheels totaling about 3 GiB, of which roughly 2.8 GiB are unnecessary (11 packages for other Python versions and operating systems).
Even on a decent Internet connection (up to 100 Mbit/s, though average download speeds are maybe a third or a quarter of that), users will hit pipenv's timeout at about 17 minutes in (PIPENV_INSTALL_TIMEOUT).

I tested the following (2 cores, 4 GB RAM, high-throughput network on AWS EC2 with traffic shaping):
On a reasonable asymmetrical 10/1 Mbit link, the whole pipenv install takes about an hour.
On the same link, poetry finishes in just 6 minutes (i.e. 10 times faster).
pipenv with the change proposed in this PR also finishes in 6 minutes (again, 10 times faster).

I would like to cover some concerns that were voiced in that PR:

(2)

The hash from the URL is just a hint.

According to PEP 503, which defines the Simple Repository API, the hash in the URL is not just a hint; it is mandatory for repositories in the Simple format.

My interpretation is that this hash is the actual source of truth for your download: you should check your download against it to verify that the downloaded file is not corrupted or tampered with (if it doesn't match, assume your download has gone bad).
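That interpretation amounts to a simple check after download, sketched here (hypothetical helper, assuming a sha256 fragment as PyPI provides):

```python
import hashlib

def matches_url_hash(data, url_digest):
    """True if the downloaded bytes hash to the sha256 digest taken
    from the URL fragment (PEP 503 style)."""
    return hashlib.sha256(data).hexdigest() == url_digest

payload = b"wheel contents"
good = hashlib.sha256(payload).hexdigest()
assert matches_url_hash(payload, good)       # intact download
assert not matches_url_hash(b"oops", good)   # corrupted/tampered download
```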

(3)

Trust and other security reasons

When you download from PyPI, you ultimately trust the host from which you download because of TLS. The trust doesn't go beyond that, as there are no signatures for most (?) packages; essentially you just trust the CDN and have no other choice (wheels are rarely signed).
With TLS, whatever the host tells you (the URL, the hash, the file contents) is guaranteed to come from that host and not to have been tampered with in transit, but you can't really be sure it comes from PyPI beyond the fact that the host presents a domain name belonging to PyPI.

My point is that you have equal trust in the hash in the URL and in the file contents, and because of the previous point we should actually prefer the hash.

(4)
Can I propose that perspective?

pipenv should handle reproducibility; security should be offloaded to wheels with (attached or detached) signatures.

At the moment, the only security pipenv provides is a guarantee that the wheels did not change after the lock.
This approach is very limited, since it relies on your having received untampered files at lock time, and you can't guarantee that without checking signatures.

@frostming @uranusjr @techalchemy Let's agree on (1) and discuss the rest.


edit: this was resolved in #4500, thank you!

This pull request was closed.
Successfully merging this pull request may close these issues.

pipenv lock hangs. It really does.