
pip -r double installs if called also from within setup.py #2745

Closed
johnyf opened this issue May 3, 2015 · 10 comments
Labels
auto-locked Outdated issues that have been locked by automation

Comments

@johnyf

johnyf commented May 3, 2015

In a clean virtualenv, with:

  • python 2.7.9
  • pip v6.1.1

the requirements.txt:

dd
ply==3.4

leaves a site-packages containing:

0 drwxr-xr-x  14 ... ply
0 drwxr-xr-x   7 ... ply-3.4-py2.7.egg-info
0 drwxr-xr-x   7 ... ply-3.6-py2.7.egg-info

with directory ply containing files from both v3.4 and v3.6.

I think that this is caused by a nested call to pip from the setup.py of package dd:

pip.main(['install', 'ply >= 3.4', ...])
...
setuptools.setup(...install_requires=['ply >= 3.4', ...],...)

Apparently, pip collects all requirements, decides to install ply v3.4, and starts building packages.
While building, the nested call to pip should see no installed ply, if all packages are copied to site-packages only at the end (I am not sure whether this is pip's behavior).
So the nested call installs ply v3.6, which satisfies ply >= 3.4, because it is oblivious to the requirement ply==3.4 in the requirements file. In addition, the outer pip that handles the requirements.txt doesn't know that a nested call occurred, so apparently it builds and copies ply v3.4 on top of the nested install of ply v3.6. This leaves a corrupted ply in site-packages.

@qwcode
Contributor

qwcode commented May 3, 2015

a setup.py shouldn't be calling pip. that's not part of the model.

instead of calling pip, dd should consider using setup_requires and possibly a command extension to handle whatever it's doing, or just not try to do what it's doing as part of installation.

@qwcode qwcode closed this as completed May 3, 2015
@rbtcollins

Perhaps we should consider this a defect (that it's possible) and arrange to detect it?

@qwcode
Contributor

qwcode commented May 3, 2015

maybe, we could re-check before every install?

but as it is, the check_if_exists call for ply==3.4 is already happening after the install of ply-3.6, so in theory it could detect it. It doesn't, I guess because it is 3 subprocesses removed, and there is a delay in accessing the metadata?

@johnyf
Author

johnyf commented May 3, 2015

It took a long time and much study to arrive at this solution. Some of the reasons are documented in this commit message. It notes why setup_requires is not used.

[I just noticed that the setuptools issue that led me not to use setup_requires appears to have since been fixed. Nonetheless, this would still require building the tables mentioned below inside the call to setuptools.setup -- I'm trying to avoid this nesting.]

Also, I have studied the post-install hook approach with setuptools, as used, for example, by pycparser. It is nested, complicated (not very PEP 20-friendly), and touches setuptools internals.

Calling pip inside setup.py is the cleanest solution, and the simplest to understand (flat), maintain, and debug.

Any package that uses ply, or another parser generator that saves tables (or any other form of file persistence), needs to ensure that the tables are placed inside site-packages at installation time (necessary because users may not have write privileges to site-packages afterwards). This can happen either by generating them before copying to site-packages (like dd) or after (like pycparser). I consider the first approach linear and simpler, given the current API of setuptools.

It also works unchanged for python setup.py develop.

Ideally, setuptools would have provided a clean way of breaking up the steps of dependency installation, package installation, etc. and allow doing things in between flatly, without having to pass cmdclass. Since this is not the case, the simplest thing is to call pip first, then setuptools.

A minor tweak to avoid race conditions was to detect the egg_info argument passed by pip during its information-collection phase.
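Such an egg_info guard could look roughly like this (an illustrative sketch, not dd's actual code): treat a setup.py invocation as metadata-only when egg_info is the sole command given.

```python
# Sketch of an "egg_info" guard for a setup.py (illustrative, not dd's code).
def collecting_metadata(argv):
    """Return True when setup.py was invoked only to produce metadata,
    as pip does with `setup.py egg_info` during its collection phase."""
    commands = [arg for arg in argv if not arg.startswith('-')]
    return commands == ['egg_info']
```

In a setup.py this would be called as collecting_metadata(sys.argv[1:]), skipping any dependency installation when it returns True.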

Somewhat more generally, it would be good if there were some way of communicating to pip what needs to be installed before calling setuptools.setup, other than the ways that go through setuptools.setup. Sort of a replacement for setup_requires that ensures the requirement will be installed permanently.

The approach of calling pip inside setup.py suffers from the unidirectionality of information flow: the nested call to pip is oblivious to the topological sorting done by the pip that is running setup.py.

@johnyf
Author

johnyf commented May 3, 2015

Correction to my comment above: there is (implicit) communication from the calling pip instance to the nested pip instance: the latter should be able to see a dependency (like ply) installed by the caller (which reasoned about the maximal acceptable version for that dependency), and skip installing it, since it is already satisfied.

This simple behavior was one of the original reasons that I chose to go with calling pip in setup.py.

@johnyf
Author

johnyf commented May 3, 2015

So I think I've found the cause. It is not as conjectured above. What happens is:

  • unlike the other packages, I had not yet applied to dd's setup.py the guard that checks whether the arguments contain "egg_info". So dd installs ply v3.6 during the collection phase of pip -r requirements.txt. Confirmed with pip install -r --no-install requirements.txt.
  • apparently pip doesn't notice that ply was installed in that phase, and goes ahead installing ply v3.4 on top of it (w/o uninstalling), because it figured out that this is the maximal version satisfying all requirements.
  • pip continues with the next phase, installing packages. During this phase there is no problem, contrary to my hypothesis above. The nested calls to pip must be detecting the existing ply and skipping it. Confirmed by inserting the "egg_info" guard in dd, which leads to an uncorrupted site-packages. Also, interrupting the installation after ply has been installed from the requirements, but before anything else has been installed (with a requirements.txt that contains more packages), shows that site-packages has already been corrupted by then.

So pip should check between the collection and installation phase that no one has installed any of the packages scheduled for installation. In principle, this could also happen by a completely independent agent installing ply while running pip -r (#2746).
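The re-check proposed here could be sketched as a snapshot-and-compare of the installed distributions (using importlib.metadata purely for illustration; it did not exist when this thread was written):

```python
# Sketch: snapshot the installed distributions after collection, and
# verify nothing changed before starting the installation phase.
from importlib import metadata

def snapshot_installed():
    """Map each installed distribution's name to its version."""
    return {d.metadata['Name']: d.version for d in metadata.distributions()}

def environment_unchanged(before):
    """True if no package was installed or removed since `before` was taken."""
    return snapshot_installed() == before
```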

@ionelmc
Contributor

ionelmc commented May 3, 2015

My opinion is that pip should keep a lock file in the target site-packages and disallow using multiple pip install at the same time.
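This suggestion could be sketched with an exclusive lock file created via O_EXCL (the file name and error message here are hypothetical):

```python
# Hypothetical sketch: hold an exclusive lock file in the target
# site-packages while installing, so a concurrent `pip install` into
# the same environment fails fast instead of corrupting packages.
import contextlib
import errno
import os

@contextlib.contextmanager
def install_lock(site_packages):
    """Fail fast if another installer holds the lock in site_packages."""
    path = os.path.join(site_packages, '.pip-install.lock')
    try:
        # O_EXCL makes creation atomic: it fails if the file exists.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except OSError as e:
        if e.errno == errno.EEXIST:
            raise RuntimeError('another installer appears to be running')
        raise
    try:
        yield
    finally:
        os.close(fd)
        os.remove(path)
```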

@rbtcollins

With sufficient work, we could error if the installed packages have changed during collection. Any change to the package database during collection invalidates all of collection: newly added packages may conflict with requirements (#2687), or, if they satisfy a desired package, may have different dependencies of their own (#988), so at minimum we would have to perform the entire collection again.

Dealing with other installing tools implies that the locking mechanism should be a distutils PEP and standardised.

@johnyf
Author

johnyf commented May 4, 2015

Thanks for the ideas. I think that (global) locking is the appropriate way to prevent concurrent installations from messing up site-packages.
This addresses concurrency issues. Build dependencies (#2381) seem to be a different issue, for the following reasons.

Revisiting the nested pip calls altogether, they are not needed any more. They were introduced mainly because pip did not promise to install dependencies before dependents. Now that it does (provided that the dependency graph is acyclic), it suffices to check that they are present or try-except, instead of calling pip.main.
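The check described here (verify presence instead of calling pip.main) might look like the following sketch; the module names are illustrative:

```python
# Sketch: instead of calling pip.main() from setup.py, verify that
# build dependencies are importable and report any that are missing.
import importlib

def missing_build_deps(modules=('ply',)):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in modules:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```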

Any installation using pip should now work. The only use cases that won't are calls to python setup.py install (develop, etc.) when build dependencies are missing. These are typically done manually, so the warning or error will lead to either using pip instead, or first installing dependencies (if necessary for building), or re-running (if only caching tables).

Also, detecting that "egg_info" is given as an argument is a problematic approach, because a user (unlike pip) can pass both the "egg_info" and "install" options. Packages that really need the build dependencies must check that a call is only for "egg_info":

import sys

# True if an installation or build was requested,
# not only metadata collection
install = any(
    s in option
    for option in sys.argv
    for s in ('install', 'develop', 'dist', 'build'))
if install and missing_dependencies:
    raise ImportError('Required build dependencies not found.')

This way, they allow pip to collect information.

Fortunately, things can be simplified for packages that only cache things: print a warning, instead of distinguishing between "egg_info" and installation runs. This leads to the desired result when using pip, and a sub-optimal installation (no cached tables) when running python setup.py without the optional build dependencies already present.
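The warn-instead-of-fail behavior for cache-only packages could be sketched as follows (the function name and message are hypothetical):

```python
# Sketch: warn instead of failing when an optional build dependency
# is absent, so `python setup.py install` still completes.
import sys

def warn_if_missing(module):
    """Return True if `module` imports; otherwise warn and return False."""
    try:
        __import__(module)
        return True
    except ImportError:
        sys.stderr.write(
            'WARNING: optional build dependency %s not found; '
            'parser tables will not be cached\n' % module)
        return False
```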

@rbtcollins

Build dependencies should be declared in setup_requires. I realise that that has some poor behaviours today, but it is the defined interface - and I'm working up the stack to be able to make that declarative soon I hope.

johnyf added a commit to tulip-control/dd that referenced this issue May 4, 2015
johnyf added a commit to johnyf/promela that referenced this issue May 4, 2015
johnyf added a commit to tulip-control/omega that referenced this issue May 4, 2015
johnyf added a commit to johnyf/openpromela that referenced this issue May 8, 2015
johnyf added a commit to johnyf/openpromela that referenced this issue Jun 17, 2015
johnyf added a commit to tulip-control/tulip-control that referenced this issue Sep 11, 2015
These are the requirements listed in `setup.py`. Extras have been included, except for those that require fragile compilation / linking (`cvxopt` and dependents).

Versions are pinned (good practice for requirements files) to the same numbers as those in `setup.py`. These are outdated versions and should be changed explicitly, in separate commits, so that the changes don't go unnoticed.

Note that this provides a common reference.

The introduction of `requirements.txt` was motivated by (among other reasons) the failure of `python setup.py install` when `setuptools` attempts to install `scipy` before `numpy`. This failure can be avoided with a sufficiently updated `pip`, because `pip` collects all dependencies, sorts them topologically, and installs them in that order. Therefore, without a `requirements.txt`, it is more cumbersome to install `tulip` with `python setup.py install`.
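The topological ordering described can be sketched with a minimal Kahn-style sort over a dependency graph (a toy illustration, not pip's actual implementation):

```python
# Toy sketch of installing dependencies before dependents
# (Kahn's algorithm; not pip's actual implementation).
from collections import deque

def install_order(deps):
    """deps maps each package to the set of packages it depends on."""
    remaining = {pkg: set(d) for pkg, d in deps.items()}
    ready = deque(sorted(p for p, d in remaining.items() if not d))
    order = []
    while ready:
        pkg = ready.popleft()
        order.append(pkg)
        for other, d in remaining.items():
            if pkg in d:
                d.remove(pkg)
                if not d:
                    ready.append(other)
    if len(order) != len(remaining):
        raise ValueError('cyclic dependencies')
    return order
```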

Another reason is the need to run `python setup.py install` twice in case `ply` is absent. The message printed at the end of an installation can go unnoticed. Note that topological sorting by `pip` avoids this issue too, by installing `ply` before `tulip`.

Also, cf: pypa/pip#2745
@lock lock bot added the auto-locked Outdated issues that have been locked by automation label Jun 4, 2019
@lock lock bot locked as resolved and limited conversation to collaborators Jun 4, 2019