Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Much faster implementation of FileList, for big egg_info speedups #610

Closed
wants to merge 1 commit into from

Conversation

spookylukey
Copy link
Contributor

Refs #249 and #450

This is not a complete PR, but request for the approach to be confirmed before I do more work.

There are some notes:

Code

The approach, as outlined before, is to reduce findall to a no-op, so the directory tree is not scanned, and have include_pattern (which handles all 'include' type directives) execute an os.walk directly, but trimming the search tree according to the different types of rules.

It currently doesn't handle all cases correctly (specific graft doesn't work correctly), but is a proof of concept. If the approach is accepted, I would add more tests (since we can no longer rely on distutils have tests since we are overriding the implementation) and fix this bug. All the tests in setuptools pass with the patch.

Some testing on Windows will be necessary, the directory path separator thing gets quite involved, I don't have a Windows box I could easily use.

Performance tests are hard - we really want to test that other parts of the filesystem are not being scanned, which is hard to do.

Removal of _add_egg_info function, called from add_defaults

I believe this is correct. The behaviour of add_defaults seems to be tested by test_egg_base_installed_egg_info and test_install_requires_with_markers which both fail if you break it, along with a few other tests that fail.

AFAICS, the code is only working around the fact that include_pattern queries the allfiles attribute and not the filesystem, and is therefore out of date with respect to files that have been created by setuptools itself. Since we've changed the behaviour of include_pattern, I don't think this is necessary, and the tests seem to verify this.

Metrics

Obviously this depends on how many files you've got in your working folder, and on whether the directory listings are in the OS cache.

I used this project as a not-too-extreme example, and because I actually work on this project:

https://github.com/django-money/django-money

It uses tox to test against about 80 environments, which are stored as virtualenvs in .tox

Without patch

From a 'cold' state, but with virtualenvs created from previous usage, I ran the tests on a single environment:

tox -e django18-py27

This took 647 seconds (10 minutes), of which just 2.8 seconds were the actual tests i.e. 645 seconds of overhead from setuptools and tox.

Doing it a second time, this reduced to 37 seconds of overhead, due to OS caching of filesystem.

With this patch

With the patch, the setuptools + tox overhead of the same command reduced to just 2 seconds, whether from warm or cold i.e. a speed up of 18 times to 320 times (and more importantly, down to "a few seconds", an acceptable time to wait for your tests to start running). This would make a BIG difference to workflow on quite a few projects I use.

To reproduce this speedup, however, you will need to install the dev version of setuptools in both the outer project (in which tox is installed) and in the virtualenvs it created.

There are obviously cases and projects where this strategy will be less efficient than the previous one, because it requires multiple os.walk walks over the file system, instead of just one, but for normal projects, and for any that contain large numbers of non source files, it is likely to be a very big
win.

Notes

There will be no speedups if MANIFEST.in contains a global-include directive, and in fact probably severe slow-downs if you have lots of files or lots of global-include directives. This is because you cannot trim the search tree at all for this directive - a global-include includes all subdirectories, including hidden ones such as ".tox", ".my_IDE_like_to_put_stuff_here" etc. This means that in general the global-include directive should be avoided, because most tools expect to be able to put all kinds of things in hidden folders and have them ignored. This warning should probably be submitted to distutils as a documentation patch.

@jaraco
Copy link
Member

jaraco commented Jun 24, 2016

The main reluctance I have with this approach is it increases the reliance on MANIFEST.in, which is a duplicative, redundant functionality with source-code control systems (as I mentioned in this comment). The SCM systems keep a manifest of files in the project and have mechanisms to ignore whole trees (e.g. .gitignore). Furthermore, these systems provide a robust, layered mechanism for ignoring files and trees at the system or user level in addition to the project level. The more we can leverage this functionality, which is almost universally adopted, the more we can rely on these advances. I imagine developing from non-SCM tarballs is rare and that issues like #249 can be disregarded in those cases.

I'd really prefer a solution that relies on SCM file finders and limit the reliance on the (soft-) deprecated MANIFEST.in. If a SCM file finder solution is nonviable, I'd at least like to have robustly determined such before adopting another approach.

@spookylukey
Copy link
Contributor Author

In what way is this an increased reliance on MANIFEST.in? The situation is exactly the same as before, my patch is simply a change to the implementation of the MANIFEST.in processing to be (typically) much faster than the one inherited from distutils.

It would be nice if this wasn't necessary, but at the moment, using an SCM file finder will not improve the performance problems with egg_info, because FileList.findall() runs as well as the SCM file finder. AFAICS, this behaviour can't change, because:

  1. Sometimes you don't want to package everything under source control
  2. Sometimes you want to add something not under source control (e.g. a file generated from one under SCM).

Number 1 is explicitly talked about in the setuptools-git docs - https://pypi.python.org/pypi/setuptools-git

Is there actually a plan to change this and a way forward to solve these?

With this in mind, I can't see how it is possible to rely on an SCM file finder to 'provide' optimisation. The only thing I can think of is if we use '.gitignore' etc. files to avoid searching into directories. However, this is full of problems:

  1. .gitignore files will probably list generated files/directories, but we might actually want these in distributed tarballs.
  2. It could easily get complicated by SCM 'global' ignore files (e.g. I have .tox in my global .gitignore, not local)
  3. It would require a lot of additional functionality in setuptools, interfacing with different SCMs and all the ways they might specify files to ignore.

It also seems a very strange dependency - why do we need an SCM tool to tell us not to bother looking in certain directories, when there wasn't a reason to look in them in the first place? With hindsight, "first get a list all the files in all the folders" was simply not a very good strategy for implementing the MANIFEST.in language, and the solution is simply "don't do that", which is what this patch implements.

@jaraco
Copy link
Member

jaraco commented Jul 24, 2016

In what way is this an increased reliance on MANIFEST.in?

If someone wants to gain the performance benefits of this patch, they will be required to create a MANIFEST.in, even if the project is relying on SCM info, and adds confounding factors to its purpose.

What I'd really like to see is a logic flow where file discovery techniques are clearly described and selected... something like a Discovery class with three implementations, one for Manifest.in, another for SCM, and another for the most simple heuristic (traversal including *.py files). And exactly one of those is chosen.

What worries me most is this logic feels interwoven with an already complicated implementation. Every time I look at it, I feel like I don't fully grasp all the implications. I worry the added complication will lead to more problems with the design down the road, and it won't be obvious where this functionality exists or why.

But I'm probably just overcautious. Do feel free to continue with the effort. We can roll it out and see what reception it gets.

@mx-moth
Copy link
Contributor

mx-moth commented Aug 15, 2016

Many projects I work on contain files that are tracked in git that should not be packaged (such as docs, tests, etc), and files that are not tracked but should be packaged (such as compiled JS, CSS assets). Integrating with git or any other SCM would be counter productive for the projects I work on, and I suspect many others as well.

While the overhead in the projects I work on is not quite as extreme as for @spookylukey, it is still noticeable. A fix would be nice. I've also submitted a patch for pip, and another for setuptools (#733) for similar speed improvements. I was very happy to see a fix for this issue already submitted!

@jaraco I understand that you desire a more flexible solution to this problem, with multiple ways of specifying files to include. I've also struggled with the limitations of the MANIFEST file format. This PR solves a problem that people are experiencing right now though. Merging this PR would not prevent you from implementing your Discovery class in the future, and when the Discovery class is implemented this solution could be made more generic so more classes could reuse it.

Is there anything I can do to help move this PR along?

@jaraco
Copy link
Member

jaraco commented Aug 15, 2016

I believe this PR is a work in progress. My last message indicated that we should roll it out, so I'm waiting on a follow-up to say that it's ready to go.

@timheap, you can help by reconciling which of these two patches are preferable (or justifying how they're both desirable).

@mx-moth
Copy link
Contributor

mx-moth commented Aug 15, 2016

@jaraco This patch and #733 address different performance problems in different areas, so they are both desirable. They affect unrelated sections of setuptools, the only similarity is the performance fix.

I'll have a look to see if I can help finish this PR as well. I installed the combination of this PR and mine and it made setuptools very speedy!

@spookylukey
Copy link
Contributor Author

@timheap thanks for the offer of help.

I'm unlikely to have time to look more at this in coming months. In terms of implementation, I don't think there was a huge amount more to do on this PR, although there is at least one TODO in there.

However, more tests would be nice. I'm afraid I can't remember how much of this code had test coverage. I think it probably needs more testing on other platforms.

@spookylukey
Copy link
Contributor Author

@jaraco thanks for your continued help/patience with this PR, by the way!

@jaraco
Copy link
Member

jaraco commented Oct 14, 2016

I'm going to close this PR, but to be sure, I'm welcoming further work on it. Just open a new one when you have more code for consideration.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants