-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Much faster implementation of FileList, for big egg_info speedups #610
Conversation
The main reluctance I have with this approach is it increases the reliance on I'd really prefer a solution that relies on SCM file finders and limit the reliance on the (soft-) deprecated MANIFEST.in. If a SCM file finder solution is nonviable, I'd at least like to have robustly determined such before adopting another approach. |
In what way is this an increased reliance on MANIFEST.in? The situation is exactly the same as before, my patch is simply a change to the implementation of the MANIFEST.in processing to be (typically) much faster than the one inherited from distutils. It would be nice if this wasn't necessary, but at the moment, using an SCM file finder will not improve the performance problems with egg_info, because
Number 1 is explicitly talked about in the setuptools-git docs - https://pypi.python.org/pypi/setuptools-git Is there actually a plan to change this and a way forward to solve these? With this in mind, I can't see how it is possible to rely on an SCM file finder to 'provide' optimisation. The only thing I can think of is if we use '.gitignore' etc. files to avoid searching into directories. However, this is full of problems:
It also seems a very strange dependency - why do we need an SCM tool to tell us not to bother looking in certain directories, when there wasn't a reason to look in them in the first place? With hindsight, "first get a list all the files in all the folders" was simply not a very good strategy for implementing the MANIFEST.in language, and the solution is simply "don't do that", which is what this patch implements. |
If someone wants to gain the performance benefits of this patch, they will be required to create a MANIFEST.in, even if the project is relying on SCM info, and adds confounding factors to its purpose. What I'd really like to see is a logic flow where file discovery techniques are clearly described and selected... something like a Discovery class with three implementations, one for Manifest.in, another for SCM, and another for the most simple heuristic (traversal including *.py files). And exactly one of those is chosen. What worries me most is this logic feels interwoven with an already complicated implementation. Every time I look at it, I feel like I don't fully grasp all the implications. I worry the added complication will lead to more problems with the design down the road, and it won't be obvious where this functionality exists or why. But I'm probably just overcautious. Do feel free to continue with the effort. We can roll it out and see what reception it gets. |
Many projects I work on contain files that are tracked in git that should not be packaged (such as docs, tests, etc), and files that are not tracked but should be packaged (such as compiled JS, CSS assets). Integrating with git or any other SCM would be counter productive for the projects I work on, and I suspect many others as well. While the overhead in the projects I work on is not quite as extreme as for @spookylukey, it is still noticeable. A fix would be nice. I've also submitted a patch for pip, and another for setuptools (#733) for similar speed improvements. I was very happy to see a fix for this issue already submitted! @jaraco I understand that you desire a more flexible solution to this problem, with multiple ways of specifying files to include. I've also struggled with the limitations of the MANIFEST file format. This PR solves a problem that people are experiencing right now though. Merging this PR would not prevent you from implementing your Discovery class in the future, and when the Discovery class is implemented this solution could be made more generic so more classes could reuse it. Is there anything I can do to help move this PR along? |
I believe this PR is a work in progress. My last message indicated that we should roll it out, so I'm waiting on a follow-up to say that it's ready to go. @timheap, you can help by reconciling which of these two patches are preferable (or justifying how they're both desirable). |
@jaraco This patch and #733 address different performance problems in different areas, so they are both desirable. They affect unrelated sections of setuptools, the only similarity is the performance fix. I'll have a look to see if I can help finish this PR as well. I installed the combination of this PR and mine and it made setuptools very speedy! |
@timheap thanks for the offer of help. I'm unlikely to have time to look more at this in coming months. In terms of implementation, I don't think there was a huge amount more to do on this PR, although there is at least one TODO in there. However, more tests would be nice. I'm afraid I can't remember how much of this code had test coverage. I think it probably needs more testing on other platforms. |
@jaraco thanks for your continued help/patience with this PR, by the way! |
I'm going to close this PR, but to be sure, I'm welcoming further work on it. Just open a new one when you have more code for consideration. |
Refs #249 and #450
This is not a complete PR, but request for the approach to be confirmed before I do more work.
There are some notes:
Code
The approach, as outlined before, is to reduce
findall
to a no-op, so the directory tree is not scanned, and haveinclude_pattern
(which handles all 'include' type directives) execute anos.walk
directly, but trimming the search tree according to the different types of rules.It currently doesn't handle all cases correctly (specific
graft
doesn't work correctly), but is a proof of concept. If the approach is accepted, I would add more tests (since we can no longer rely on distutils have tests since we are overriding the implementation) and fix this bug. All the tests in setuptools pass with the patch.Some testing on Windows will be necessary, the directory path separator thing gets quite involved, I don't have a Windows box I could easily use.
Performance tests are hard - we really want to test that other parts of the filesystem are not being scanned, which is hard to do.
Removal of
_add_egg_info
function, called fromadd_defaults
I believe this is correct. The behaviour of
add_defaults
seems to be tested bytest_egg_base_installed_egg_info
andtest_install_requires_with_markers
which both fail if you break it, along with a few other tests that fail.AFAICS, the code is only working around the fact that
include_pattern
queries theallfiles
attribute and not the filesystem, and is therefore out of date with respect to files that have been created by setuptools itself. Since we've changed the behaviour of include_pattern, I don't think this is necessary, and the tests seem to verify this.Metrics
Obviously this depends on how many files you've got in your working folder, and on whether the directory listings are in the OS cache.
I used this project as a not-too-extreme example, and because I actually work on this project:
https://github.com/django-money/django-money
It uses tox to test against about 80 environments, which are stored as virtualenvs in .tox
Without patch
From a 'cold' state, but with virtualenvs created from previous usage, I ran the tests on a single environment:
This took 647 seconds (10 minutes), of which just 2.8 seconds were the actual tests i.e. 645 seconds of overhead from setuptools and tox.
Doing it a second time, this reduced to 37 seconds of overhead, due to OS caching of filesystem.
With this patch
With the patch, the setuptools + tox overhead of the same command reduced to just 2 seconds, whether from warm or cold i.e. a speed up of 18 times to 320 times (and more importantly, down to "a few seconds", an acceptable time to wait for your tests to start running). This would make a BIG difference to workflow on quite a few projects I use.
To reproduce this speedup, however, you will need to install the dev version of setuptools in both the outer project (in which tox is installed) and in the virtualenvs it created.
There are obviously cases and projects where this strategy will be less efficient than the previous one, because it requires multiple
os.walk
walks over the file system, instead of just one, but for normal projects, and for any that contain large numbers of non source files, it is likely to be a very bigwin.
Notes
There will be no speedups if MANIFEST.in contains a global-include directive, and in fact probably severe slow-downs if you have lots of files or lots of global-include directives. This is because you cannot trim the search tree at all for this directive - a global-include includes all subdirectories, including hidden ones such as ".tox", ".my_IDE_like_to_put_stuff_here" etc. This means that in general the global-include directive should be avoided, because most tools expect to be able to put all kinds of things in hidden folders and have them ignored. This warning should probably be submitted to distutils as a documentation patch.