Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Do not scan whole file tree when making MANIFEST #764

Merged
merged 1 commit into from
Oct 15, 2016

Conversation

mx-moth
Copy link
Contributor

@mx-moth mx-moth commented Aug 29, 2016

When building a MANIFEST from a MANIFEST.in, setuptools previously
scanned the whole directory tree in to a list, and then picked matching
files based on MANIFEST.in commands from this list.

Now, files are found using the glob library from Python 3.5. This only
explores directories that need to be scanned, resulting in a large speed
improvement for projects with large file trees. A modified glob
module has been included. It has been changed to support back to
Python 2.6, and to include .hidden files in its matches. The previous
functionality included .hidden files in its glob matches. It is
unclear if this behaviour is desired and required, or accidental and not
required, but for strict backwards-compatibility, this behaviour is
kept.

Each command in the MANIFEST.in is now represented by its own function
on the FileList (include, exclude, graft, etc.) to allow for an
efficient implementation. The previous commands
FileList.include_pattern and FileList.exclude_pattern still
exist for backwards compatibility, but these use the slow 'scan all
files' method, so are discouraged. global_include by its nature must
scan all directories in the project to work, so this does not receive
any speed improvements.

The changes will speed up creating packages for the vast majority of
users. There are a few unusual corner cases, such as multiple graft
commands operating on the same set of directories, that will be slower.
These can be solved by consolidating the overlapping graft commands in
to one command.

Note that this is a competing PR to #610. The approach in #610 made it hard to correctly implement graft.

@mx-moth mx-moth force-pushed the fast_egg_info_command branch 3 times, most recently from 7de66bb to 5c655a8 Compare August 29, 2016 05:48
@mx-moth
Copy link
Contributor Author

mx-moth commented Aug 29, 2016

These tests seem to be failing on CI for unrelated reasons. The tests fail even on master currently. I do not know why, and the stack traces are rather confusing.

@mx-moth
Copy link
Contributor Author

mx-moth commented Aug 29, 2016

Oh, py.test v3 is still broken. Running the tests with py.test>=2.8,<3 works just fine.

@mx-moth
Copy link
Contributor Author

mx-moth commented Aug 29, 2016

OK, test pass again. I've rebased this on top of #765, which excludes py.test>=3.0 completely.

@benoit-pierre
Copy link
Member

This patch is great, and fix my issue with following symlinks when building the manifest. I however get a test failure on my system:

self = <setuptools.tests.test_manifest.TestFileListTest object at 0x2b1c7b96d850>

    def test_include_pattern(self):
        # return False if no match
        file_list = FileList()
        self.make_files([])
        assert not file_list.include_pattern('*.py')

        # return True if files match
        file_list = FileList()
        self.make_files(['a.py', 'b.txt'])
        assert file_list.include_pattern('*.py')

        # test * matches all files
        file_list = FileList()
        self.make_files(['a.py', 'b.txt'])
        file_list.include_pattern('*')
>       assert file_list.files == ['a.py', 'b.txt']
E       assert ['b.txt', 'a.py'] == ['a.py', 'b.txt']
E         At index 0 diff: 'b.txt' != 'a.py'
E         Use -v to get the full diff

This is both with Python 2.7.12 and Python 3.5.2 on Arch Linux.

@mx-moth
Copy link
Contributor Author

mx-moth commented Sep 15, 2016

Thanks for checking this.

The order of files in the file list appears to depend on the underlying file system. My system reported the files in the order they were created, but apparently yours does not. I've wrapped the file list in a set() for the comparison, as the order is not part of the test.

There is already a separate test that checks that FileList.sort() does the sensible thing.

@mx-moth
Copy link
Contributor Author

mx-moth commented Sep 28, 2016

The code for this patch is complete, barring any feedback. Is there anything more I can do to help this patch along?

Copy link
Member

@jaraco jaraco left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks good. I would simply accept it except for the questions I raise above. Please have a look. Thanks.

return found

def include(self, pattern):
"""Include files that match 'pattera'n."""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment looks like it has a typo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -239,7 +322,151 @@ def check_broken_egg_info(self):


class FileList(_FileList):
"""File list that accepts only existing, platform-independent paths"""
# Implementations of the various MANIFEST.in commands
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, this class was a subclass of the implementation in distutils. It's not clear from this diff if there's still value in subclassing _FileList. Is this a complete re-implementation or does it also re-use some of the upstream implementation? Do you expect that these changes could be merged upstream at some point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I kept it for backwards compatibility. Packages or plugins may depend on calling FileList.include_pattern, the old way of adding packages, and I did not want to break these.

self._add_egg_info(cmd=ei_cmd)
self.filelist.include_pattern("*", prefix=ei_cmd.egg_info)

def _add_egg_info(self, cmd):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Previously, cmd.egg_base was processed, but now it appears not to be. What new behavior obviates the need for this behavior?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The behaviour is still required, but is now as simple as self.filelist.graft(ei_cmd.egg_info), assuming I understand what is going in _add_egg_info(). I've replaced that whole function with that one-liner.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay. Great!

"""
Filename globbing utility. Mostly a copy of `glob` from Python 3.5.

Changes include:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This file adds a big amount of technical debt to the project. Although I believe it will work, it's a fork of the functionality in Python 3.5 without a merge strategy. Moreover, it has already diverged from the upstream implementation. Thanks, by the way for describing the divergence (I worry other contributors might not be so diligent).

But I wonder - if you could rely on Python 3.5, would the implementation have been different, or would you have had to override all of this behavior not to ignore hidden files? For example, if you look at the recent setuptools.sdist.py36compat.py, you'll see functionality that's a backport of code in Python 3.7, which can be removed when Python 3.7 is the minimum supported version. Could you do something similar here, where the forward-compatibility implementation and the divergent behavior are maintained separately (even in separate files)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would not be possible to ship the glob library from Py3.5 without any modifications, as yield from is not valid syntax in older Pythons, so at least that needs to change regardless.

To make the library compatible with Py2, I also had to use six for dealing with strings/bytes. These changes require modifications to internal function code, so are not easily patched from "outside". I notice that I did not mention this in the list of changes, I will add it now.

The same problem exists with including hidden files - is_hidden is called from a number of functions, so these functions need to be changed to not call it. Alternatively, is_hidden could be changed to always return False, which would have the same effect.

Finally, all the functions in the glob library call each other directly. There is no good way of doing dependency injection or something similar to override the specific functions that need patching.

I completely agree with you that this adds some undesirable technical debt. If you can think of any good solutions to the problems above I am very happy to implement them.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is also worth noting that a diff -U5 /usr/lib/python3.5/glob.py setuptools/glob.py shows a very nice summary of the changes.

@mx-moth mx-moth force-pushed the fast_egg_info_command branch 2 times, most recently from d31f12e to 44630f1 Compare October 14, 2016 22:10
@jaraco jaraco merged commit 566dc35 into pypa:master Oct 15, 2016
@mx-moth
Copy link
Contributor Author

mx-moth commented Oct 15, 2016

Thanks for the merge!

mx-moth added a commit to mx-moth/setuptools that referenced this pull request Nov 20, 2016
After pypa#764, `global-exclude .pyc` no longer excluded `.pyc` files.
This fixes that regression, and adds a test for this behaviour.
@spookylukey
Copy link
Contributor

@timheap thanks for taking this on and succeeding where I ran out of steam! and thanks @jaraco for the merge. This has made a world of difference to my use case - tox-dev/tox#293 (comment)

reinout added a commit to nens/cookiecutter-djangosite-template that referenced this pull request Feb 9, 2018
Fixed in setuptools (quite some time ago) in pypa/setuptools#764
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants