GH-116380: Speed up glob.glob() by removing some system calls #116392

Open
wants to merge 73 commits into
base: main
Changes from 49 commits
db3c620
GH-116380: Make `glob.glob()` twice as fast
barneygale Mar 5, 2024
9e1f059
Use `os.listdir()` if we don't need to check entry type.
barneygale Mar 5, 2024
10432df
A few small speedups.
barneygale Mar 6, 2024
7e389e2
Simplify prefix removal
barneygale Mar 6, 2024
8680a0a
Re-implement `glob0()`, `glob1()`, and `has_magic()`.
barneygale Mar 6, 2024
3bf3124
Fix errant `StopIteration`.
barneygale Mar 6, 2024
f8fb992
Skip compiling pattern for consecutive `**` segments.
barneygale Mar 6, 2024
50ef080
Clarify regex/path building in literal and recursive selectors.
barneygale Mar 6, 2024
ccefacd
Simplify code to ignore root_dir.
barneygale Mar 6, 2024
fa951f6
Fix possible Windows separator issue.
barneygale Mar 6, 2024
0aec12c
Address some review feedback.
barneygale Mar 6, 2024
72691ba
Use assignment expressions in a couple of places
barneygale Mar 6, 2024
c58dd21
Replace lambda with `operator.not_`.
barneygale Mar 6, 2024
c361ec9
Merge branch 'main' into gh-116380
barneygale Mar 6, 2024
22b30db
Speed up `_add_trailing_slash()`
barneygale Mar 6, 2024
83b70bd
Speed up `select_literal()`
barneygale Mar 7, 2024
1d32d14
Speed up `select_recursive()`
barneygale Mar 7, 2024
1e5aacc
Merge branch 'main' into gh-116380
barneygale Mar 17, 2024
a038bb8
Merge branch 'main' into gh-116380
barneygale Mar 18, 2024
f1440a9
Cache compiled patterns rather than selectors.
barneygale Mar 18, 2024
9c64643
Remove a bit of code duplication.
barneygale Mar 18, 2024
b0e8ba6
Fix stray newline
barneygale Mar 19, 2024
1b1233e
Merge branch 'main' into gh-116380
barneygale Mar 22, 2024
0e02ec5
Remove tests for glob0 and glob1
barneygale Mar 28, 2024
be4865e
Add a bunch of comments explaining the more subtle parts.
barneygale Mar 29, 2024
203e8ef
Merge branch 'main' into gh-116380
barneygale Apr 1, 2024
13355a0
Clarify variable naming in iglob()
barneygale Apr 3, 2024
2e5cebd
Use keyword arguments to pass True/False/None literals, for clarity.
barneygale Apr 4, 2024
5eba2eb
Speed up recursive globbing very slightly
barneygale Apr 4, 2024
b0a99b7
Merge branch 'main' into gh-116380
barneygale Apr 5, 2024
ad0ece8
Implement recursive wildcards with a stack
barneygale Apr 5, 2024
cafe9be
Add argument defaults, simplify code slightly.
barneygale Apr 5, 2024
301d922
Also make rel_path optional
barneygale Apr 5, 2024
beb2507
Optimise _add_trailing_slash
barneygale Apr 5, 2024
312c73a
Remove use of os.listdir() -- doesn't generalise
barneygale Apr 6, 2024
ae820e2
Add `_Globber` class; prepare for merger with pathlib globbing.
barneygale Apr 6, 2024
dcfe11d
Unify with pathlib implementation \o/
barneygale Apr 6, 2024
123a0f6
Use literal selector only if no case sensitivity preference is given.
barneygale Apr 6, 2024
0ed7b9c
Fix a few tests
barneygale Apr 6, 2024
aceb85f
Fix a few more tests.
barneygale Apr 6, 2024
b04de9d
Merge commit '689ada79150f28b0053fa6c1fb646b75ab2cc200' into gh-116380
barneygale Apr 10, 2024
3eb2d19
Merge branch 'main' into gh-116380
barneygale Apr 10, 2024
8a15db0
Fix select() argument order.
barneygale Apr 10, 2024
7eb3e61
Merge branch 'main' into gh-116380
barneygale Apr 12, 2024
316ea56
Merge branch 'main' into gh-116380
barneygale May 3, 2024
2018027
Support `include_hidden` and `dir_fd` in `pathlib._glob`.
barneygale May 3, 2024
2f21626
Fix stray newline
barneygale May 3, 2024
339df68
Update Lib/pathlib/_glob.py
barneygale May 4, 2024
28aa95f
Fix docs
barneygale May 4, 2024
abcb1f8
Test for unique results
barneygale May 4, 2024
71387a6
Spacing
barneygale May 4, 2024
de22de6
Merge branch 'main' into gh-116380
barneygale May 5, 2024
8b08374
Merge branch 'main' into gh-116380
barneygale May 7, 2024
54efa7c
Merge branch 'main' into gh-116380
barneygale May 8, 2024
cf11922
Update whatsnew
barneygale May 8, 2024
6710924
Merge branch 'main' into gh-116380
barneygale May 14, 2024
a547cd2
Merge branch 'main' into gh-116380
barneygale May 31, 2024
14ae438
Close file descriptors when `recursive_selector` is finalized.
barneygale May 31, 2024
69d7a86
Make `iglob()` a generator.
barneygale May 31, 2024
3b84a1d
Make `_iglob()` a generator.
barneygale May 31, 2024
f9f9a8d
Make `_relative_glob()` a generator.
barneygale May 31, 2024
24a9ee4
Simplify skipping empty string
barneygale May 31, 2024
d05d58d
Merge branch 'main' into gh-116380
barneygale Jun 4, 2024
27c463e
Merge branch 'main' into gh-116380
barneygale Jun 7, 2024
a94f2a7
Make `_GlobberBase` fully abstract.
barneygale Jun 7, 2024
d19bb89
Address review feedback
barneygale Jun 9, 2024
1677588
Typo fix
barneygale Jun 9, 2024
539f044
Speed up pattern parsing.
barneygale Jun 9, 2024
70a1b42
Add test for globbing above recursion limit.
barneygale Jun 12, 2024
1560712
Merge branch 'main' into gh-116380
barneygale Aug 26, 2024
099e86e
Apply suggestions from code review
barneygale Sep 1, 2024
ee76faf
Test that `iglob().close()` closes file descriptors.
barneygale Sep 1, 2024
4cf8a4d
Address some review feedback
barneygale Sep 1, 2024
18 changes: 10 additions & 8 deletions Doc/library/glob.rst
@@ -75,10 +75,6 @@ The :mod:`glob` module defines the following functions:
Using the "``**``" pattern in large directory trees may consume
an inordinate amount of time.

.. note::
This function may return duplicate path names if *pathname*
contains multiple "``**``" patterns and *recursive* is true.

.. versionchanged:: 3.5
Support for recursive globs using "``**``".

@@ -88,6 +84,11 @@ The :mod:`glob` module defines the following functions:
.. versionchanged:: 3.11
Added the *include_hidden* parameter.

.. versionchanged:: 3.14
Matching path names are returned only once. In previous versions, this
function may return duplicate path names if *pathname* contains multiple
"``**``" patterns and *recursive* is true.


.. function:: iglob(pathname, *, root_dir=None, dir_fd=None, recursive=False, \
include_hidden=False)
@@ -98,10 +99,6 @@ The :mod:`glob` module defines the following functions:
.. audit-event:: glob.glob pathname,recursive glob.iglob
.. audit-event:: glob.glob/2 pathname,recursive,root_dir,dir_fd glob.iglob

.. note::
This function may return duplicate path names if *pathname*
contains multiple "``**``" patterns and *recursive* is true.

.. versionchanged:: 3.5
Support for recursive globs using "``**``".

@@ -111,6 +108,11 @@ The :mod:`glob` module defines the following functions:
.. versionchanged:: 3.11
Added the *include_hidden* parameter.

.. versionchanged:: 3.14
Matching path names are yielded only once. In previous versions, this
function may yield duplicate path names if *pathname* contains multiple
"``**``" patterns and *recursive* is true.


.. function:: escape(pathname)

232 changes: 50 additions & 182 deletions Lib/glob.py
@@ -1,13 +1,11 @@
"""Filename globbing utility."""

import contextlib
import os
import fnmatch
import itertools
import stat
import operator
import sys

from pathlib._glob import translate, magic_check, magic_check_bytes
from pathlib._glob import translate, magic_check, magic_check_bytes, Globber

__all__ = ["glob", "iglob", "escape", "translate"]

@@ -43,82 +41,33 @@ def iglob(pathname, *, root_dir=None, dir_fd=None, recursive=False,
"""
sys.audit("glob.glob", pathname, recursive)
sys.audit("glob.glob/2", pathname, recursive, root_dir, dir_fd)
if root_dir is not None:
root_dir = os.fspath(root_dir)
pathname = os.fspath(pathname)
is_bytes = isinstance(pathname, bytes)
if is_bytes:
pathname = os.fsdecode(pathname)
if root_dir is not None:
root_dir = os.fsdecode(root_dir)
anchor, parts = _split_pathname(pathname)

globber = Globber(recursive=recursive, include_hidden=include_hidden)
select = globber.selector(parts)
if anchor:
# Non-relative pattern. The anchor is guaranteed to exist unless it
# has a Windows drive component.
exists = not os.path.splitdrive(anchor)[0]
paths = select(anchor, dir_fd, anchor, exists)
else:
root_dir = pathname[:0]
it = _iglob(pathname, root_dir, dir_fd, recursive, False,
include_hidden=include_hidden)
if not pathname or recursive and _isrecursive(pathname[:2]):
try:
s = next(it) # skip empty string
if s:
it = itertools.chain((s,), it)
except StopIteration:
pass
return it

def _iglob(pathname, root_dir, dir_fd, recursive, dironly,
include_hidden=False):
dirname, basename = os.path.split(pathname)
if not has_magic(pathname):
assert not dironly
if basename:
if _lexists(_join(root_dir, pathname), dir_fd):
yield pathname
else:
# Patterns ending with a slash should match only directories
if _isdir(_join(root_dir, dirname), dir_fd):
yield pathname
return
if not dirname:
if recursive and _isrecursive(basename):
yield from _glob2(root_dir, basename, dir_fd, dironly,
include_hidden=include_hidden)
else:
yield from _glob1(root_dir, basename, dir_fd, dironly,
include_hidden=include_hidden)
return
# `os.path.split()` returns the argument itself as a dirname if it is a
# drive or UNC path. Prevent an infinite recursion if a drive or UNC path
# contains magic characters (i.e. r'\\?\C:').
if dirname != pathname and has_magic(dirname):
dirs = _iglob(dirname, root_dir, dir_fd, recursive, True,
include_hidden=include_hidden)
else:
dirs = [dirname]
if has_magic(basename):
if recursive and _isrecursive(basename):
glob_in_dir = _glob2
else:
glob_in_dir = _glob1
else:
glob_in_dir = _glob0
for dirname in dirs:
for name in glob_in_dir(_join(root_dir, dirname), basename, dir_fd, dironly,
include_hidden=include_hidden):
yield os.path.join(dirname, name)

# These 2 helper functions non-recursively glob inside a literal directory.
# They return a list of basenames. _glob1 accepts a pattern while _glob0
# takes a literal basename (so it only has to check for its existence).

def _glob1(dirname, pattern, dir_fd, dironly, include_hidden=False):
names = _listdir(dirname, dir_fd, dironly)
if not (include_hidden or _ishidden(pattern)):
names = (x for x in names if not _ishidden(x))
return fnmatch.filter(names, pattern)

def _glob0(dirname, basename, dir_fd, dironly, include_hidden=False):
if basename:
if _lexists(_join(dirname, basename), dir_fd):
return [basename]
else:
# `os.path.split()` returns an empty basename for paths ending with a
# directory separator. 'q*x/' should match only directories.
if _isdir(dirname, dir_fd):
return [basename]
return []
# Relative pattern.
if root_dir is None:
root_dir = os.path.curdir
paths = _relative_glob(select, root_dir, dir_fd)

# Ensure that the empty string is not yielded when given a pattern
# like '' or '**'.
paths = itertools.dropwhile(operator.not_, paths)
if is_bytes:
paths = map(os.fsencode, paths)
return paths

_deprecated_function_message = (
"{name} is deprecated and will be removed in Python {remove}. Use "
@@ -128,102 +77,33 @@ def _glob0(dirname, basename, dir_fd, dironly, include_hidden=False):
def glob0(dirname, pattern):
import warnings
warnings._deprecated("glob.glob0", _deprecated_function_message, remove=(3, 15))
return _glob0(dirname, pattern, None, False)
return list(_relative_glob(Globber().literal_selector(pattern, []), dirname))

def glob1(dirname, pattern):
import warnings
warnings._deprecated("glob.glob1", _deprecated_function_message, remove=(3, 15))
return _glob1(dirname, pattern, None, False)

# This helper function recursively yields relative pathnames inside a literal
# directory.

def _glob2(dirname, pattern, dir_fd, dironly, include_hidden=False):
assert _isrecursive(pattern)
if not dirname or _isdir(dirname, dir_fd):
yield pattern[:0]
yield from _rlistdir(dirname, dir_fd, dironly,
include_hidden=include_hidden)

# If dironly is false, yields all file names inside a directory.
# If dironly is true, yields only directory names.
def _iterdir(dirname, dir_fd, dironly):
try:
fd = None
fsencode = None
if dir_fd is not None:
if dirname:
fd = arg = os.open(dirname, _dir_open_flags, dir_fd=dir_fd)
else:
arg = dir_fd
if isinstance(dirname, bytes):
fsencode = os.fsencode
elif dirname:
arg = dirname
elif isinstance(dirname, bytes):
arg = bytes(os.curdir, 'ASCII')
else:
arg = os.curdir
try:
with os.scandir(arg) as it:
for entry in it:
try:
if not dironly or entry.is_dir():
if fsencode is not None:
yield fsencode(entry.name)
else:
yield entry.name
except OSError:
pass
finally:
if fd is not None:
os.close(fd)
except OSError:
return

def _listdir(dirname, dir_fd, dironly):
with contextlib.closing(_iterdir(dirname, dir_fd, dironly)) as it:
return list(it)

# Recursively yields relative pathnames inside a literal directory.
def _rlistdir(dirname, dir_fd, dironly, include_hidden=False):
names = _listdir(dirname, dir_fd, dironly)
for x in names:
if include_hidden or not _ishidden(x):
yield x
path = _join(dirname, x) if dirname else x
for y in _rlistdir(path, dir_fd, dironly,
include_hidden=include_hidden):
yield _join(x, y)
return list(_relative_glob(Globber().wildcard_selector(pattern, []), dirname))


def _lexists(pathname, dir_fd):
# Same as os.path.lexists(), but with dir_fd
if dir_fd is None:
return os.path.lexists(pathname)
try:
os.lstat(pathname, dir_fd=dir_fd)
except (OSError, ValueError):
return False
else:
return True

def _isdir(pathname, dir_fd):
# Same as os.path.isdir(), but with dir_fd
if dir_fd is None:
return os.path.isdir(pathname)
try:
st = os.stat(pathname, dir_fd=dir_fd)
except (OSError, ValueError):
return False
else:
return stat.S_ISDIR(st.st_mode)

def _join(dirname, basename):
# It is common if dirname or basename is empty
if not dirname or not basename:
return dirname or basename
return os.path.join(dirname, basename)
def _split_pathname(pathname):
"""Split the given path into a pair (anchor, parts), where *anchor* is the
path drive and root (if any), and *parts* is a reversed list of path parts.
"""
parts = []
split = os.path.split
dirname, part = split(pathname)
while dirname != pathname:
parts.append(part)
pathname = dirname
dirname, part = split(pathname)
return dirname, parts

def _relative_glob(select, dirname, dir_fd=None):
"""Globs using a select function from the given dirname. The dirname
prefix is removed from results.
"""
dirname = Globber.add_slash(dirname)
slicer = operator.itemgetter(slice(len(dirname), None))
Contributor

Is it more efficient to use this or a plain `yield path[dirname_length:]`? (With the latter approach, we can't use `map` anymore.)
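
One way to answer the question is to time both variants in isolation. The sketch below is only a rough micro-benchmark under assumed inputs (1000 short string paths); the function names are hypothetical and the numbers will vary by machine and Python version, so it shows how to measure rather than which variant wins.

```python
import operator
import timeit


def strip_prefix_map(paths, prefix_len):
    """Strip a fixed-length prefix lazily with map() and itemgetter."""
    slicer = operator.itemgetter(slice(prefix_len, None))
    return map(slicer, paths)


def strip_prefix_gen(paths, prefix_len):
    """Strip a fixed-length prefix with a plain generator."""
    for path in paths:
        yield path[prefix_len:]


if __name__ == "__main__":
    # Assumed sample data, not taken from the PR.
    sample = [f"some/dir/file{i}.txt" for i in range(1000)]
    prefix_len = len("some/dir/")
    for func in (strip_prefix_map, strip_prefix_gen):
        seconds = timeit.timeit(lambda: list(func(sample, prefix_len)), number=200)
        print(f"{func.__name__}: {seconds:.4f}s")
```

Both helpers yield the same values; the difference is only whether the per-item work happens in `map` or in a generator frame.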

return map(slicer, select(dirname, dir_fd, dirname))

def has_magic(s):
if isinstance(s, bytes):
@@ -232,15 +112,6 @@ def has_magic(s):
match = magic_check.search(s)
return match is not None

def _ishidden(path):
return path[0] in ('.', b'.'[0])

def _isrecursive(pattern):
if isinstance(pattern, bytes):
return pattern == b'**'
else:
return pattern == '**'

def escape(pathname):
"""Escape all special characters.
"""
@@ -252,6 +123,3 @@ def escape(pathname):
else:
pathname = magic_check.sub(r'[\1]', pathname)
return drive + pathname


_dir_open_flags = os.O_RDONLY | getattr(os, 'O_DIRECTORY', 0)
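
To make the new helpers in this file easier to follow, here is a standalone sketch of two pieces of the diff: the loop from `_split_pathname()` and the `itertools.dropwhile(operator.not_, ...)` idiom used in `iglob()`. It is an illustration rather than the PR's code: it uses `posixpath.split` instead of `os.path.split` so the behaviour is the same on every platform, and the names `split_pathname` and `skip_leading_empty` are made up.

```python
import itertools
import operator
import posixpath


def split_pathname(pathname, split=posixpath.split):
    """Return (anchor, parts), where parts is in reverse order."""
    parts = []
    dirname, part = split(pathname)
    # split() returns its argument unchanged as the dirname once only
    # the anchor (or the empty string) is left, which ends the loop.
    while dirname != pathname:
        parts.append(part)
        pathname = dirname
        dirname, part = split(pathname)
    return dirname, parts


def skip_leading_empty(paths):
    """Drop falsy items from the front only, as dropwhile() does."""
    return itertools.dropwhile(operator.not_, paths)


if __name__ == "__main__":
    print(split_pathname("/usr/lib/*.py"))
    print(split_pathname("a/b"))
    print(list(skip_leading_empty(["", "a", "", "b"])))
```

Note that `dropwhile()` stops filtering at the first truthy item, so an empty string later in the stream would pass through; the diff relies on the empty match only ever appearing first.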
8 changes: 5 additions & 3 deletions Lib/pathlib/_abc.py
@@ -403,7 +403,7 @@ def match(self, path_pattern, *, case_sensitive=None):
return False
if len(path_parts) > len(pattern_parts) and path_pattern.anchor:
return False
globber = self._globber(sep, case_sensitive)
globber = self._globber(sep, case_sensitive, include_hidden=True)
for path_part, pattern_part in zip(path_parts, pattern_parts):
match = globber.compile(pattern_part)
if match(path_part) is None:
@@ -419,7 +419,8 @@ def full_match(self, pattern, *, case_sensitive=None):
pattern = self.with_segments(pattern)
if case_sensitive is None:
case_sensitive = _is_case_sensitive(self.parser)
globber = self._globber(pattern.parser.sep, case_sensitive, recursive=True)
globber = self._globber(pattern.parser.sep, case_sensitive,
recursive=True, include_hidden=True)
match = globber.compile(pattern._pattern_str)
return match(self._pattern_str) is not None

@@ -694,7 +695,8 @@ def _glob_selector(self, parts, case_sensitive, recurse_symlinks):
# must use scandir() for everything, including non-wildcard parts.
case_pedantic = True
recursive = True if recurse_symlinks else _glob.no_recurse_symlinks
globber = self._globber(self.parser.sep, case_sensitive, case_pedantic, recursive)
globber = self._globber(self.parser.sep, case_sensitive,
case_pedantic, recursive, include_hidden=True)
return globber.selector(parts)

def glob(self, pattern, *, case_sensitive=None, recurse_symlinks=True):