check_case_conflict is slow in a repo with many files/directories on windows #625
Description
Hello!
I recently updated to a more recent version of this repo for my team's git hooks and we noticed that check_case_conflict got a lot slower. I believe I've tracked down what's going on and I have a suggested improvement, though I'm not sure if it's a great approach.
For context, this is the Legends of Runeterra game's monorepo :) It's got around 1 million files, and some are pretty deeply nested in directories (a little more on this at https://technology.riotgames.com/news/legends-runeterra-cicd-pipeline if anyone's curious).
Testing was done using Python 3.6.3 on Windows 10 with pre-commit-hooks==4.0.1 and pre-commit==2.13.0.
#575 added checking directories as well, which is great! But this snippet ends up being slow because os.path.dirname is slow (at least on Windows :( ):
```python
from typing import Iterator
import os.path

def parents(file: str) -> Iterator[str]:
    file = os.path.dirname(file)
    while file:
        yield file
        file = os.path.dirname(file)
```
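For anyone following along, here's what that generator yields for a small example path (a self-contained repro of the same code, not taken from the hook's tests):

```python
from typing import Iterator
import os.path

# Same implementation as in check_case_conflict.py, reproduced so this
# snippet runs on its own.
def parents(file: str) -> Iterator[str]:
    file = os.path.dirname(file)
    while file:
        yield file
        file = os.path.dirname(file)

# Each call walks the whole ancestor chain, invoking os.path.dirname
# once per directory level.
print(list(parents('a/b/c/d.txt')))  # ['a/b/c', 'a/b', 'a']
```

Since the hook calls this for every file, a repo with ~1M deeply nested files ends up making millions of os.path.dirname calls.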
I used cProfile to run this against a single file:
```python
from pre_commit_hooks import check_case_conflict
import cProfile

cProfile.run('check_case_conflict.find_conflicting_filenames("Jenkinsfile")')
```
results:
```
λ py test.py
         41899740 function calls in 20.114 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.047    0.047   20.114   20.114 <string>:1(<module>)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:38(_remove)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
        3    0.000    0.000    0.277    0.092 check_case_conflict.py:13(lower_set)
        3    0.186    0.062    0.277    0.092 check_case_conflict.py:14(<setcomp>)
  3189815    1.179    0.000   17.790    0.000 check_case_conflict.py:17(parents)
        2    0.000    0.000   18.831    9.416 check_case_conflict.py:24(directories_for)
        2    1.042    0.521   18.831    9.416 check_case_conflict.py:25(<setcomp>)
        1    0.147    0.147   20.066   20.066 check_case_conflict.py:28(find_conflicting_filenames)
  3189815    3.710    0.000    5.528    0.000 ntpath.py:121(splitdrive)
  3189815    7.833    0.000   15.701    0.000 ntpath.py:199(split)
  3189815    0.910    0.000   16.611    0.000 ntpath.py:240(dirname)
  3189815    0.813    0.000    1.216    0.000 ntpath.py:33(_get_bothseps)
        2    0.000    0.000    0.000    0.000 subprocess.py:1023(_internal_poll)
        2    0.000    0.000    0.022    0.011 subprocess.py:1040(wait)
        2    0.000    0.000    0.665    0.332 subprocess.py:1067(_communicate)
       22    0.000    0.000    0.000    0.000 subprocess.py:179(Close)
...
```
I messed with it a bit locally and this approach seems to perform significantly better:
```python
def parents(file: str) -> Iterator[str]:
    path_parts = file.split(os.sep)
    path_parts.pop()  # drop the filename itself
    while path_parts:
        yield os.sep.join(path_parts)
        path_parts.pop()
```
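As a quick sanity check that the split version walks the same ancestor chain as the dirname version (a small self-contained sketch, with the path built from os.sep so it runs on any OS):

```python
import os
import os.path
from typing import Iterator

# The hook's current dirname-based implementation.
def parents_dirname(file: str) -> Iterator[str]:
    file = os.path.dirname(file)
    while file:
        yield file
        file = os.path.dirname(file)

# The proposed split-based implementation.
def parents_split(file: str) -> Iterator[str]:
    path_parts = file.split(os.sep)
    path_parts.pop()  # drop the filename itself
    while path_parts:
        yield os.sep.join(path_parts)
        path_parts.pop()

# Both should yield the same ancestors, nearest first.
path = os.sep.join(['a', 'b', 'c', 'd.txt'])
assert list(parents_dirname(path)) == list(parents_split(path))
print(list(parents_split(path)))
```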
```
λ py test.py
```
```
         1962384 function calls in 2.469 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.042    0.042    2.469    2.469 <string>:1(<module>)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:38(_remove)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
        3    0.000    0.000    0.283    0.094 check_case_conflict.py:13(lower_set)
        3    0.195    0.065    0.283    0.094 check_case_conflict.py:14(<setcomp>)
   784800    0.369    0.000    0.670    0.000 check_case_conflict.py:17(parents)
        2    0.000    0.000    1.079    0.539 check_case_conflict.py:26(directories_for)
        2    0.409    0.204    1.079    0.539 check_case_conflict.py:27(<setcomp>)
        1    0.231    0.231    2.427    2.427 check_case_conflict.py:30(find_conflicting_filenames)
        2    0.000    0.000    0.000    0.000 subprocess.py:1023(_internal_poll)
        2    0.000    0.000    0.020    0.010 subprocess.py:1040(wait)
        2    0.000    0.000    0.695    0.348 subprocess.py:1067(_communicate)
...
```
But it doesn't feel great doing this by hand instead of using the built-in os.path functions, and I'm not sure how correct/robust/multi-OS-compatible it is.
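To make the robustness worry concrete, here's a small sketch using ntpath directly so it runs on any OS: on Windows, os.sep is a backslash, but git reports paths with forward slashes everywhere, so splitting on os.sep wouldn't split at all, while ntpath.dirname handles both separators.

```python
import ntpath

# ntpath (os.path on Windows) understands both separators:
print(ntpath.dirname('a/b/c.txt'))  # 'a/b'

# but splitting a forward-slash path on a backslash separator does nothing:
print('a/b/c.txt'.split('\\'))  # ['a/b/c.txt']
```

One possible mitigation: since git always reports forward-slash paths, splitting on `'/'` explicitly rather than `os.sep` might sidestep this, but I haven't tested that across platforms.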
I'm curious if anyone has alternative ideas! But also if this is simply out of scope for this tool that's fine too :) Thanks for reading!