Skip to content

check_case_conflict is slow in a repo with many files/directories on windows #625

Closed
@guykisel

Description

Hello!

I recently updated to a more recent version of this repo for my team's git hooks and we noticed that check_case_conflict got a lot slower. I believe I've tracked down what's going on and I have a suggested improvement, though I'm not sure if it's a great approach.

For context, this is the Legends of Runeterra game's monorepo :) It's got around 1m files and some are pretty deeply nested in directories (a little bit more on this at https://technology.riotgames.com/news/legends-runeterra-cicd-pipeline if anyone's curious)

Testing done using Python 3.6.3 on Windows 10 with pre-commit-hooks==4.0.1 and pre-commit==2.13.0

#575 introduced also checking directories, which is great! But this snippet ends up being slow because os.path.dirname is slow (at least on windows :( )

def parents(file: str) -> Iterator[str]:
    file = os.path.dirname(file)
    while file:
        yield file
        file = os.path.dirname(file)

I used cProfile to run this against a single file:

from pre_commit_hooks import check_case_conflict
import cProfile
cProfile.run('check_case_conflict.find_conflicting_filenames("Jenkinsfile")')

results:

λ py test.py
         41899740 function calls in 20.114 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.047    0.047   20.114   20.114 <string>:1(<module>)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:38(_remove)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
        3    0.000    0.000    0.277    0.092 check_case_conflict.py:13(lower_set)
        3    0.186    0.062    0.277    0.092 check_case_conflict.py:14(<setcomp>)
  3189815    1.179    0.000   17.790    0.000 check_case_conflict.py:17(parents)
        2    0.000    0.000   18.831    9.416 check_case_conflict.py:24(directories_for)
        2    1.042    0.521   18.831    9.416 check_case_conflict.py:25(<setcomp>)
        1    0.147    0.147   20.066   20.066 check_case_conflict.py:28(find_conflicting_filenames)
  3189815    3.710    0.000    5.528    0.000 ntpath.py:121(splitdrive)
  3189815    7.833    0.000   15.701    0.000 ntpath.py:199(split)
  3189815    0.910    0.000   16.611    0.000 ntpath.py:240(dirname)
  3189815    0.813    0.000    1.216    0.000 ntpath.py:33(_get_bothseps)
        2    0.000    0.000    0.000    0.000 subprocess.py:1023(_internal_poll)
        2    0.000    0.000    0.022    0.011 subprocess.py:1040(wait)
        2    0.000    0.000    0.665    0.332 subprocess.py:1067(_communicate)
       22    0.000    0.000    0.000    0.000 subprocess.py:179(Close)
...

I messed with it a bit locally and this approach seems to perform significantly better:

def parents(file: str) -> Iterator[str]:
    path_parts = file.split(os.sep)
    file = path_parts[-1]
    while path_parts:
        yield file
        file = path_parts.pop()
λ py test.py
         1962384 function calls in 2.469 seconds

   Ordered by: standard name

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.042    0.042    2.469    2.469 <string>:1(<module>)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:38(_remove)
        4    0.000    0.000    0.000    0.000 _weakrefset.py:81(add)
        3    0.000    0.000    0.283    0.094 check_case_conflict.py:13(lower_set)
        3    0.195    0.065    0.283    0.094 check_case_conflict.py:14(<setcomp>)
   784800    0.369    0.000    0.670    0.000 check_case_conflict.py:17(parents)
        2    0.000    0.000    1.079    0.539 check_case_conflict.py:26(directories_for)
        2    0.409    0.204    1.079    0.539 check_case_conflict.py:27(<setcomp>)
        1    0.231    0.231    2.427    2.427 check_case_conflict.py:30(find_conflicting_filenames)
        2    0.000    0.000    0.000    0.000 subprocess.py:1023(_internal_poll)
        2    0.000    0.000    0.020    0.010 subprocess.py:1040(wait)
        2    0.000    0.000    0.695    0.348 subprocess.py:1067(_communicate)
...

But it doesn't really feel good doing this instead of using the built-in os.path stuff. I'm not really sure how correct/robust/multi-OS-compatible this is.

I'm curious if anyone has alternative ideas! But also if this is simply out of scope for this tool that's fine too :) Thanks for reading!

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions