Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(git): avoid unnecessary file hashing while config files detection #6461

Merged
merged 6 commits into from
Sep 25, 2024

Conversation

vvagaytsev
Copy link
Collaborator

@vvagaytsev vvagaytsev commented Sep 23, 2024

What this PR does / why we need it:

As it was mentioned in #5920, Garden Core calls AbstractGitHandler.getFiles() method twice:

  • While looking for Garden config files
  • While actions/modules resolution

For big projects with a large amount of untracked files that can cause performance downgrade, because each calls trigger the hash calculation of the untracked files.

This PR makes a simple improvement by disabling hash-calculation in the config files detection phase.

It improves a performance only in a specific scenario, when too many files untracked by Git are included into the Garden.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The performance difference can be compared by running validate -l3 Garden command in the root of https://github.com/garden-io/garden repo with modified https://github.com/garden-io/garden/blob/main/.gardenignore file. Delete node_modules from .gardenignore before running.

Before

ℹ git                       → [verbose] Scanning project root at /Users/vladimirvagaytsev/Repositories/garden
  → Includes: (none)
  → Excludes: (none)
ℹ git                       → [verbose] Ignoring symlink with absolute target at /Users/vladimirvagaytsev/Repositories/garden/core/test/data/test-projects/build-dir/symlink-absolute/symlink.txt
ℹ git                       → [verbose] Found 127156 files in project root /Users/vladimirvagaytsev/Repositories/garden (took 9.98 sec)
...
ℹ graph                     → Resolving actions and modules...
ℹ git                       → [verbose] Scanning module e2e-tests root at /Users/vladimirvagaytsev/Repositories/garden
  → Includes: (none)
  → Excludes: (none)
ℹ git                       → [verbose] Found 127156 files in module e2e-tests root /Users/vladimirvagaytsev/Repositories/garden (took 9.97 sec)
⚠ graph                     → Large number of files (127156) found in module e2e-tests. You may need to configure file exclusions.
See https://docs.garden.io/using-garden/configuration-overview#including-excluding-files-and-directories for details.
✔ graph                     → Finished resolving graph (took 13.4 sec)
ℹ garden                    → Finished initializing Garden (took 25.1 sec)

After

ℹ git                       → [verbose] Scanning project root at /Users/vladimirvagaytsev/Repositories/garden
  → Includes: (none)
  → Excludes: (none)
ℹ git                       → [verbose] Ignoring symlink with absolute target at /Users/vladimirvagaytsev/Repositories/garden/core/test/data/test-projects/build-dir/symlink-absolute/symlink.txt
ℹ git                       → [verbose] Found 127156 files in project root /Users/vladimirvagaytsev/Repositories/garden (took 1.73 sec)
...
ℹ graph                     → Resolving actions and modules...
ℹ git                       → [verbose] Scanning module e2e-tests root at /Users/vladimirvagaytsev/Repositories/garden
  → Includes: (none)
  → Excludes: (none)
ℹ git                       → [verbose] Found 127156 files in module e2e-tests root /Users/vladimirvagaytsev/Repositories/garden (took 10.79 sec)
⚠ graph                     → Large number of files (127156) found in module e2e-tests. You may need to configure file exclusions.
See https://docs.garden.io/using-garden/configuration-overview#including-excluding-files-and-directories for details.
✔ graph                     → Finished resolving graph (took 15.8 sec)
ℹ garden                    → Finished initializing Garden (took 19.5 sec)

@vvagaytsev
Copy link
Collaborator Author

vvagaytsev commented Sep 23, 2024

@thsig @stefreak @edvald this is a naive attempt to do a cheap and simple opmitization for hash calculation in the case then there are many-many files untracked by Git are (implicitly) included in the Garden working tree.

This PR just disables the hash calculation in the config files detection phase. I'm not sure if it's a good idea.

It actually does not solve the problem with a large number of untracked files. It hides that problem instead.

In some cases users might not want to include all these untracked files. Those files can be generated by some build jobs or another scripts. Such generated files are not always supposed to be included in the working file set of the Garden project.

It might be better to want user loudly in case if a big number of untracked files is detected.

Another attempt was done in #5920, as Jon suggested in the comment, we could use TreeCache for untracked files, but it would have the same effect as this PR, just the implementation would be different.
The hashes for the large number of the untracked files will still be fully calculated, just once and not twice.

Any global ideas on dealing with the large number of untracked files? We can also log a separate warning if these number of untracked files is too large. But I'm not sure it that improves the UX a lot. A trial was done in #6452.

@stefreak
Copy link
Member

stefreak commented Sep 23, 2024

@vvagaytsev in case the large number of untracked files are included in an action, the user will get a warning

Large number of files (127156) found in module e2e-tests. You may need to configure file exclusions.
See https://docs.garden.io/using-garden/configuration-overview#including-excluding-files-and-directories for details.

So how do we hide a problem? If the user doesn't want those files to be included, they need to configure the action not to include them.

Once we get to the point that we only calculate the hash if the file is included in the action, it doesn't matter how the user excludes the file (using the gardenignore feature or includes/excludes).

In the end, we can't know if the files need to be excluded or included and so we shouldn't send the user to a potentially wrong direction. If the files are not included, they shouldn't slow down Garden. If they are included, the user should indeed be informed that it might slow down things. The existing warning can definitely be improved to be more informational, e.g. provide means for understanding which files are causing the issue exactly.

@vvagaytsev
Copy link
Collaborator Author

@stefreak yes, you're right.
We do have a warning printed on the large number of included files, both tracked and untracked. I don't think it makes any sense to introduce one more warning for untracked files only.

Regarding the problem "hiding", sorry for the confusion, let me elaborate that a bit.

I meant that the changes made in this PR can "hide" a bigger problem. This PR is not the proper solution, but just a quick-fix to reduce the processing time and make it not-too-slow for "some" (not too large) number of untracked files. Maybe "deferring a problem" would be a better wording here :)

This PR covers a specific case, but does not fix the performance issue for a "large enough" number of untracked files.
Thus, I'm not sure if it's a good idea to merge this.

@stefreak
Copy link
Member

@vvagaytsev I think it might make sense to change the warning to only warn you if there is a large number of untracked files, if that's the only case that can cause a performance issue; Definitely open to think about that separately 👍

It feels to me that this PR gives a tangible performance improvement already, and while it does not solve the problem entirely it does not make it worse or hide it so I am pro merging this 👍

@stefreak
Copy link
Member

There is a test failure though, maybe avoiding the hashing in the first pass is causing a regression?

@vvagaytsev
Copy link
Collaborator Author

@stefreak right, actually, we should notify users on the large number of untracked and modified files, i.e. the files we need to calculate hashes for.
Having 10000+ is not a problem if all of those are committed to Git, because the hashes are already calculated.

I'll take a closer look at the test failures.

@vvagaytsev vvagaytsev force-pushed the perf/avoid-unnecessary-file-hashing branch from a2f9a8e to 68b2c67 Compare September 24, 2024 09:35
@vvagaytsev vvagaytsev force-pushed the perf/avoid-unnecessary-file-hashing branch from 68b2c67 to cb51de9 Compare September 24, 2024 10:09
@vvagaytsev vvagaytsev changed the title perf(git): avoid unnecessary file hashing perf(git): avoid unnecessary file hashing while config files detection Sep 24, 2024
@vvagaytsev vvagaytsev force-pushed the perf/avoid-unnecessary-file-hashing branch from cb51de9 to d55cc02 Compare September 24, 2024 11:23
@vvagaytsev
Copy link
Collaborator Author

@stefreak the tests should be green now. Please review. I'll change the warning on the large number of untracked files in a separate PR.

@vvagaytsev vvagaytsev marked this pull request as ready for review September 24, 2024 11:54
@edvald
Copy link
Collaborator

edvald commented Sep 24, 2024

Are we now using the repo scan mode when scanning for configs? If we are, I think we may want to address that, since scanning for configs is otherwise quite efficient and shouldn't trigger any file hashing at all.

@vvagaytsev
Copy link
Collaborator Author

Are we now using the repo scan mode when scanning for configs? If we are, I think we may want to address that, since scanning for configs is otherwise quite efficient and shouldn't trigger any file hashing at all.

@edvald yes, we use repo scan mode. The config files scanner uses the same VCS handler instance as the Garden does, and the repo scan mode has been a default one since 0.13.20.

@edvald
Copy link
Collaborator

edvald commented Sep 24, 2024

Got it. I think we should look into that, could be a performance drag when scanning for configs in many cases.

Copy link
Member

@stefreak stefreak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree with @jon if there is a performance benefit using the other scan mode for configs, but aren't we caching some of the scan results? If so, let's only change the scan mode if it gives a measurable improvement.

Let's address that in a separate PR, this is already an improvement and looks good to me!

@edvald
Copy link
Collaborator

edvald commented Sep 25, 2024

Agreed, let's do that in a separate PR. But I'm quite sure that using the other scan mode to find config files will work better, at the very least in commands that don't need to resolve all or many actions, where the repo is fairly large.

@vvagaytsev vvagaytsev added this pull request to the merge queue Sep 25, 2024
Merged via the queue into main with commit a786a50 Sep 25, 2024
40 checks passed
@vvagaytsev vvagaytsev deleted the perf/avoid-unnecessary-file-hashing branch September 25, 2024 09:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants