Skip to content

improve performance by only looking up git information once#191

Open
fivetanley wants to merge 10 commits intomainfrom
fivetanley/performance-updates
Open

improve performance by only looking up git information once#191
fivetanley wants to merge 10 commits intomainfrom
fivetanley/performance-updates

Conversation

@fivetanley
Copy link
Collaborator

@fivetanley fivetanley commented Feb 5, 2026

🛠️ Description

  • use filepath.walkDir instead of filepath.Walk. filepath.walkDir can be faster than Walk due to not calling os.Lstat on every file. This allows us to call os.Lstat from the goroutines in parallel instead of always blocking in the main thread. We also avoid calling os.Lstat on files in the ignore pattern
  • Only calculate the repository's first commit year once per run, instead of on every file lookup
  • Pass repoRoot to functions instead of calculating it every single time
  • Use git log across the repository to grab the first commit year and build an index of [fileName] = yearOfLastCommit. This prevents us from calling git log on every single file. Since git log --reverse will traverse the whole git tree anyway, this prevents us from having to iterate the entire git log for every single file.
  • ignore .git/**/*.pack files by default

Improvements:

On https://github.com/hashicorp/cloud-ui, this:

  • takes the runtime down from 11.6 minutes (701.80 seconds) to about 5s first run , 4s on subsequent runs (it runs faster on subsequent runs due to OS filesystem caching)
  • uses a lot less memory. Before: ~5.30gb, After: ~50mb

🔗 External Links

👍 Definition of Done

  • New functionality works?
  • Tests added? - no new tests, but existing tests pass

🤔 Can be merged upon approval?

PCI review checklist

  • I have documented a clear reason for, and description of, the change I am making.

  • If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.

  • If applicable, I've documented the impact of any changes to security controls.

    Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.

- only find the repository's first commit date once instead of on every file
  read
- parallelize checking files on startup by moving file.isDir() check to worker thread
- headers: only calculate repository root once instead of on every file
@fivetanley fivetanley requested a review from a team as a code owner February 5, 2026 17:29

if !dryRun {
updated, err := licensecheck.UpdateCopyrightHeader(path, targetHolder, configYear, false)
updated, err := licensecheck.UpdateCopyrightHeader(path, targetHolder, configYear, false, repoFirstYear, repoRoot)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any external code using these exported functions will break. This is a major version bump. Is it possible to mitigate it by adding wrapper functions for backward compatibility.

continue
}
// If it's a 4-digit year
if year, err := strconv.Atoi(line); err == nil && year > 1900 && year < 2100 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, its rare but the parsing logic could theoretically misidentify a file named with just 4 digits (e.g., 2025) as a commit year. Again this is rare in practice.
If we can be a bit defensive here, we can consider adding the YEAR: prefix to the git log format to eliminate ambiguity. Otherwise, documenting this assumption in a comment would be helpful.

Not a blocker, but worth considering for robustness.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't think about that edge case but I could definitely see it happening. I added a __CW_YEAR__= prefix to the git log and the parsing logic to resolve this

once.Do(func() {
cache, firstYear, err := buildRepositoryCache(repoRoot)
if err != nil {
lastCommitYearsCache = make(map[string]int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me know your thoughts: When git cache fails, the tool still works but uses lastCommitYear=0 for all files, falling back to config-based years. This is probably fine, but:

  • We may need to log it for user (debugging becomes easier)
  • Might mask real problems like (git installed but failing, etc)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the logic that was present (see 376 in the pervious version of this file) so I was just maintaining what was there. Happy to add a new log line if that's helpful

Comment on lines +192 to +195
repoFirstYear, _ := licensecheck.GetRepoFirstCommitYear(".")

// Open git repository once for all file operations
repoRoot, _ := licensecheck.GetRepoRoot(".")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if we run this from a subdirectory, or across multiple repos, will it still work correctly? Since the cache would only apply to the parent repo. Please let me know if my understanding is correct.

Copy link
Collaborator Author

@fivetanley fivetanley Feb 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, GetRepoRoot remains unchanged. So if run within a subdirectory, it's shelling out to git to find the root anyway. This function is not cached.

".github/dependabot.yml",
"**/node_modules/**",
".copywrite.hcl",
".git/**/*.pack",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason we shouldn't skip the entire .git folder? I cannot think of a good reason not to.

I'm surprised this was not ignored already to be honest and we accidentally didn't cause some random files there to get updated. 😅

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants