improve performance by only looking up git information once#191
improve performance by only looking up git information once#191fivetanley wants to merge 10 commits intomainfrom
Conversation
- only find the repository's first commit date once instead of on every file read - parallelize checking files on startup by moving file.isDir() check to worker thread - headers: only calculate repository root once instead of on every file
|
|
||
| if !dryRun { | ||
| updated, err := licensecheck.UpdateCopyrightHeader(path, targetHolder, configYear, false) | ||
| updated, err := licensecheck.UpdateCopyrightHeader(path, targetHolder, configYear, false, repoFirstYear, repoRoot) |
There was a problem hiding this comment.
Any external code using these exported functions will break. This is a major version bump. Is it possible to mitigate it by adding wrapper functions for backward compatibility.
licensecheck/update.go
Outdated
| continue | ||
| } | ||
| // If it's a 4-digit year | ||
| if year, err := strconv.Atoi(line); err == nil && year > 1900 && year < 2100 { |
There was a problem hiding this comment.
In practice, its rare but the parsing logic could theoretically misidentify a file named with just 4 digits (e.g., 2025) as a commit year. Again this is rare in practice.
If we can be a bit defensive here, we can consider adding the YEAR: prefix to the git log format to eliminate ambiguity. Otherwise, documenting this assumption in a comment would be helpful.
Not a blocker, but worth considering for robustness.
There was a problem hiding this comment.
Didn't think about that edge case but I could definitely see it happening. I added a __CW_YEAR__= prefix to the git log and the parsing logic to resolve this
| once.Do(func() { | ||
| cache, firstYear, err := buildRepositoryCache(repoRoot) | ||
| if err != nil { | ||
| lastCommitYearsCache = make(map[string]int) |
There was a problem hiding this comment.
Let me know your thoughts: When git cache fails, the tool still works but uses lastCommitYear=0 for all files, falling back to config-based years. This is probably fine, but:
- We may need to log it for user (debugging becomes easier)
- Might mask real problems like (git installed but failing, etc)
There was a problem hiding this comment.
This is the logic that was present (see 376 in the pervious version of this file) so I was just maintaining what was there. Happy to add a new log line if that's helpful
| repoFirstYear, _ := licensecheck.GetRepoFirstCommitYear(".") | ||
|
|
||
| // Open git repository once for all file operations | ||
| repoRoot, _ := licensecheck.GetRepoRoot(".") |
There was a problem hiding this comment.
So if we run this from a subdirectory, or across multiple repos, will it still work correctly? Since the cache would only apply to the parent repo. Please let me know if my understanding is correct.
There was a problem hiding this comment.
yeah, GetRepoRoot remains unchanged. So if run within a subdirectory, it's shelling out to git to find the root anyway. This function is not cached.
| ".github/dependabot.yml", | ||
| "**/node_modules/**", | ||
| ".copywrite.hcl", | ||
| ".git/**/*.pack", |
There was a problem hiding this comment.
Is there a reason we shouldn't skip the entire .git folder? I cannot think of a good reason not to.
I'm surprised this was not ignored already to be honest and we accidentally didn't cause some random files there to get updated. 😅
🛠️ Description
filepath.walkDirinstead offilepath.Walk.filepath.walkDircan be faster thanWalkdue to not calling os.Lstat on every file. This allows us to call os.Lstat from the goroutines in parallel instead of always blocking in the main thread. We also avoid calling os.Lstat on files in the ignore patterngit logacross the repository to grab the first commit year and build an index of[fileName] = yearOfLastCommit. This prevents us from callinggit logon every single file. Sincegit log --reversewill traverse the whole git tree anyway, this prevents us from having to iterate the entire git log for every single file..git/**/*.packfiles by defaultImprovements:
On https://github.com/hashicorp/cloud-ui, this:
🔗 External Links
👍 Definition of Done
🤔 Can be merged upon approval?
✅
PCI review checklist
I have documented a clear reason for, and description of, the change I am making.
If applicable, I've documented a plan to revert these changes if they require more than reverting the pull request.
If applicable, I've documented the impact of any changes to security controls.
Examples of changes to security controls include using new access control methods, adding or removing logging pipelines, etc.