[chore][pkg/stanza] Speed up file deduplication in finder (open-telemetry#34888)

**Description:**

For large numbers of files, the logic that deduplicates the filenames between matches is costly. This is mainly due to the O(n^2) deduping algorithm used. If we instead use a map (as a hashset), we can make this ~O(n).

This PR speeds up the deduplication logic and adds a benchmark for the case where the filelog receiver is polling many files at once.

**Testing:**

Running the added benchmark and comparing the results with benchstat shows a large speedup for the large-file-count case (10000 monitored files), at the cost of a very slight increase in memory usage:

```
goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/matcher/internal/finder
cpu: Apple M3 Pro
                │    old.txt    │               new.txt               │
                │    sec/op     │    sec/op     vs base               │
Find10kFiles-12   198.636m ± 6%   8.696m ± 16%  -95.62% (p=0.002 n=6)

                │   old.txt    │              new.txt              │
                │     B/op     │     B/op      vs base             │
Find10kFiles-12   5.416Mi ± 0%   5.581Mi ± 0%  +3.04% (p=0.002 n=6)

                │   old.txt   │             new.txt              │
                │  allocs/op  │  allocs/op   vs base             │
Find10kFiles-12   80.06k ± 0%   80.25k ± 0%  +0.23% (p=0.002 n=6)
```
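
The map-as-hashset idea from the description can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: the function name `dedupe` and the sample paths are hypothetical, and the real finder may structure its loop differently.

```go
package main

import "fmt"

// dedupe returns paths with duplicates removed, keeping the first
// occurrence of each entry in its original order.
// Hypothetical helper for illustration only.
func dedupe(paths []string) []string {
	// The map acts as a set: struct{} values occupy no space.
	seen := make(map[string]struct{}, len(paths))
	out := make([]string, 0, len(paths))
	for _, p := range paths {
		if _, ok := seen[p]; ok {
			continue // already emitted, skip in O(1) average time
		}
		seen[p] = struct{}{}
		out = append(out, p)
	}
	return out
}

func main() {
	// Overlapping glob matches typically produce repeated paths like these.
	matches := []string{"a.log", "b.log", "a.log", "c.log", "b.log"}
	fmt.Println(dedupe(matches)) // prints [a.log b.log c.log]
}
```

Each lookup and insert on the map is constant time on average, so the whole pass is roughly linear in the number of paths, compared with scanning the output slice for every candidate. The extra map is also a plausible source of the small B/op and allocs/op increase seen in the benchstat output.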