Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[pkg/stanza/fileconsumer] Optimize file finding logic #35895

Closed
wants to merge 2 commits into from

Conversation

djaglowski
Copy link
Member

This PR migrates us from doublestar.FilepathGlob to doublestar.GlobWalk. There are a few benefits to this:

  1. Pattern validation, cleaning, and parsing is pulled forward so that it only needs to be performed once at startup. This accounts for the bulk of the performance improvement.
  2. We can make use of a "walk function" where we can handle each element earlier in the process. Currently, this is only used to cache and immediately recall previous decisions about file paths, but in the future it may be useful for filtering. e.g. The "max age" filter could be applied here, rather than after the list of files are returned.

Benchmarks show roughly a 5% decrease in CPU cycles and a 20% decrease in memory consumption.

goos: darwin
goarch: arm64
pkg: github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/fileconsumer/matcher/internal/finder
cpu: Apple M1 Max
                       │ pkg/stanza/main-finder.txt │   pkg/stanza/finder-optimize.txt   │
                       │           sec/op           │   sec/op     vs base               │
Find10kFiles-10                         12.40m ± 1%   11.47m ± 4%  -7.44% (p=0.000 n=10)
Find10kFilesComplex-10                  2.887m ± 2%   2.784m ± 2%  -3.57% (p=0.000 n=10)
geomean                                 5.982m        5.652m       -5.53%

                       │ pkg/stanza/main-finder.txt │    pkg/stanza/finder-optimize.txt    │
                       │            B/op            │     B/op      vs base                │
Find10kFiles-10                        5.581Mi ± 0%   4.395Mi ± 0%  -21.25% (p=0.000 n=10)
Find10kFilesComplex-10                 661.0Ki ± 0%   526.0Ki ± 0%  -20.43% (p=0.000 n=10)
geomean                                1.898Mi        1.502Mi       -20.84%

                       │ pkg/stanza/main-finder.txt │   pkg/stanza/finder-optimize.txt    │
                       │         allocs/op          │  allocs/op   vs base                │
Find10kFiles-10                         80.25k ± 0%   70.21k ± 0%  -12.51% (p=0.000 n=10)
Find10kFilesComplex-10                  9.437k ± 0%   8.341k ± 0%  -11.61% (p=0.000 n=10)
geomean                                 27.52k        24.20k       -12.06%

@djaglowski djaglowski added Run Windows Enable running windows test on a PR Run Darwin labels Oct 21, 2024
Copy link
Contributor

github-actions bot commented Nov 5, 2024

This PR was marked stale due to lack of activity. It will be closed in 14 days.

@github-actions github-actions bot added the Stale label Nov 5, 2024
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Nov 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg/stanza Run Darwin Run Windows Enable running windows test on a PR Stale
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant