Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fileconsumer] Deduplicate fingerprints less aggressively #24235

Merged
merged 2 commits into from
Jul 17, 2023

Conversation

djaglowski
Copy link
Member

@djaglowski djaglowski commented Jul 12, 2023

When finding files, we would check for files with duplicate fingerprints and deduplicate them if a.StartsWith(b) or b.StartsWith(a). This StartsWith logic is useful for recognizing files that have new content since the previous poll interval. However, when deduplicating files observed at the same moment, the only case that we need to consider is copy/truncate rotation. In this case, a file may have been copied but the original has not yet been truncated. At that moment we should expect to find two files with exactly the same fingerprint. Therefore, we do not need to check StartsWith cases.

@djaglowski djaglowski added the Run Windows Enable running windows test on a PR label Jul 12, 2023
@djaglowski djaglowski force-pushed the strict-dedup branch 2 times, most recently from 0f83c6a to b23466d Compare July 13, 2023 15:57
@djaglowski djaglowski marked this pull request as ready for review July 13, 2023 19:48
@djaglowski djaglowski requested review from a team and atoulme July 13, 2023 19:48
Copy link
Contributor

@atoulme atoulme left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If possible it'd be great to just have some coverage in a unit test.

When finding files, we would check for files with duplicate fingerprints
and deduplicate them if a.StartsWith(b) or b.StartsWith(a). This StartsWith
logic is useful for recognizing files that have new content since the previous
poll interval. However, when deduplicating the results of the poll interval,
the only case that we need to consider is copy/truncate rotation. In this case,
a file may have been copied but the original has not yet been trucated. At that
moment we should expect to find two files with exactly the same fingerprint.
Therefore, we do not need to check StartsWith cases.
@djaglowski djaglowski merged commit 84df0e6 into open-telemetry:main Jul 17, 2023
@djaglowski djaglowski deleted the strict-dedup branch July 17, 2023 14:43
@github-actions github-actions bot added this to the next release milestone Jul 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pkg/stanza Run Windows Enable running windows test on a PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants