Reduce frequency of two categories of Sev30s #12310
Merged
Description
As part of 7.4 qualification, we noticed two categories of frequent Sev30s:

1. `Type=StorageServerStatusJson` with `MissingAttribute=BytesStored`. This can happen if a storage server sends an incomplete metrics response to the cluster controller as part of status json generation. The fix here is to reduce the frequency of this trace event; currently it is only suppressed for 5 seconds.

2. `Type=SuppressionFromNonNetworkThread` with `Event=TLSPolicySuccess`. This one is interesting because it is a non-suppressable verbose event that gets triggered whenever we try to suppress another event. It is also interesting because the intent seems to be that we do not expect any suppressions from a non-fdbmain thread. That intent makes sense if the non-fdbmain threads are not doing much, because if they are, then by the law of probability we would want to suppress some of their events, as we normally do. The intent has lived in the codebase for multiple years, since before TLS. Post-TLS, the non-fdbmain threads are doing more work, and we should allow suppressions from them. So the fix here is to simply remove the `SuppressionFromNonNetworkThread` event and simplify the surrounding logic a bit (see the sketch after this list).

Both of these changes will be documented in the 7.4 release notes so that anyone relying on these events is aware before they upgrade.
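For readers unfamiliar with the suppression mechanism, here is a minimal, self-contained sketch of the behavior described above. The `TraceSuppressor` class, its method names, and the printed output are illustrative assumptions for this PR description, not FoundationDB's actual `Trace` implementation or the exact diff in this change:

```cpp
#include <chrono>
#include <iostream>
#include <map>
#include <mutex>
#include <string>
#include <thread>

// Simplified stand-in for trace-event suppression: an event is emitted at
// most once per suppression window, regardless of which thread raises it.
class TraceSuppressor {
	std::mutex mu;
	std::map<std::string, std::chrono::steady_clock::time_point> nextAllowed;

public:
	// Returns true if the event should be emitted; otherwise it is suppressed
	// until `window` has elapsed since the last emitted instance.
	bool shouldEmit(const std::string& event, std::chrono::seconds window) {
		std::lock_guard<std::mutex> lock(mu);
		auto now = std::chrono::steady_clock::now();
		auto it = nextAllowed.find(event);
		if (it != nextAllowed.end() && now < it->second)
			return false; // still inside the suppression window
		nextAllowed[event] = now + window;
		return true;
	}
};

int main() {
	TraceSuppressor suppressor;

	// Fix 1 (sketch): widen the suppression window so the event fires less
	// often. It is currently 5 seconds; the new interval chosen by the PR is
	// not shown here.
	if (suppressor.shouldEmit("StorageServerStatusJson", std::chrono::seconds(5)))
		std::cout << "Sev30 StorageServerStatusJson MissingAttribute=BytesStored\n";

	// Fix 2 (sketch): previously, a suppression request from a non-fdbmain
	// thread was not honored and instead logged its own non-suppressable
	// "SuppressionFromNonNetworkThread" event. After the fix, suppression
	// applies from any thread, so a burst of TLSPolicySuccess events raised
	// by TLS threads is rate-limited like any other event.
	std::thread tlsThread([&] {
		if (suppressor.shouldEmit("TLSPolicySuccess", std::chrono::seconds(5)))
			std::cout << "TLSPolicySuccess (first instance emitted)\n";
		if (!suppressor.shouldEmit("TLSPolicySuccess", std::chrono::seconds(5)))
			std::cout << "TLSPolicySuccess (subsequent instance suppressed)\n";
	});
	tlsThread.join();
	return 0;
}
```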
I will also cherry-pick this change to the main branch.
Testing
100K:

```
20250818-214742-praza-trace-bugs-v1-07e54a8-fd58a76001faf269 compressed=True data_size=40158474 duration=5038760 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:57:37 sanity=False started=100000 stopped=20250818-224519 submitted=20250818-214742 timeout=5400 username=praza-trace-bugs-v1-07e54a89702c9aef8946f937030bf8a4acd38e46
```
The 1 failure is in ParallelRestoreOldBackupCorrectnessAtomicOp.toml and is not related to this change.

Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
This change has also been applied to the next newer release-branch (or main, if this is the youngest branch).