
Reduce frequency of two categories of Sev30s #12310


Merged: 1 commit into apple:release-7.4 on Aug 20, 2025

Conversation

@spraza (Collaborator) commented Aug 19, 2025

Description

As part of 7.4 qualification, we noticed two categories of frequent Sev30s:

  1. Type=StorageServerStatusJson with MissingAttribute=BytesStored. This can happen when a storage server sends an incomplete metrics response to the cluster controller during status json generation. The fix is to reduce the frequency of this trace event, which is currently suppressed for only 5 seconds.

  2. Type=SuppressionFromNonNetworkThread with Event=TLSPolicySuccess. This is interesting because a non-suppressible verbose event gets triggered whenever we try to suppress another event. It also reveals the original intent: we did not expect any suppressions from a non-fdbmain thread. That intent makes sense if the non-fdbmain threads are not doing much; if they are doing real work, they will inevitably emit events that we would want to suppress, as we normally do. The intent has lived in the codebase for multiple years and predates TLS. Post-TLS, the non-fdbmain threads do more work, so we should allow suppressions from them. The fix is to remove the SuppressionFromNonNetworkThread event and simplify the surrounding logic a bit.
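
To illustrate the idea behind both items (time-based suppression that may be requested from any thread), here is a minimal sketch. It is not the code changed in this PR and not FoundationDB's actual implementation: `SuppressionTracker`, `shouldLog`, and the explicit `now` argument are hypothetical names chosen for illustration, and the mutex is simply one assumed way to make the bookkeeping safe for non-fdbmain threads.

```cpp
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical sketch: tracks, per event type, the time until which
// further occurrences should be suppressed. All names are illustrative.
class SuppressionTracker {
public:
    // Returns true if the event should be logged at time `now` (seconds),
    // and false if it falls inside the suppression window opened by an
    // earlier logged occurrence of the same event type.
    bool shouldLog(const std::string& eventType, double now, double intervalSeconds) {
        // Assumption: a mutex guards the map so any thread may call this.
        std::lock_guard<std::mutex> guard(mutex_);
        auto it = suppressedUntil_.find(eventType);
        if (it != suppressedUntil_.end() && now < it->second) {
            return false; // still inside the suppression window
        }
        // Log this occurrence and open a new suppression window.
        suppressedUntil_[eventType] = now + intervalSeconds;
        return true;
    }

private:
    std::mutex mutex_;
    std::unordered_map<std::string, double> suppressedUntil_;
};
```

With a 5-second interval, a second `TLSPolicySuccess` at t=1 is dropped while one at t=5 is logged again; passing a larger interval is what "reducing the frequency" of an event amounts to in this sketch.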

Both changes will be documented in the 7.4 release notes so that anyone relying on these events is aware before upgrading.

I will also cherry-pick this change to the main branch.

Testing

100K: 20250818-214742-praza-trace-bugs-v1-07e54a8-fd58a76001faf269 compressed=True data_size=40158474 duration=5038760 ended=100000 fail=1 fail_fast=10 max_runs=100000 pass=99999 priority=100 remaining=0 runtime=0:57:37 sanity=False started=100000 stopped=20250818-224519 submitted=20250818-214742 timeout=5400 username=praza-trace-bugs-v1-07e54a89702c9aef8946f937030bf8a4acd38e46

The 1 failure is in ParallelRestoreOldBackupCorrectnessAtomicOp.toml and is not related to this change.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@foundationdb-ci (Contributor)

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 07e54a8
  • Duration 0:38:58
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci (Contributor)

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 07e54a8
  • Duration 0:49:04
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci (Contributor)

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 07e54a8
  • Duration 0:57:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci (Contributor)

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 07e54a8
  • Duration 1:02:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci (Contributor)

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 07e54a8
  • Duration 1:06:36
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci (Contributor)

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 07e54a8
  • Duration 1:10:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@spraza spraza merged commit 6198f1d into apple:release-7.4 Aug 20, 2025
6 checks passed