Skip to content

[release-7.4] Fix some potential DatabaseContext leaks in NativeApi #12308

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 18, 2025

Conversation

vishesh
Copy link
Contributor

@vishesh vishesh commented Aug 15, 2025

  1. Only start clientStatusUpdateActor when DatabaseContext successfully established connection to the cluster.

  2. DatabaseContext starts few actors for monitoring as well as update client status to server. These sometimes pass pointers to DatabaseContext and the Transaction object created within these actors will increment the refcount. This can lead to cyclic references or holding DatabaseContext objects for long period of times if Transaction object is not cleaned up or we keep retrying forever without any limit.

  3. Some actors don't handle actor_cancelled errors which can lead then to stay alive forever. This patch fixes some of those.

rdar://155780163 found that in multi-version client if primary is incompatible with the cluster, we keep trying to reconnect with cluster, and spamming with IncompatibleConnectionClosed messages. (1) should be enough to fix that, but other issues were found during investigation which can potentially lead to similar issues in future.

Testing:

Manually started a 7.1 cluster, with 7.4 primary client and 7.1. secondary client. Started a client. Without this patch we'll see bunch of IncompatibleConnectionClosed messages, and with this patch they will be gone.

Cherry-Pick: #12304
Joshua: 20250815-215451-vishesh-2b90562fb390f691 pass=100000 remaining=0

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@vishesh vishesh requested a review from spraza as a code owner August 15, 2025 21:29
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: e22a3d1
  • Duration 0:40:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: e22a3d1
  • Duration 0:47:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: e22a3d1
  • Duration 0:58:31
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: e22a3d1
  • Duration 1:04:39
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: e22a3d1
  • Duration 1:06:46
  • Result: ❌ FAILED
  • Error: Error while executing command: if python3 -m joshua.joshua list --stopped | grep ${ENSEMBLE_ID} | grep -q 'pass=10[0-9][0-9][0-9]'; then echo PASS; else echo FAIL && exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@vishesh vishesh requested a review from kakaiu August 15, 2025 22:37
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: e22a3d1
  • Duration 1:12:51
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: 4d92a50
  • Duration 0:39:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: 4d92a50
  • Duration 0:47:22
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: 4d92a50
  • Duration 1:03:01
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: 4d92a50
  • Duration 1:03:37
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: 4d92a50
  • Duration 1:05:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: 4d92a50
  • Duration 1:11:53
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

if (e.code() == error_code_actor_cancelled || retries++ >= 10) {
throw;
}

wait(tr->onError(e));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What error do we retry here? Thanks!

Copy link
Contributor Author

@vishesh vishesh Aug 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the transaction fails for any reasons, say for getRange / commit etc. Earlier it would loop forever retrying. Its undesirable, we can easily be in a situation where the commit will never succeed (say the server upgraded), and the infinite loop means transaction that we take from parent actor will keep holding DatabaseContext reference which will in turn hold reference to parent actor leading to a cycle and no cleanup.

1. Only start `clientStatusUpdateActor` when `DatabaseContext` successfully established connection
to the cluster.

2. `DatabaseContext` starts few actors for monitoring as well as update client status to server.
These sometimes pass pointers to `DatabaseContext` and the `Transaction` object created within these
actors will increment the refcount. This can lead to cyclic references or holding `DatabaseContext`
objects for long period of times if `Transaction` object is not cleaned up or we keep retrying
forever without any limit.

3. Some actors don't handle `actor_cancelled` errors which can lead then to stay alive forever. This
patch fixes some of those.

rdar://155780163 found that in multi-version client if primary is incompatible with the cluster, we
keep trying to reconnect with cluster, and spamming with IncompatibleConnectionClosed messages. (1)
should be enough to fix that, but other issues were found during investigation which can potentially
lead to similar issues in future.

Testing:

Manually started a 7.1 cluster, with 7.4 primary client and 7.1. secondary client. Started a client.
Without this patch we'll see bunch of IncompatibleConnectionClosed messages, and with this patch
they will be gone.
@kakaiu kakaiu self-requested a review August 18, 2025 17:42
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x

  • Commit ID: ae031e1
  • Duration 0:40:02
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: ae031e1
  • Duration 0:47:45
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-macos on macOS Ventura 13.x

  • Commit ID: ae031e1
  • Duration 1:02:13
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: ae031e1
  • Duration 1:08:41
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: ae031e1
  • Duration 1:11:25
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: ae031e1
  • Duration 1:13:33
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@vishesh vishesh merged commit a4c1786 into apple:release-7.4 Aug 18, 2025
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants