Skip to content

[release-7.3] Fix some potential DatabaseContext leaks in NativeApi #12309

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 18, 2025

Conversation

vishesh
Copy link
Contributor

@vishesh vishesh commented Aug 15, 2025

  1. Only start clientStatusUpdateActor when DatabaseContext successfully established connection to the cluster.

  2. DatabaseContext starts few actors for monitoring as well as update client status to server. These sometimes pass pointers to DatabaseContext and the Transaction object created within these actors will increment the refcount. This can lead to cyclic references or holding DatabaseContext objects for long period of times if Transaction object is not cleaned up or we keep retrying forever without any limit.

  3. Some actors don't handle actor_cancelled errors which can lead then to stay alive forever. This patch fixes some of those.

rdar://155780163 found that in multi-version client if primary is incompatible with the cluster, we keep trying to reconnect with cluster, and spamming with IncompatibleConnectionClosed messages. (1) should be enough to fix that, but other issues were found during investigation which can potentially lead to similar issues in future.

Testing:

Manually started a 7.1 cluster, with 7.4 primary client and 7.1. secondary client. Started a client. Without this patch we'll see bunch of IncompatibleConnectionClosed messages, and with this patch they will be gone.

Code-Reviewer Section

The general pull request guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or main if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

1. Only start `clientStatusUpdateActor` when `DatabaseContext` successfully established connection
to the cluster.

2. `DatabaseContext` starts few actors for monitoring as well as update client status to server.
These sometimes pass pointers to `DatabaseContext` and the `Transaction` object created within these
actors will increment the refcount. This can lead to cyclic references or holding `DatabaseContext`
objects for long period of times if `Transaction` object is not cleaned up or we keep retrying
forever without any limit.

3. Some actors don't handle `actor_cancelled` errors which can lead then to stay alive forever. This
patch fixes some of those.

rdar://155780163 found that in multi-version client if primary is incompatible with the cluster, we
keep trying to reconnect with cluster, and spamming with IncompatibleConnectionClosed messages. (1)
should be enough to fix that, but other issues were found during investigation which can potentially
lead to similar issues in future.

Testing:

Manually started a 7.1 cluster, with 7.4 primary client and 7.1. secondary client. Started a client.
Without this patch we'll see bunch of IncompatibleConnectionClosed messages, and with this patch
they will be gone.
@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang-arm on Linux CentOS 7

  • Commit ID: c034183
  • Duration 0:04:45
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-clang on Linux RHEL 9

  • Commit ID: c034183
  • Duration 0:04:50
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr-cluster-tests on Linux RHEL 9

  • Commit ID: c034183
  • Duration 0:06:57
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)
  • Cluster Test Logs zip file of the test logs (available for 30 days)

@foundationdb-ci
Copy link
Contributor

Result of foundationdb-pr on Linux RHEL 9

  • Commit ID: c034183
  • Duration 0:07:07
  • Result: ❌ FAILED
  • Error: Error while executing command: ninja -v -C build_output -j ${NPROC} all packages strip_targets. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@jzhou77 jzhou77 merged commit 6401fa4 into apple:release-7.3 Aug 18, 2025
0 of 4 checks passed
jzhou77 pushed a commit to jzhou77/foundationdb that referenced this pull request Aug 20, 2025
1. Only start `clientStatusUpdateActor` when `DatabaseContext` successfully established connection
to the cluster.

2. `DatabaseContext` starts few actors for monitoring as well as update client status to server.
These sometimes pass pointers to `DatabaseContext` and the `Transaction` object created within these
actors will increment the refcount. This can lead to cyclic references or holding `DatabaseContext`
objects for long period of times if `Transaction` object is not cleaned up or we keep retrying
forever without any limit.

3. Some actors don't handle `actor_cancelled` errors which can lead then to stay alive forever. This
patch fixes some of those.

rdar://155780163 found that in multi-version client if primary is incompatible with the cluster, we
keep trying to reconnect with cluster, and spamming with IncompatibleConnectionClosed messages. (1)
should be enough to fix that, but other issues were found during investigation which can potentially
lead to similar issues in future.

Testing:

Manually started a 7.1 cluster, with 7.4 primary client and 7.1. secondary client. Started a client.
Without this patch we'll see bunch of IncompatibleConnectionClosed messages, and with this patch
they will be gone.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants