-
Notifications
You must be signed in to change notification settings - Fork 1.4k
[release-7.4] Fix some potential DatabaseContext leaks in NativeApi #12308
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
if (e.code() == error_code_actor_cancelled || retries++ >= 10) { | ||
throw; | ||
} | ||
|
||
wait(tr->onError(e)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What error do we retry here? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the transaction fails for any reasons, say for getRange
/ commit
etc. Earlier it would loop forever retrying. Its undesirable, we can easily be in a situation where the commit will never succeed (say the server upgraded), and the infinite loop means transaction that we take from parent actor will keep holding DatabaseContext reference which will in turn hold reference to parent actor leading to a cycle and no cleanup.
1. Only start `clientStatusUpdateActor` when `DatabaseContext` successfully established connection to the cluster. 2. `DatabaseContext` starts few actors for monitoring as well as update client status to server. These sometimes pass pointers to `DatabaseContext` and the `Transaction` object created within these actors will increment the refcount. This can lead to cyclic references or holding `DatabaseContext` objects for long period of times if `Transaction` object is not cleaned up or we keep retrying forever without any limit. 3. Some actors don't handle `actor_cancelled` errors which can lead then to stay alive forever. This patch fixes some of those. rdar://155780163 found that in multi-version client if primary is incompatible with the cluster, we keep trying to reconnect with cluster, and spamming with IncompatibleConnectionClosed messages. (1) should be enough to fix that, but other issues were found during investigation which can potentially lead to similar issues in future. Testing: Manually started a 7.1 cluster, with 7.4 primary client and 7.1. secondary client. Started a client. Without this patch we'll see bunch of IncompatibleConnectionClosed messages, and with this patch they will be gone.
Result of foundationdb-pr-macos-m1 on macOS Ventura 13.x
|
Result of foundationdb-pr-clang-arm on Linux CentOS 7
|
Result of foundationdb-pr-macos on macOS Ventura 13.x
|
Result of foundationdb-pr on Linux RHEL 9
|
Result of foundationdb-pr-cluster-tests on Linux RHEL 9
|
Result of foundationdb-pr-clang on Linux RHEL 9
|
Only start
clientStatusUpdateActor
whenDatabaseContext
successfully established connection to the cluster.DatabaseContext
starts few actors for monitoring as well as update client status to server. These sometimes pass pointers toDatabaseContext
and theTransaction
object created within these actors will increment the refcount. This can lead to cyclic references or holdingDatabaseContext
objects for long period of times ifTransaction
object is not cleaned up or we keep retrying forever without any limit.Some actors don't handle
actor_cancelled
errors which can lead then to stay alive forever. This patch fixes some of those.rdar://155780163 found that in multi-version client if primary is incompatible with the cluster, we keep trying to reconnect with cluster, and spamming with IncompatibleConnectionClosed messages. (1) should be enough to fix that, but other issues were found during investigation which can potentially lead to similar issues in future.
Testing:
Manually started a 7.1 cluster, with 7.4 primary client and 7.1. secondary client. Started a client. Without this patch we'll see bunch of IncompatibleConnectionClosed messages, and with this patch they will be gone.
Cherry-Pick: #12304
Joshua: 20250815-215451-vishesh-2b90562fb390f691 pass=100000 remaining=0
Code-Reviewer Section
The general pull request guidelines can be found here.
Please check each of the following things and check all boxes before accepting a PR.
For Release-Branches
If this PR is made against a release-branch, please also check the following:
release-branch
ormain
if this is the youngest branch)