Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

meta/autoid: make autoid client ResetConn operation concurrency-safe #50522

Merged
merged 2 commits into from
Jan 19, 2024

Conversation

tiancaiamao
Copy link
Contributor

What problem does this PR solve?

Issue Number: close #50519

Problem Summary:

What changed and how does it work?

The old logic is not concurrency-safe. Although some of the fields are protected by mutex, the logic is not correct.

Imagine that there are 7K concurrent Alloc() operation, and one of them meet rpc error.
Then one ResetConn() is called, which invokes grpcConn.Close().
Note, there are many ongoing calling of Alloc() and still using the conn, so close the grpcConn cause this error:

rpc error: code = Canceled desc = grpc: the client connection is closing

This error in turn cause ResetConn() calling.
Even though GetClient() get a new client, the new client might be reset again by the ResetConn() mistakenly!
So the client can not recover from the error automatically some times (when the concurrent contention is high enough).

To summarize, there are too things we should avoid:

  1. Calling grpcConn.Close() in ResetConn() when the grpcConn might still been using by some other sessions.
  2. ResetConn() could be called multiple times and mistakenly reset the new connection.

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)

Rename this to cmd/autoid/main.go and go run it (modify tidb/pkg/autoid_service/autoid.go and mock 1/1000 percent error rate also).

main.txt

Use the test described in #50519, after the fix, the QPS become stable:

go test -run TestXXX
==== qps ==== 120864
==== qps ==== 143747
==== qps ==== 143514
==== qps ==== 149449
==== qps ==== 144163
==== qps ==== 147491
==== qps ==== 139042
==== qps ==== 144311
==== qps ==== 146520
==== qps ==== 139134
==== qps ==== 144419
==== qps ==== 145943
==== qps ==== 146202
==== qps ==== 137703
==== qps ==== 142836
==== qps ==== 144951
==== qps ==== 142361
==== qps ==== 142149
==== qps ==== 148664
==== qps ==== 143449
==== qps ==== 146752
==== qps ==== 142496

And the error message like this is gone:

rpc error: code = Canceled desc = grpc: the client connection is closing

  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

Make autoid client ResetConn() operation concurrency-safe, get rid of error message like "rpc error: code = Canceled desc = grpc: the client connection is closing" when using AUTO_ID_CACHE=1

@ti-chi-bot ti-chi-bot bot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/needs-triage-completed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jan 17, 2024
Copy link

tiprow bot commented Jan 17, 2024

Hi @tiancaiamao. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

func (d *ClientDiscover) ResetConn(reason error) {
func (d *ClientDiscover) ResetConn(version uint64, reason error) {
// Avoid repeated Reset operation
if !atomic.CompareAndSwapUint64(&d.version, version, version+1) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Introduce a version variable to avoid repeated ResetConn() and this also avoid a stale ResetConn() reset the newly created grpcConn mistakenly.

go func() {
// Doen't close the conn immediately, in case the other sessions are still using it.
time.Sleep(200 * time.Millisecond)
err := grpcConn.Close()
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Delay the Close() operation, so the other session that still reference this grpcConn do not get the "grpc: the client connection is closing" error.

@ti-chi-bot ti-chi-bot bot added needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. and removed do-not-merge/needs-triage-completed labels Jan 18, 2024
Copy link

codecov bot commented Jan 18, 2024

Codecov Report

Merging #50522 (d331238) into master (245951f) will decrease coverage by 16.1956%.
Report is 27 commits behind head on master.
The diff coverage is 16.0000%.

Additional details and impacted files
@@                Coverage Diff                @@
##             master     #50522         +/-   ##
=================================================
- Coverage   71.9131%   55.7175%   -16.1956%     
=================================================
  Files          1449       1564        +115     
  Lines        347105     588757     +241652     
=================================================
+ Hits         249614     328041      +78427     
- Misses        77205     237861     +160656     
- Partials      20286      22855       +2569     
Flag Coverage Δ
integration 36.9161% <16.0000%> (?)
unit 70.3109% <16.0000%> (-1.6022%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Components Coverage Δ
dumpling 54.0269% <ø> (-2.2860%) ⬇️
parser ∅ <ø> (∅)
br 56.6142% <ø> (+5.3109%) ⬆️

@tiancaiamao
Copy link
Contributor Author

/test mysql-test

Copy link

tiprow bot commented Jan 18, 2024

@tiancaiamao: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test mysql-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@tiancaiamao
Copy link
Contributor Author

/test mysql-test

Copy link

tiprow bot commented Jan 18, 2024

@tiancaiamao: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/test mysql-test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Jan 18, 2024
Copy link
Member

@bb7133 bb7133 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Jan 19, 2024
Copy link

ti-chi-bot bot commented Jan 19, 2024

[LGTM Timeline notifier]

Timeline:

  • 2024-01-18 12:46:57.718512631 +0000 UTC m=+446459.282810326: ☑️ agreed by xhebox.
  • 2024-01-19 02:21:23.031171427 +0000 UTC m=+495324.595469132: ☑️ agreed by bb7133.

Copy link

ti-chi-bot bot commented Jan 19, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bb7133, xhebox, zimulala

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Jan 19, 2024
@@ -56,6 +57,8 @@ type ClientDiscover struct {
// See https://github.com/grpc/grpc-go/issues/5321
*grpc.ClientConn
}
// version is increased in every ResetConn() to make the operation safe.
version uint64
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use atomic.Uint64 ?

@ti-chi-bot ti-chi-bot bot merged commit d8298d5 into pingcap:master Jan 19, 2024
20 checks passed
ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Jan 19, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.1: #50592.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #50605.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.6: #50621.

@ti-chi-bot ti-chi-bot added the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Feb 6, 2024
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #51013.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Feb 6, 2024
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved lgtm needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. needs-cherry-pick-release-7.6 release-note Denotes a PR that will be considered when it comes time to generate release notes. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

autoid client reset connection operation is not concurrency-safe
6 participants