
server: enhance graceful stop by closing connections after finish the ongoing txn #32111

Merged
merged 13 commits on Mar 7, 2023

Conversation

@july2993 (Contributor) commented on Feb 6, 2022

What problem does this PR solve?

Issue Number: close #32110

Problem Summary:
enhance graceful stop by closing the connection after finishing the ongoing txn

What is changed and how it works?

I tested the transaction failures caused by a rolling update of tidb and found that basically all transactions running at the time fail.

How to test

k apply -f tc.yaml deploys a cluster with version v5.4.0-pre (3 tidb, 1 tikv, 1 pd); the spec is as follows:

apiVersion: pingcap.com/v1alpha1
kind: TidbCluster
metadata:
  name: test
spec:
  version: v5.4.0-pre
  timezone: Asia/Kolkata
  pvReclaimPolicy: Delete
  enableDynamicConfiguration: true
  configUpdateStrategy: RollingUpdate
  discovery: {}
  helper:
    image: busybox:1.34.1
  pd:
    baseImage: pingcap/pd
    maxFailoverCount: 0
    enableDashboardInternalProxy: true
    service:
      type: ClusterIP
      clusterIP: "None"
    replicas: 1
    # if storageClassName is not set, the default Storage Class of the Kubernetes cluster will be used
    # storageClassName: local-storage
    requests:
      storage: "1Gi"
    config: {}
  tikv:
    baseImage: pingcap/tikv
    maxFailoverCount: 0
    evictLeaderTimeout: 1m
    replicas: 1
    requests:
      storage: "10Gi"
    config:
      storage:
        reserve-space: "0MB"
  tidb:
    # just update it to trigger rolling update
    annotations:
      kk: v9
    baseImage: pingcap/tidb
    maxFailoverCount: 0
    replicas: 3
    service:
      type: ClusterIP
    storageVolumes:
      - name: log
        # storageClassName: ${storageClass}
        storageSize: "10Gi"
        mountPath: /var/tidb/log
    config: |
      graceful-wait-before-shutdown=10
      [log]
      level = "debug"
      [log.file]
      filename = "/var/tidb/log/tidb.log"

Write a simple test program that opens n connections and runs transactions concurrently and continuously, each transaction simply executing two replace statements. The code is here.

k apply -f test.yaml runs a pod that executes the test program:

apiVersion: v1
kind: Pod
metadata:
  name: test
spec:
  containers:
  - name: write
    image: july2993/tk:latest
    command:
      - "/tk"
      - "test"
      - "--host"
      - "test-tidb"
      - "--max-connection"
      - "1000"

Modify the annotations in the tidb spec to trigger a rolling update of tidb:

...
 tidb:
    # just update it to trigger rolling update
    annotations:
      kk: v99
 ...

After restarting the three tidb instances, the number of failed transactions is 1275:

...
[mysql] 2022/02/05 07:00:13 packets.go:123: closing bad idle connection: EOF
[mysql] 2022/02/05 07:00:13 packets.go:37: read tcp 10.244.0.30:56934->10.96.1.38:4000: read: connection reset by peer
[mysql] 2022/02/05 07:00:13 packets.go:123: closing bad idle connection: EOF
...
[2022/02/05 07:00:13.425 +00:00] [ERROR] [test.go:81] ["failed to run txn"] [error="failed to begin txn: invalid connection"]
[mysql] 2022/02/05 07:00:13 packets.go:37: read tcp 10.244.0.30:57754->10.96.1.38:4000: read: connection reset by peer
...
[2022/02/05 07:02:37.508 +00:00] [ERROR] [test.go:81] ["failed to run txn"] [error="failed to exec statement: invalid connection"] [stack= "github.com/july2993/tk/cmd.glob..func1.2\n\t/go/src/github.com/july2993/tk/cmd/test.go:81"]
...
totalCount: 1030993, failCount: 1275, totalCountDiff: 2576, failCountDiff: 0

Note that some of the errors are reported during begin, and some are reported during exec statement.

Why it failed

Update on 2023-03-06:
After #37441, we no longer run into the TryGracefulDown case; instead, after deletionGracePeriodSeconds (default 30s), kubelet sends SIGKILL to stop tidb.

To understand why it fails, we need to understand how tidb handles exit and how the Go SQL driver works.

How tidb handles exit

Let's take a look at the code that handles signal exit:

func main() {
	// ... omit some code
	exited := make(chan struct{})
	signal.SetupSignalHandler(func(graceful bool) {
		svr.Close()
		cleanup(svr, storage, dom, graceful)
		cpuprofile.StopCPUProfiler()
		close(exited)
	})
	topsql.SetupTopSQL()
	terror.MustNil(svr.Run())
	<-exited
	syncLog()
}

Here graceful is true only when the signal is QUIT (the reason is unclear); we only need to consider the false case for now, because we send SIGTERM to stop tidb and then send SIGKILL directly after a certain period of time.

svr.Close() mainly does the following work (code):

  1. Set inShutdownMode to true and wait for s.cfg.GracefulWaitBeforeShutdown, so that the LB can first discover that this tidb is going away and remove it.
  2. Close all listeners to reject new connections.

Within cleanup(), what we mainly care about is the final call to TryGracefulDown():

// TryGracefulDown will try to gracefully close all connections first, with a timeout. If the timeout is reached, it will close all connections directly.
func (s *Server) TryGracefulDown() {
    ctx, cancel := context.WithTimeout(context.Background(), gracefulCloseConnectionsTimeout)
    defer cancel()
    done := make(chan struct{})
    go func() {
        s.GracefulDown(ctx, done)
    }()
    select {
    case <-ctx.Done():
        s.KillAllConnections()
    case <-done:
        return
    }
}

func (s *Server) GracefulDown(ctx context.Context, done chan struct{}) {
    logutil.Logger(ctx).Info("[server] graceful shutdown.")
    metrics.ServerEventCounter.WithLabelValues(metrics.EventGracefulDown).Inc()

    count := s.ConnectionCount()
    for i := 0; count > 0; i++ {
        s.kickIdleConnection()

        count = s.ConnectionCount()
        if count == 0 {
            break
        }
        // Print information for every 30s.
        if i%30 == 0 {
            logutil.Logger(ctx).Info("graceful shutdown...", zap.Int("conn count", count))
        }
        ticker := time.After(time.Second)
        select {
        case <-ctx.Done():
            return
        case <-ticker:
        }
    }
    close(done)
}

s.kickIdleConnection() in GracefulDown() scans s.clients (all connections the server maintains) and closes every connection that is not currently in a transaction. Note that the check runs only once per second, so a connection that is actively running transactions back to back may never be caught idle, no matter how many scans happen. After gracefulCloseConnectionsTimeout (15s), every remaining connection is closed directly, regardless of its state. The errors reported in the exec statement mentioned above all come from this direct close.
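
The per-second scan can be pictured with a toy model (the types here are hypothetical stand-ins, not the real server code):

```go
package main

import "fmt"

// clientConn is a toy stand-in for tidb's client connection.
type clientConn struct {
	id     int
	inTxn  bool
	closed bool
}

type server struct {
	clients map[int]*clientConn
}

// kickIdleConnection closes every connection that is not inside a
// transaction, as GracefulDown does once per second. A connection that
// is busy running back-to-back transactions can survive every scan,
// which is why the 15s timeout then closes it forcibly.
func (s *server) kickIdleConnection() {
	for id, cc := range s.clients {
		if !cc.inTxn {
			cc.closed = true
			delete(s.clients, id)
		}
	}
}

func main() {
	s := &server{clients: map[int]*clientConn{
		1: {id: 1, inTxn: false}, // idle: will be kicked
		2: {id: 2, inTxn: true},  // mid-transaction: survives the scan
	}}
	s.kickIdleConnection()
	fmt.Println("remaining connections:", len(s.clients))
}
```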

Go SQL driver

Here we use the driver https://github.com/go-sql-driver/mysql. The driver does not manage the connection pool itself; Go's database/sql package does. The driver implements some of the interfaces in the database/sql/driver package. It tells the sql package that a connection is invalid (for example, because the server closed it) by returning driver.ErrBadConn, and the sql package then retries with a new connection.

The connection check logic of go-sql-driver/mysql lives mainly in conncheck.go (see pr924). When a statement is first executed on a connection just taken from the pool, the driver reads from the connection in a non-blocking way. If no data can be read and the error is syscall.EAGAIN or syscall.EWOULDBLOCK, the connection is healthy; otherwise ErrBadConn is returned. In the test, some transactions fail at begin because the client side has not yet learned that the server is about to close (or has already closed) the connection, and the "START TRANSACTION" statement then fails on that connection.

Reason summary

You can see that tidb tries to close connections between transactions (or while they are idle). One type of failure is the race between the server closing a connection and the client checking the connection status. The other type is caused by tidb directly closing all remaining connections after trying gracefully for gracefulCloseConnectionsTimeout (the failures during exec statement).

Optimization

Each connection has a goroutine running func (cc *clientConn) Run(ctx context.Context). Run keeps reading a packet (a MySQL protocol packet) and processing it. We can change this so that Run itself notices the server is shutting down and closes the connection at the right time, namely between transactions, instead of relying on the external periodic check. This way every connection is closed immediately after its current transaction finishes, and there is no forced close after the timeout. The begin failures cannot be fully eliminated, however: because of the race, at the moment we close the connection the client may already have sent the statement that opens a transaction, and that send may even have succeeded at the TCP level. There is nothing we can do about that part.

After the modification, a retest gives failCount: 642, less than half of the previous failures. To measure the errors for transactions that are not very active, I changed the test program so that each goroutine takes a connection from the DB, runs one transaction, sleeps for one second, and repeats. Retesting:

| version           | failCount |
| ----------------- | --------- |
| v5.4.0-pre        | 57        |
| Optimized version | 0         |

The optimized version shows no failures in this test (though zero failures cannot be guaranteed in general).

A further possible optimization is to close the TCP read side on the server before processing the commit statement (see shutdown / TCPConn.CloseRead). If the driver's connCheck were implemented to detect that the read side has been closed, then in theory our test program would see no failures at all during a rolling update of tidb.

Summary

The optimized version of tidb reduces client-side failures caused by restarting tidb when a connection pool is in use: it closes each connection after the ongoing txn finishes, and no longer closes connections directly after gracefulCloseConnectionsTimeout (assuming no single transaction runs longer than gracefulCloseConnectionsTimeout).

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

None

@ti-chi-bot (Member) commented on Feb 6, 2022

[REVIEW NOTIFICATION]

This pull request has been approved by:

  • hawkingrei
  • tiancaiamao

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@ti-chi-bot ti-chi-bot added release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 6, 2022
@sre-bot (Contributor) commented on Feb 6, 2022

Please follow PR Title Format:

  • pkg [, pkg2, pkg3]: what is changed

Or if the count of mainly changed packages are more than 3, use

  • *: what is changed

After you have format title, you can leave a comment /run-check_title to recheck it

@sre-bot (Contributor) posted the same PR title format reminder again on Feb 7, 2022.

@ti-chi-bot ti-chi-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 8, 2022
@bb7133 bb7133 requested a review from tiancaiamao March 2, 2023 14:15
When graceful, keep waiting until all connections are drained.
When not, try for 15s as before.
Note that after pingcap#37441 the "graceful" check changed.
@july2993 (Contributor, Author) commented on Mar 6, 2023

/retest

@ti-chi-bot ti-chi-bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Mar 6, 2023
@july2993 july2993 changed the title enhance graceful stop by closing connections after finish the ongoing txn server: enhance graceful stop by closing connections after finish the ongoing txn Mar 7, 2023
@ti-chi-bot ti-chi-bot added the status/LGT1 Indicates that a PR has LGTM 1. label Mar 7, 2023
@ti-chi-bot ti-chi-bot added status/LGT2 Indicates that a PR has LGTM 2. and removed status/LGT1 Indicates that a PR has LGTM 1. labels Mar 7, 2023
@hawkingrei (Member)

/retest

@july2993 (Contributor, Author) commented on Mar 7, 2023

/merge

@ti-chi-bot (Member)

This pull request has been accepted and is ready to merge.

Commit hash: 1d62710

@ti-chi-bot ti-chi-bot added the status/can-merge Indicates a PR has been approved by a committer. label Mar 7, 2023
@ti-chi-bot ti-chi-bot merged commit 0d7e7b8 into pingcap:master Mar 7, 2023
@djshow832 djshow832 added the needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. label Nov 27, 2023
@ti-chi-bot (Member)

In response to a cherrypick label: new pull request created to branch release-6.5: #48905.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Nov 27, 2023
Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot pushed a commit that referenced this pull request Nov 28, 2023
crazycs520 pushed a commit to crazycs520/tidb that referenced this pull request Nov 29, 2023
crazycs520 added a commit that referenced this pull request Nov 29, 2023
… ongoing txn (#32111) (#48905) (#48989)

close #32110

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>
bb7133 added a commit to bb7133/tidb that referenced this pull request Dec 4, 2023
guoshouyan pushed a commit to guoshouyan/tidb that referenced this pull request Mar 5, 2024
… ongoing txn (pingcap#32111) (pingcap#48905) (pingcap#29)

close pingcap#32110

Co-authored-by: Ti Chi Robot <ti-community-prow-bot@tidb.io>