
TiCDC may incorrectly cancel gRPC streams to TiKV, causing the resolvedTs to become stuck #10239

Closed
hicqu opened this issue Dec 4, 2023 · 12 comments · Fixed by #10237
Assignees
Labels
affects-6.5 · affects-7.1 · affects-7.5 · area/ticdc (Issues or PRs related to TiCDC) · release-blocker (This issue blocks a release. Please solve it ASAP.) · report/customer (Customers have encountered this bug.) · severity/moderate · type/bug (The issue is confirmed as a bug.)

Comments

@hicqu
Contributor

hicqu commented Dec 4, 2023

What did you do?

We have found that there can be many "stream to store closed" errors.
We think it's caused by this logic in cdc/kv/region_worker.go (a sketch of the intended check follows the snippet):

// checkShouldExit checks whether the region worker should exit, if exit return an error
func (w *regionWorker) checkShouldExit() error {
    empty := w.checkRegionStateEmpty()
    // If there is no region maintained by this region worker, exit it and
    // cancel the gRPC stream.
    if empty { /* This is wrong because pending regions are not considered correctly. */
        w.cancelStream(time.Duration(0))
        return cerror.ErrRegionWorkerExit.GenWithStackByArgs()
    }
    return nil
}
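
For illustration only (this is not the actual fix in #10237, and pendingRegionsEmpty is a hypothetical helper), a minimal sketch of what taking pending regions into account could look like:

// checkShouldExit (sketch): only exit and cancel the gRPC stream when the
// worker maintains no regions AND no pending regions are still waiting to
// be attached to this stream.
func (w *regionWorker) checkShouldExit() error {
    empty := w.checkRegionStateEmpty()
    if empty && w.pendingRegionsEmpty() { // hypothetical check for pending regions
        w.cancelStream(time.Duration(0))
        return cerror.ErrRegionWorkerExit.GenWithStackByArgs()
    }
    return nil
}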

What did you expect to see?

No response

What did you see instead?

Pending regions should be considered correctly in checkShouldExit.

Versions of the cluster

6.5.x, 7.1.x.

Fix versions

  • v6.5.9
  • v7.1.4
@hicqu
Contributor Author

hicqu commented Dec 4, 2023

It's a duplicate of #10136.

@asddongmen
Contributor

$ grep deregister tikv-server-2024-01-17T23-03-35.694.log | grep conn|wc -l
9289

In a recent incident investigation, we found that CDC latency kept increasing, reaching a level of around ten minutes, after a TiKV upgrade.

The grep result above shows that in v6.5.6 TiCDC still cancels gRPC connections incorrectly (9289 cancellations in 30 seconds, which is unexpected).
We need to reopen this issue and investigate further to solve it.

@asddongmen asddongmen reopened this Jan 25, 2024
@asddongmen asddongmen removed the duplicate Issues or pull requests already exists. label Jan 26, 2024
@asddongmen
Contributor

I successfully reproduced the user's phenomenon locally by using the modified kvClient in asddongmen@c78377f.


The root cause of this problem is that the streams inside the eventFeedSession, which represent connections to TiKV stores, are added/canceled/deleted in multiple goroutines. The ordering of these actions is not properly handled, so a stream can be created in one goroutine (g-1) and then deleted by another goroutine (g-2). Eventually, this leads to TiCDC continuously initiating and canceling connections to the same TiKV in a loop.

I will file a pull request to fix this issue ASAP.

@asddongmen asddongmen self-assigned this Jan 26, 2024
@asddongmen asddongmen changed the title TiCDC may cancel grpc streams incorrectly TiCDC may incorrectly cancel gRPC streams to TiKV, causing the resolvedTs to become stuck Jan 26, 2024
@asddongmen
Contributor

asddongmen commented Jan 29, 2024

The root cause is described as follows:
In the kvClient, the requestRegionToStore function listens for new region information from the regionRouter in a for loop. Upon receiving new region info, it checks for an existing gRPC stream in the cache:

  • If it doesn't exist, a new gRPC stream to the store is created and added to the cache using store.addr as the key. Then, a goroutine is spawned (let's call it g-2) to handle events from the gRPC stream. [1]
  • If it exists, CDC sends a new request to listen to the region. [2]

After a TiKV restart, an error at line:699 leads to the corresponding gRPC stream being canceled and deleted from the cache by store.addr at line:728 (delete-1). The g-2 goroutine notices the cancellation and exits; then, at line:672, it deletes the stream from the cache by store.addr again (delete-2).

If a new region for this store arrives from the regionRouter between delete-1 and delete-2, a new stream to the store is created and cached as described at [1]. The stream deleted from the cache at delete-2 is then this new stream. This can lead to a continuous loop of stream creation and deletion, which only stops once g-2 manages to exit before a new stream is created.
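
To make the interleaving concrete, here is a small self-contained Go sketch (illustrative only; streamCache and the stream type are stand-ins, not the real kvClient code) showing how deleting by store.addr alone can remove a stream that was just re-created by another code path:

package main

import (
    "fmt"
    "sync"
)

// stream stands in for a gRPC stream to one TiKV store.
type stream struct{ id int }

// streamCache mimics a cache keyed only by store address.
type streamCache struct {
    mu sync.Mutex
    m  map[string]*stream
}

func (c *streamCache) add(addr string, s *stream) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.m[addr] = s
}

func (c *streamCache) deleteByAddr(addr string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    delete(c.m, addr)
}

func main() {
    cache := &streamCache{m: map[string]*stream{}}
    addr := "tikv-8:20160"

    cache.add(addr, &stream{id: 1}) // stream s1 created and cached, as at [1]

    // delete-1: the send path hits an error and removes s1 by addr.
    cache.deleteByAddr(addr)

    // A new region for this store arrives in between: s2 is created and cached.
    cache.add(addr, &stream{id: 2})

    // delete-2: the old receive goroutine (g-2) exits and deletes by addr again,
    // unintentionally removing the brand-new s2 instead of the already-gone s1.
    cache.deleteByAddr(addr)

    _, ok := cache.m[addr]
    fmt.Println("stream for", addr, "still cached?", ok) // false: s2 was dropped
}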

@shiqinwang

Thanks @asddongmen for the detailed explanation. I have a few questions:

  1. Why does the delete-2 happen? Is it an unexpected behavior?
  2. How do you know this is what happened in the reported incident? Are there any logs as evidence?
  3. What is the error at line 699 that leads to the stream cancellation?

I am a bit skeptical about this root cause, as it seems pretty rare that the exact sequence (delete-1, create a new stream, delete-2) can happen, let alone that in the incident the impact lasted for 15 minutes. Do you mean the exact sequence occurred in a loop for 15 minutes, and for ~200 regions at the same time?

@asddongmen
Contributor

asddongmen commented Jan 31, 2024

@shiqinwang

  1. delete-2 is unexpected. I have removed it in this PR: https://github.com/pingcap/tiflow/pull/10559/files#diff-31a8a1f4c2bd92516aebe3c694c8ab06eb69f88aa09a95212f1f5e81a91fcf1cL696

  2. Yes, there are logs as evidence:

    grep -E 'creating new stream to store to send request|stream to store closed|receive from stream canceled'  ticdc-server-2024-01-17T23-03-53.087.log | grep "addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160" | wc -l
        2126
    

    The grep result above indicates that the kvClient creates and cancels streams to the same store (jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160) about a thousand times within 10 seconds. Additionally, other logs reveal that all these streams are associated with the same table: [tableID=189] [tableName=jobs.actions].

    Examining the logs, we observe multiple "stream to store closed" entries appearing between two "creating new stream to store to send request" entries, whereas a creation and a close are expected to occur in pairs.

    [2024/01/17 23:03:47.312 +00:00] [INFO] [client.go:663] ["creating new stream to store to send request"] [namespace=default] [changefeed=jobs-jobs] [regionID=55512292] [requestID=9106229] [storeID=52517211] [addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160]
    
    [2024/01/17 23:03:47.320 +00:00] [INFO] [client.go:1060] ["receive from stream canceled"] [namespace=default] [changefeed=jobs-jobs] [addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160] [storeID=52517211]
    [2024/01/17 23:03:47.320 +00:00] [INFO] [client.go:995] ["stream to store closed"] [namespace=default] [changefeed=jobs-jobs] [addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160] [storeID=52517211]
    [2024/01/17 23:03:47.321 +00:00] [INFO] [client.go:995] ["stream to store closed"] [namespace=default] [changefeed=jobs-jobs] [addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160] [storeID=52517211]
    
    [2024/01/17 23:03:47.323 +00:00] [INFO] [client.go:663] ["creating new stream to store to send request"] [namespace=default] [changefeed=jobs-jobs] [regionID=55954657] [requestID=9106268] [storeID=52517211] [addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160]
    

    Based on the logs above, it appears that there are instances of unexpected creation and deletion of gRPC streams.

  3. The provided logs are incomplete, so we cannot determine the cause of the original error at line 699 (err = stream.client.Send(req)).

  4. It works like this.
    All regions of the same store share a single gRPC stream. When that stream is canceled, all the regions (potentially thousands) that share it need to reconnect due to the error.
    Therefore, once this situation occurs, it causes a cascade effect: when a stream is canceled due to an error, the other regions that depend on it also disconnect and then attempt to reconnect. Within a few seconds there can be hundreds of thousands of region reconnection requests, making it easy to hit this error again.

The main idea of the fix is to prevent unexpected cancellation and creation of streams to the same TiKV store within a short period of time (a sketch of this rate-limiting idea follows the grep result below). After applying the fix, for a given table, only one gRPC stream is closed and created when TiKV restarts.

grep "creating new stream to store to send request" cdc.log cdc-2024-01-30T12-15-04.293.log | grep 10.2.7.146:20160 | grep tableID=146 | wc -l
1
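
A self-contained sketch of that rate-limiting idea (illustrative only; it mirrors the streamAlterInterval concept described later in this thread, but the names and structure here are made up, not the code in the fix PR):

package main

import (
    "fmt"
    "time"
)

// streamLimiter allows at most one stream creation/deletion per store
// address within a given interval, so a flapping store cannot trigger a
// tight create/cancel loop.
type streamLimiter struct {
    interval  time.Duration
    lastAlter map[string]time.Time
}

func (l *streamLimiter) allow(addr string, now time.Time) bool {
    if last, ok := l.lastAlter[addr]; ok && now.Sub(last) < l.interval {
        return false // too soon: skip altering the stream for this store
    }
    l.lastAlter[addr] = now
    return true
}

func main() {
    l := &streamLimiter{interval: time.Second, lastAlter: map[string]time.Time{}}
    now := time.Now()
    fmt.Println(l.allow("tikv-8:20160", now))                           // true
    fmt.Println(l.allow("tikv-8:20160", now.Add(200*time.Millisecond))) // false: within interval
    fmt.Println(l.allow("tikv-8:20160", now.Add(2*time.Second)))        // true
}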

@shiqinwang

@asddongmen Thanks for the explanation.

I saw a lot of warn logs like: ticdc-server-2024-01-17T23-00-22.258.log:[2024/01/17 23:00:18.527 +00:00] [WARN] [client.go:704] ["send request to stream failed"] [namespace=default] [changefeed=jobs-jobs] [tableID=189] [tableName=`jobs`.`actions`] [addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160] [storeID=52517211] [regionID=55693736] [requestID=7302837] [error=EOF]. What does error=EOF mean? I am curious why ticdc got so many failures when calling requestRegionToStore.
After removing delete-2, would ticdc still encounter such send request failures? If so, it would still be stuck in the reconnection loop, right?

Another question is whether it is safe to remove delete-2. After the stream is successfully connected, what happens if g-1 later receives some errors from the stream and doesn't do the cleanup? Which code path will clean up the stream?

@shiqinwang

My understanding of the sequence of events in your theory is:

  1. Sent a stream request s1
  2. Spawned g1
  3. Encountered an error
  4. Deleted the stream s1
  5. Sent a request to create the stream s2
  6. Spawned g2
  7. g1 deleted s2
  8. Sent a request to create the stream s3
  9. Spawned g3
  10. g2 deleted s3

Is that correct? If g3 exits before s4 is created, then the loop stops.

@asddongmen
Contributor

@asddongmen Thanks for the explanation.

I saw a lot of warn logs like: ticdc-server-2024-01-17T23-00-22.258.log:[2024/01/17 23:00:18.527 +00:00] [WARN] [client.go:704] ["send request to stream failed"] [namespace=default] [changefeed=jobs-jobs] [tableID=189] [tableName=`jobs`.`actions`] [addr=jobs-tikv-8.jobs-tikv-peer.tidb.svc:20160] [storeID=52517211] [regionID=55693736] [requestID=7302837] [error=EOF]. What does error=EOF mean? I am curious why ticdc got so many failures when calling requestRegionToStore. After removing delete-2, would ticdc still encounter such send request failures? If so, it would still be stuck in the reconnection loop, right?

Another question is whether it is safe to remove delete-2. After the stream is successfully connected, what happens if g-1 later receives some errors from the stream and doesn't do the cleanup? Which code path will clean up the stream?

  1. All regions of the same store send requests through a single gRPC stream. Therefore, if the stream encounters an error and is canceled, all upcoming regions' requestRegionToStore operations will fail.
  2. Even after removing delete-2, TiCDC will still encounter send request failures when a stream is canceled. This can happen when there is an error while receiving a message from the stream and new regions from the regionRouter need to send new requests. However, the fix PR ensures that TiCDC does not get stuck in a reconnection loop caused by repeated unexpected gRPC stream cancellations. The following strategies address this issue:
    • Adds a streamAlterInterval to ensure that a gRPC stream to the same store can only be created or deleted once within the specified streamAlterInterval period.
    • Adds a stream ID to identify a gRPC stream and prevent a new stream from being unexpectedly deleted by an old stream holder. The stream ID is checked before every deletion (see the sketch after this list).
    • Ensures that addStream and deleteStream are called in the same goroutine, preventing the loop from occurring.
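
A self-contained sketch of the stream-ID check (illustrative only; the names below are made up and do not match the fix PR): a deletion only takes effect if the caller still holds the exact stream it is trying to remove, so a stale holder cannot delete a newer stream.

package main

import (
    "fmt"
    "sync"
)

type stream struct {
    id   uint64 // unique per stream instance
    addr string
}

type streamCache struct {
    mu     sync.Mutex
    nextID uint64
    m      map[string]*stream
}

func (c *streamCache) newStream(addr string) *stream {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.nextID++
    s := &stream{id: c.nextID, addr: addr}
    c.m[addr] = s
    return s
}

// delete removes the cached stream for s.addr only if it is still the very
// same stream (same ID), so an old stream holder cannot delete a newer one.
func (c *streamCache) delete(s *stream) bool {
    c.mu.Lock()
    defer c.mu.Unlock()
    if cur, ok := c.m[s.addr]; ok && cur.id == s.id {
        delete(c.m, s.addr)
        return true
    }
    return false
}

func main() {
    cache := &streamCache{m: map[string]*stream{}}

    s1 := cache.newStream("tikv-8:20160")
    cache.delete(s1) // delete-1 succeeds

    s2 := cache.newStream("tikv-8:20160")

    // A stale holder of s1 tries to delete again (delete-2); the ID check
    // stops it from removing the newer s2.
    fmt.Println("stale delete removed something:", cache.delete(s1)) // false
    fmt.Println("s2 still cached:", cache.m[s2.addr] == s2)          // true
}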

@asddongmen
Contributor

My understanding of the sequence of events in your theory is:

  1. Sent a stream request s1
  2. Spawned g1
  3. Encountered an error
  4. Deleted the stream s1
  5. Sent a request to create the stream s2
  6. Spawned g2
  7. g1 deleted s2
  8. Sent a request to create the stream s3
  9. Spawned g3
  10. g2 deleted s3

Is that correct? If g3 exits before s4 is created, then the loop stops.

Yes. When a TiKV node has just restarted, there is a huge number of region leader transfers, which leads to a huge number of regions from the regionRouter needing to send requests. So it is easier to get stuck in a loop in such a situation.

@asddongmen
Contributor

asddongmen commented Feb 1, 2024

Another question is whether it is safe to remove delete-2. After the stream is successfully connected, what happens if g-1 later receives some errors from the stream and doesn't do the cleanup? Which code path will clean up the stream?

delete-1 at line:728 will do the cleanup.
Let's consider the following scenario (sketched in code after the list):

  • A gRPC stream (s1, addr=x) encounters an error; it is canceled in g1 but not deleted by g1.
  • A region (r1, addr=x) coming from the regionRouter in the for loop gets the dead stream (s1) from the cache with addr=x as the key and uses it to send a request. The send fails, an error is returned at line:699, and s1 is deleted from the cache at line:728 (delete-1).
  • r1 is sent back to the regionRouter and will be handled again later.
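
A compact, self-contained sketch of that flow (illustrative only; the types here are stand-ins for the real kvClient code), showing why the send-path eviction (delete-1) is enough to clean up a dead stream even without delete-2:

package main

import (
    "errors"
    "fmt"
)

// Stand-in types; not the real kvClient code.
type region struct{ addr string }

type stream struct{ dead bool }

func (s *stream) send(r region) error {
    if s.dead {
        return errors.New("EOF") // the stream was already canceled elsewhere
    }
    return nil
}

func main() {
    cache := map[string]*stream{}
    var retryQueue []region

    // s1 was canceled by its receive goroutine (g1) but NOT deleted there.
    cache["x"] = &stream{dead: true}

    // r1 arrives from the regionRouter, picks up the dead stream s1, and the
    // send fails: the send path evicts the stream (delete-1) and re-queues r1
    // so it will be retried later with a fresh stream.
    r1 := region{addr: "x"}
    if err := cache[r1.addr].send(r1); err != nil {
        delete(cache, r1.addr)              // delete-1: the only cleanup needed
        retryQueue = append(retryQueue, r1) // r1 goes back to the regionRouter
    }

    fmt.Println("stream evicted:", cache[r1.addr] == nil, "; regions to retry:", len(retryQueue))
}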

@seiya-annie

/found customer

@ti-chi-bot ti-chi-bot bot added the report/customer Customers have encountered this bug. label Jun 4, 2024