BR: retry for PD request error and TiKV IO error. #27787
Closed
Description
Bug Report
I use BR to backup data from TiKV to S3. And BR task failed with the following log:
["failed to backup"] [error="msg:\"Io(Custom { kind: Other, error: \\\"failed to put object timeout after 15mins for upload in s3 storage\\\" })\" : [BR:KV:ErrKVUnknown]unknown tikv error"] [errorVerbose="[BR:KV:ErrKVUnknown]unknown tikv error\nmsg:\"Io(Custom { kind: Other, error: \\\"failed to put object timeout after 15mins for upload in s3 storage\\\" })\"
I check the code and find that IO::Custom
does not include in errors which are allowed to retry request.
https://github.com/pingcap/tidb/blob/master/br/pkg/backup/push.go#L162
But I think it is usually that the requests to S3 fail or break just because timeout. Because user usually deploy a S3 cluster with the normal disk instead of nvme. So we shall retry this kind of error to avoid the whole BR task fails.
We also need to retry for pd request here:
https://github.com/pingcap/tidb/blob/master/br/pkg/backup/client.go#L464
1. Minimal reproduce step (Required)
2. What did you expect to see? (Required)
BR shall retry request automatic.
3. What did you see instead (Required)
4. What is your TiDB version? (Required)
TiDB v4.0.6, TiDB v4.0.13
Activity