
BR: retry for PD request error and TiKV IO error. #27787

Closed
@Little-Wallace

Description

Bug Report

I used BR to back up data from TiKV to S3, and the BR task failed with the following log:

["failed to backup"] [error="msg:\"Io(Custom { kind: Other, error: \\\"failed to put object timeout after 15mins for upload in s3 storage\\\" })\" : [BR:KV:ErrKVUnknown]unknown tikv error"] [errorVerbose="[BR:KV:ErrKVUnknown]unknown tikv error\nmsg:\"Io(Custom { kind: Other, error: \\\"failed to put object timeout after 15mins for upload in s3 storage\\\" })\"

I checked the code and found that IO::Custom is not among the errors for which the request is allowed to be retried.

https://github.com/pingcap/tidb/blob/master/br/pkg/backup/push.go#L162

But I think requests to S3 usually fail or break simply because of a timeout, since users often deploy an S3 cluster on ordinary disks instead of NVMe. So we should retry this kind of error to avoid failing the whole BR task.
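To illustrate the idea, here is a minimal sketch of how an error message could be classified as retryable. It is not the code in `push.go`; the names `retryableFragments` and `isRetryableMessage` and the list of fragments are hypothetical, chosen only for this example.

```go
package main

import (
	"fmt"
	"strings"
)

// retryableFragments is a hypothetical list of message fragments that we
// assume indicate a transient storage problem worth retrying.
var retryableFragments = []string{
	"timeout",
	"connection reset",
}

// isRetryableMessage reports whether msg contains any of the fragments
// above, compared case-insensitively.
func isRetryableMessage(msg string) bool {
	lower := strings.ToLower(msg)
	for _, frag := range retryableFragments {
		if strings.Contains(lower, frag) {
			return true
		}
	}
	return false
}

func main() {
	// Message text taken from the log in this report.
	msg := `Io(Custom { kind: Other, error: "failed to put object timeout after 15mins for upload in s3 storage" })`
	fmt.Println(isRetryableMessage(msg))                 // true
	fmt.Println(isRetryableMessage("permission denied")) // false
}
```

Matching on message text is fragile; a dedicated error code returned by TiKV would be a more robust signal if one is available.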

We also need to retry the PD request here:

https://github.com/pingcap/tidb/blob/master/br/pkg/backup/client.go#L464

1. Minimal reproduce step (Required)

2. What did you expect to see? (Required)

BR should retry the request automatically.

3. What did you see instead (Required)

4. What is your TiDB version? (Required)

TiDB v4.0.6, TiDB v4.0.13

Labels: component/br (This issue is related to BR of TiDB.), type/bug (The issue is confirmed as a bug.)
