The timeout should consider the transfer latency especially in big cluster #5596

Closed
@bufferflies

Description

Bug Report

In a big cluster, a full round of region heartbeats can take longer than 10s. For example:
the region heartbeat QPS for every store: 10,000 (1w) ops
there are 100,000 (10w) regions in this store, so the latency is 100,000 / 10,000 = 10s
But some short operator steps, such as promote, demote, and transfer leader, have a timeout limit of 10s, so the operator fails once the step takes longer than 10s.
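The arithmetic above can be sketched in Go. This is only an illustration under stated assumptions, not PD's implementation: `heartbeatRoundDuration`, `stepTimeout`, and `baseStepTimeout` are hypothetical names, and adding one full heartbeat round to the fixed limit is just one possible way to make the timeout account for reporting latency.

```go
package main

import (
	"fmt"
	"time"
)

// baseStepTimeout mirrors the 10s limit for short steps described in this report.
// The constant name is hypothetical.
const baseStepTimeout = 10 * time.Second

// heartbeatRoundDuration estimates how long a store needs to report every one
// of its regions once, given the region count and the heartbeat rate in ops/s.
// A non-positive rate is treated as unknown and yields zero.
func heartbeatRoundDuration(regionCount, heartbeatsPerSecond int) time.Duration {
	if heartbeatsPerSecond <= 0 {
		return 0
	}
	return time.Duration(regionCount) * time.Second / time.Duration(heartbeatsPerSecond)
}

// stepTimeout is an assumed adjustment: the fixed limit plus one heartbeat
// round, so a step is not timed out before its result can be reported.
func stepTimeout(regionCount, heartbeatsPerSecond int) time.Duration {
	return baseStepTimeout + heartbeatRoundDuration(regionCount, heartbeatsPerSecond)
}

func main() {
	// Numbers from the example: 100,000 regions at 10,000 heartbeats per second.
	round := heartbeatRoundDuration(100000, 10000)
	fmt.Println("heartbeat round:", round)                        // 10s
	fmt.Println("fits in fixed limit:", round < baseStepTimeout)  // false
	fmt.Println("adjusted timeout:", stepTimeout(100000, 10000))  // 20s
}
```

With the example numbers, a single heartbeat round already consumes the whole 10s budget, which is why a fixed limit leaves no room for the step itself.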

Some log:

[2022/10/13 00:52:23.560 +00:00] [INFO] [operator_controller.go:590] ["operator timeout"] [region-id=1820635949] [takes=1m20.047056718s] [operator="\"balance-region {mv peer: store [14057] to [14053]} (kind:region, region:1820635949(540, 1763), createAt:2022-10-13 00:51:03.513592708 +0000 UTC m=+461704.590481929, startAt:2022-10-13 00:51:03.513683079 +0000 UTC m=+461704.590572320, currentStep:1, size:16, steps:[add learner peer 1832839605 on store 14053, use joint consensus, promote learner peer 1832839605 on store 14053 to voter, demote voter peer 1820635950 on store 14057 to learner, leave joint state, promote learner peer 1832839605 on store 14053 to voter, demote voter peer 1820635950 on store 14057 to learner, remove peer on store 14057]) timeout\""] [additional-info="{\"sourceScore\":\"5538250.44\",\"targetScore\":\"5519343.11\"}"]

There are 50,000 (5w) regions in each store.

Figure: the operator step duration (image not included)
Figure: the duration of region heartbeat (image not included)

What did you do?

What did you expect to see?

Operators succeed.

What did you see instead?

Operators fail.

What version of PD are you using (pd-server -V)?

master, v6.1.0
