The timeout should consider the transfer latency especially in big cluster #5596
Closed
opened on Oct 13, 2022
Bug Report
In a big cluster, one full round of region heartbeats can take longer than 10s. For example:
the region heartbeat QPS for each store is 10,000 (1w)
there are 100,000 (10w) regions in this store, so the latency is 100,000 / 10,000 = 10s
However, some short operator steps, such as promote/demote/transfer leader, have a timeout limit of 10s, so the operator fails because the heartbeat latency exceeds that limit.
Relevant log:
[2022/10/13 00:52:23.560 +00:00] [INFO] [operator_controller.go:590] ["operator timeout"] [region-id=1820635949] [takes=1m20.047056718s] [operator="\"balance-region {mv peer: store [14057] to [14053]} (kind:region, region:1820635949(540, 1763), createAt:2022-10-13 00:51:03.513592708 +0000 UTC m=+461704.590481929, startAt:2022-10-13 00:51:03.513683079 +0000 UTC m=+461704.590572320, currentStep:1, size:16, steps:[add learner peer 1832839605 on store 14053, use joint consensus, promote learner peer 1832839605 on store 14053 to voter, demote voter peer 1820635950 on store 14057 to learner, leave joint state, promote learner peer 1832839605 on store 14053 to voter, demote voter peer 1820635950 on store 14057 to learner, remove peer on store 14057]) timeout\""] [additional-info="{\"sourceScore\":\"5538250.44\",\"targetScore\":\"5519343.11\"}"]
There are 50,000 (5w) regions in each store.
the operator step duration
the duration of region heartbeat
What did you do?
What did you expect to see?
Operators succeed.
What did you see instead?
Operators fail with a timeout.
What version of PD are you using (pd-server -V)?
master, v6.1.0