Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distinguish WaitLeader to test more scenarios #8359

Open
HuSharp opened this issue Jul 4, 2024 · 0 comments
Open

Distinguish WaitLeader to test more scenarios #8359

HuSharp opened this issue Jul 4, 2024 · 0 comments
Assignees
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@HuSharp
Copy link
Member

HuSharp commented Jul 4, 2024

Enhancement Task

We need to distinguish the definition of a differentiated WaitLeader

  • All pd recognized a leader (such as testing sync client).
  • The follower does not need to recognize the leader while the leader has been elected at the moment.

Let's explain this test's error, And here is detailed log

2024-07-02T13:54:15.0492409Z --- FAIL: TestPrepareCheckerWithTransferLeader (76.09s)
2024-07-02T13:54:15.0492734Z     region_syncer_test.go:289: 
2024-07-02T13:54:15.0493797Z         	Error Trace:	/home/runner/work/pd/pd/tests/server/region_syncer/region_syncer_test.go:289
2024-07-02T13:54:15.0494204Z         	Error:      	Not equal: 
2024-07-02T13:54:15.0494642Z         	            	expected: "pd2"
2024-07-02T13:54:15.0495194Z         	            	actual  : ""
2024-07-02T13:54:15.0495495Z         	            	
2024-07-02T13:54:15.0496104Z         	            	Diff:
2024-07-02T13:54:15.0496743Z         	            	--- Expected
2024-07-02T13:54:15.0497390Z         	            	+++ Actual
2024-07-02T13:54:15.0498020Z         	            	@@ -1 +1 @@
2024-07-02T13:54:15.0498413Z         	            	-pd2
2024-07-02T13:54:15.0498676Z         	            	+
2024-07-02T13:54:15.0499001Z         	Test:       	TestPrepareCheckerWithTransferLeader
2024-07-02T13:54:15.0499178Z FAIL

Conclusion:

The WaitLeader times out because all pd's do not all recognize the leader.

When Checkleader's etcd get is too slow(> leader lease)

  • follower was blocked, failing to recognize the current leader.
  • The old leader exits leadership as lease time is reached within the follower blocking time.
  • follower watch with revision can still fetch the leader exit event and then exit watch status.

The above event loop occurred for 50 seconds, causing the WaitLeader to time out.

Detail explanation

multiple leaders changed

Due to the slow disk, the leader changed 8 times in the 50s!

[2024/07/02 13:53:01.139 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1]
[2024/07/02 13:53:07.537 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1]

!!! resign triggered by API !!!
2024-07-02T13:54:14.8419029Z [2024/07/02 13:53:11.565 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd1] [to=]

[2024/07/02 13:53:12.651 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
[2024/07/02 13:53:18.726 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]
[2024/07/02 13:53:24.777 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]

!!! resign pd-2 ->0 !!!
[2024/07/02 13:53:30.827 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd2] [to=]
[2024/07/02 13:53:31.329 +00:00] [ERROR] [server.go:1718] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=pd2] [error="[PD:server:ErrLeaderFrequentlyChange]leader pd2 frequently changed, leader-key is [/pd/7387036149936305935/leader]"]

[2024/07/02 13:53:31.833 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1] 1

!!! resign pd-1 -> 0 !!!
[2024/07/02 13:53:37.894 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd1] [to=]
[2024/07/02 13:53:38.395 +00:00] [ERROR] [server.go:1718] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=pd1] [error="[PD:server:ErrLeaderFrequentlyChange]leader pd1 frequently changed, leader-key is [/pd/7387036149936305935/leader]"]

[2024/07/02 13:53:38.897 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2] 2
[2024/07/02 13:53:44.934 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2] 3
[2024/07/02 13:53:50.978 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2] 4

!!! resign pd-2 -> 0 !!!
[2024/07/02 13:53:57.018 +00:00] [INFO] [member.go:358] ["try to resign etcd leader to next pd-server"] [from=pd2] [to=]
[2024/07/02 13:53:57.520 +00:00] [ERROR] [server.go:1718] ["campaign PD leader meets error due to etcd error"] [campaign-leader-name=pd2] [error="[PD:server:ErrLeaderFrequentlyChange]leader pd2 frequently changed, leader-key is [/pd/7387036149936305935/leader]"]
[2024/07/02 13:53:58.021 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd1] 5

failure WaitLeader

!!! pd2 was elected leader at 13:53:12.651 !!!
[2024/07/02 13:53:12.651 +00:00] [INFO] [server.go:1738] ["campaign PD leader ok"] [campaign-leader-name=pd2]

!!! pd1 skip campaigning stops at 13:53:12.479, at which point it goes to getPersistentLeader in CheckLeader, and then it takes 6s to get the leader and start watching !!!
[2024/07/02 13:53:12.278 +00:00] [INFO] [server.go:1700] ["skip campaigning of pd leader and check later"] [server-name=pd1] [etcd-leader-id=13430743067140231612] [member-id=15634440769998151696]
[2024/07/02 13:53:12.479 +00:00] [INFO] [server.go:1700] ["skip campaigning of pd leader and check later"] [server-name=pd1] [etcd-leader-id=13430743067140231612] [member-id=15634440769998151696]

!!! CheckLeader !!!
[2024/07/02 13:53:18.684 +00:00] [WARN] [etcdutil.go:153] ["kv gets too slow"] [request-key=/pd/7387036149936305935/leader] [cost=6.004986363s] []
[2024/07/02 13:53:18.725 +00:00] [INFO] [server.go:1669] ["start to watch pd leader"] [pd-leader="name:\"pd2\" member_id:13430743067140231612 peer_urls:\"http://127.0.0.1:38109\" client_urls:\"http://127.0.0.1:40999\" "]

!!! pd1 watch found the leader changed after only 150ms, resulting in the WaitLeader not being successful !!!
[2024/07/02 13:53:18.862 +00:00] [INFO] [leadership.go:388] ["current leadership is deleted"] [revision=57] [leader-key=/pd/7387036149936305935/leader] [purpose="leader election"]
[2024/07/02 13:53:19.728 +00:00] [INFO] [server.go:1673] ["pd leader has changed, try to re-campaign a pd leader"]
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

No branches or pull requests

1 participant