Regions get stuck in 2 voters, 1 down peer, 1 learner state #6559
Description
Bug Report
What did you do?
In a 3-node cluster, I replaced a broken store with a new one.
What did you expect to see?
The cluster returns to normal after the operation.
What did you see instead?
The TiKVRegionPendingPeerTooLong alarm fired.
There are 3 regions that have had the "pending-peer" problem for 2 days. They all have 4 peers: 2 healthy voters, 1 healthy learner (located in the new store 2751139), and 1 down peer (in the manually deleted store 4).
Example region info:
{
  "id": 55929554,
  "epoch": {
    "conf_ver": 6,
    "version": 109399
  },
  "peers": [
    {
      "id": 55929555,
      "store_id": 1,
      "role_name": "Voter"
    },
    {
      "id": 55929556,
      "store_id": 4,
      "role_name": "Voter"
    },
    {
      "id": 55929557,
      "store_id": 5,
      "role_name": "Voter"
    },
    {
      "id": 55929558,
      "store_id": 2751139,
      "role": 1,
      "role_name": "Learner",
      "is_learner": true
    }
  ],
  "leader": {
    "id": 55929555,
    "store_id": 1,
    "role_name": "Voter"
  },
  "down_peers": [
    {
      "down_seconds": 40307,
      "peer": {
        "id": 55929556,
        "store_id": 4,
        "role_name": "Voter"
      }
    }
  ],
  "pending_peers": [
    {
      "id": 55929556,
      "store_id": 4,
      "role_name": "Voter"
    }
  ],
  "cpu_usage": 0,
  "written_bytes": 0,
  "read_bytes": 0,
  "written_keys": 0,
  "read_keys": 0,
  "approximate_size": 1,
  "approximate_keys": 40960
}
This state is probably the result of an unfinished recovery process. Normally, PD can resolve this intermediate state automatically in one of 2 ways:
- This state violates the 3-replica rule, so PD tries to remove one replica, preferring the peer with an "unusual role" (here, the learner). However, this operation requires all other peers to be healthy, which is not the case here, so it is skipped. This can be confirmed by the PD metric "skip-remove-orphan-peer".
- This state violates the "no down peer" rule, so PD tries to remove the down peer and add a new one in 3 steps: (1) add a learner; (2) promote the learner and demote the voter through joint consensus; (3) remove the demoted learner. However, this cluster has only 3 nodes, and each of them already holds a peer of these regions, so this operation cannot proceed either. This can be confirmed by the PD metrics "replace-down" and "no-store-replace".
Because of the above constraints, these 3 regions are stuck in this state.
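To make the deadlock concrete, here is a minimal Python sketch of the two recovery paths as described above. It is a simplified model, not PD's actual code: the `Peer` class and both function names are hypothetical, and the returned strings only reuse the metric names mentioned in this report as labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peer:
    store_id: int
    role: str          # "Voter" or "Learner"
    down: bool = False


def try_remove_orphan_peer(peers, max_replicas=3):
    """Hypothetical model of path 1: drop an extra peer, learner first."""
    if len(peers) <= max_replicas:
        return "not-needed"
    learners = [p for p in peers if p.role == "Learner"]
    candidate = learners[0] if learners else peers[-1]
    others = [p for p in peers if p is not candidate]
    # Removal is only allowed when every remaining peer is healthy.
    if any(p.down for p in others):
        return "skip-remove-orphan-peer"
    return f"remove peer on store {candidate.store_id}"


def try_replace_down_peer(peers, live_stores):
    """Hypothetical model of path 2: replace the down peer on a free store."""
    if not any(p.down for p in peers):
        return "not-needed"
    used = {p.store_id for p in peers}
    free = sorted(s for s in live_stores if s not in used)
    # Every live store already holds a peer of this region.
    if not free:
        return "no-store-replace"
    return f"replace-down: add learner on store {free[0]}"


# The stuck region from this report: store 4 was deleted, 2751139 is new.
REGION = [
    Peer(1, "Voter"),
    Peer(4, "Voter", down=True),
    Peer(5, "Voter"),
    Peer(2751139, "Learner"),
]
LIVE_STORES = {1, 5, 2751139}

print(try_remove_orphan_peer(REGION))
print(try_replace_down_peer(REGION, LIVE_STORES))
```

In this model, both paths decline to act on the same region, each for a reason that only the other path could remove, which matches the behavior observed here.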
PD should be able to handle this case. For example, when it finds a region with 4 peers (2 healthy voters, 1 down peer, and 1 learner), it could promote the learner to a voter and remove the down peer.
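A rough sketch of that suggestion, written in Python for brevity. It assumes region info shaped like the JSON above (`peers`, `down_peers`, `role_name`); the function name `plan_recovery` and the step labels `promote_learner` and `remove_peer` are hypothetical and are not PD operator names.

```python
def plan_recovery(region, max_replicas=3):
    """Return the proposed steps for a stuck region, or [] if it does not match."""
    down_ids = {d["peer"]["id"] for d in region.get("down_peers", [])}
    peers = region["peers"]

    healthy_voters = [p for p in peers
                      if p["role_name"] == "Voter" and p["id"] not in down_ids]
    down_voters = [p for p in peers
                   if p["role_name"] == "Voter" and p["id"] in down_ids]
    healthy_learners = [p for p in peers
                        if p["role_name"] == "Learner" and p["id"] not in down_ids]

    # Match exactly the stuck shape: one peer too many, one voter down,
    # and one healthy learner that could take its place.
    stuck = (
        len(peers) == max_replicas + 1
        and len(healthy_voters) == max_replicas - 1
        and len(down_voters) == 1
        and len(healthy_learners) == 1
    )
    if not stuck:
        return []
    return [
        ("promote_learner", healthy_learners[0]["id"]),
        ("remove_peer", down_voters[0]["id"]),
    ]


# Trimmed copy of the example region from this report.
region = {
    "id": 55929554,
    "peers": [
        {"id": 55929555, "store_id": 1, "role_name": "Voter"},
        {"id": 55929556, "store_id": 4, "role_name": "Voter"},
        {"id": 55929557, "store_id": 5, "role_name": "Voter"},
        {"id": 55929558, "store_id": 2751139, "role_name": "Learner"},
    ],
    "down_peers": [
        {"down_seconds": 40307,
         "peer": {"id": 55929556, "store_id": 4, "role_name": "Voter"}},
    ],
}

print(plan_recovery(region))
```

Promoting before removing keeps 3 voters in the region at every step, so the 2 healthy voters never lose their majority during the change.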
What version of PD are you using (pd-server -V)?
6.5.0