Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow evicting leader scheduling when lightning importing #7853

Closed
rleungx opened this issue Feb 27, 2024 · 8 comments · Fixed by #8303
Closed

Allow evicting leader scheduling when lightning importing #7853

rleungx opened this issue Feb 27, 2024 · 8 comments · Fixed by #8303
Labels
type/enhancement The issue or PR belongs to an enhancement.

Comments

@rleungx
Copy link
Member

rleungx commented Feb 27, 2024

Enhancement Task

Previously, lightning used the global pause policy to avoid scheduling when importing data, which allowed evicting the leader in the process. After #4649, we support pausing scheduling a specified key range through a label rule. But it will pause all scheduling including the evict leader.
Recently, we encountered an issue: pingcap/tidb#51143. The import process will block the pod from being rescheduled to another node. In this case, pausing evicting the leader is not suitable. At least, we need to make it compatible with the previous behavior.

@okJiang
Copy link
Member

okJiang commented May 29, 2024

Hi, after I understand the context, I think that if it was just to solve this problem, it would be enough to filter out EvictLeaderScheduler when checking ScheduleDisable. Just here

for _, op := range ops {
if labelMgr := s.cluster.GetRegionLabeler(); labelMgr != nil {
region := s.cluster.GetRegion(op.RegionID())
if region == nil {
continue
}
if labelMgr.ScheduleDisabled(region) {
denySchedulersByLabelerCounter.Inc()
foundDisabled = true
break
}
}
}

But that would make the interface proprietary and weaken its generalization.

So would you consider providing more keys like schedule.evict-leader into the Rule? If the client specifies deny schedule, it suspends all schedulers, and if the client specifies deny schedule and allow schedule.evict-leader, it disables all schedules except the evict leader. In this way, we can more easily extend other behaviors. This latter approach requires more detailed spec and design documentation, more development and discussion work.

@JmPotato
Copy link
Member

Hi, after I understand the context, I think that if it was just to solve this problem, it would be enough to filter out EvictLeaderScheduler when checking ScheduleDisable. Just here

for _, op := range ops {
if labelMgr := s.cluster.GetRegionLabeler(); labelMgr != nil {
region := s.cluster.GetRegion(op.RegionID())
if region == nil {
continue
}
if labelMgr.ScheduleDisabled(region) {
denySchedulersByLabelerCounter.Inc()
foundDisabled = true
break
}
}
}

But that would make the interface proprietary and weaken its generalization.
So would you consider providing more keys like schedule.evict-leader into the Rule? If the client specifies deny schedule, it suspends all schedulers, and if the client specifies deny schedule and allow schedule.evict-leader, it disables all schedules except the evict leader. In this way, we can more easily extend other behaviors. This latter approach requires more detailed spec and design documentation, more development and discussion work.

We could do both step by step. First providing the same behavior as before to skip the evict leader scheduler, then extend the ability to stop any other scheduler.

@okJiang
Copy link
Member

okJiang commented Jun 3, 2024

Do you have any suggestions? @rleungx

@rleungx
Copy link
Member Author

rleungx commented Jun 3, 2024

I think we need to clarify the semantics of deny schedule first. #4649 creates this label for denying all scheduling including the evict leader scheduler. But apparently it doesn't meet the needs of the caller. If we are going to support such as schedule.evict-leader or the other labels, we should define a priority first to avoid conflicts among each label.

@okJiang
Copy link
Member

okJiang commented Jun 3, 2024

@rleungx You seem to prefer the second option. I completely agree with your thoughts. Let me briefly describe the prioritization issue. I try to use dot to express hierarchical relationship. A.B means that the rank of A is greater than that of B, i.e., A contains B. In this case, we prioritize A before B. For example, if a user sets deny A and allow A.B, then allow B on top of deny A, instead of denying A directly to override allow B. The case of allow A and deny A.B is similar.

The details can't be described in a few sentences, so I'd like to know how the team would like to handle this issue, and I'll write a more detailed proposal if team are inclined to implement the second one.

@okJiang
Copy link
Member

okJiang commented Jun 4, 2024

Which solution would you like? @lance6716

@lance6716
Copy link
Contributor

lance6716 commented Jun 4, 2024

and if the client specifies deny schedule and allow schedule.evict-leader

I think this one is better. Client can configure what rules it wants to be excluded from "deny all".

@okJiang
Copy link
Member

okJiang commented Jun 17, 2024

After discuss with @rleungx @easonn7 @lance6716 , we ultimately opted for the first method.

This is because after searching for all possible block situations related to the scheduler, I found that the only situation where TiKV can be taken offline using ti-operator when lightning sets the rule label.

So, the first solution completely meets our needs.

@ti-chi-bot ti-chi-bot bot closed this as completed in 26e90e9 Jun 24, 2024
ti-chi-bot pushed a commit to ti-chi-bot/pd that referenced this issue Sep 9, 2024
ref tikv#7300, close tikv#7853

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
rleungx pushed a commit to rleungx/pd that referenced this issue Sep 10, 2024
…el (tikv#8303)

ref tikv#7300, close tikv#7853

- add a real cluster test to test `skip evict-leader-scheduler when setting schedule deny label`
- add `DeleteStoreLabel` API and `DeleteScheduler` API

Signed-off-by: okJiang <819421878@qq.com>
rleungx pushed a commit to rleungx/pd that referenced this issue Sep 10, 2024
…el (tikv#8303)

ref tikv#7300, close tikv#7853

- add a real cluster test to test `skip evict-leader-scheduler when setting schedule deny label`
- add `DeleteStoreLabel` API and `DeleteScheduler` API

Signed-off-by: okJiang <819421878@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/enhancement The issue or PR belongs to an enhancement.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants