Reduce the number of regions getting paused scheduling during lightning physical backend import #51143
Comments
The problem is that the new regions are empty and will be merged afterwards.
If lightning uses tidb/br/pkg/lightning/importer/import.go Lines 1577 to 1584 in fa340f3
then the new pause logic should not block evicting leaders. Please be aware that this is global-level pausing and thus affects all regions. Similarly, we can develop PD to add a new kind of key-range pausing that allows evicting leaders.
Thanks for the quick response @lance6716!
Let's say I have 4 regions and 1 worker thread. Instead of splitting 4 regions and stopping the scheduler for all of them, I split and stop the scheduler for 1 region at a time. Once I finish importing one region, I split and stop the scheduler for another. Will that work? I wouldn't need to worry about empty regions getting merged, right?
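The one-region-at-a-time scheme proposed above can be sketched as follows. This is a simulation only: the pause/import/resume calls are stand-ins for the real lightning and PD calls, and the point is just that at most one region's scheduling is ever paused.

```go
package main

import "fmt"

// region is a placeholder for one split key range.
type region struct{ start, end string }

// importSequentially processes one region at a time: pause scheduling
// for only that region's key range, import it, then resume. It returns
// the maximum number of regions whose scheduling was paused at once.
func importSequentially(regions []region) int {
	paused, maxPaused := 0, 0
	for range regions {
		paused++ // pausing scheduling for this region's range would go here
		if paused > maxPaused {
			maxPaused = paused
		}
		// importing this region's data would go here
		paused-- // resuming scheduling for this region's range
	}
	return maxPaused
}

func main() {
	regions := []region{{"a", "b"}, {"b", "c"}, {"c", "d"}, {"d", "e"}}
	fmt.Println(importSequentially(regions)) // 1
}
```

With 4 regions and 1 worker, only 1 region is paused at any moment, instead of all 4.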
Could you explain more about this? From the code it looks like it will still pause schedulers, so why would it not block evicting leaders?
It sounds OK generally, and we need to check the details later. But the main drawback of this design is that we wouldn't dare to cherry-pick such a large code change to release branches like v6.5. I suggest we choose another workaround.
Lines 681 to 701 in 1fc92b3
And the "global" option will remove schedulers: Lines 89 to 100 in 1fc92b3
So they have different behaviours; you can ask in the PD repo to learn more details.
Hi @lance6716, going back to my first proposal, we have another idea: since PD will only merge a newly split region after split-merge-interval, I could move only PauseSchedulersByKeyRange to the worker thread and not move SplitAndScatterRegionInBatches. I need to make sure the import time is less than split-merge-interval so the regions will not be merged before the import. What do you think?
The above is a short-term solution that we want to ship in 6.5. For the long-term solution, we think it may need a PD change, which we can follow up on later. Another question: ideally, do we need to stop leader eviction during a lightning import at all? Maybe we just need to stop the region-merge scheduler?
LGTM, but we still need tests to see if the behaviour is as expected. We can add an option like the one mentioned above. And for the long-term solution you can open another issue in the PD repo.
We can allow evicting leaders when using that option.
Hi @lance6716, I raised a draft PR for the short-term solution of adding such an option.
Feature Request
When using the lightning physical backend to import data, it pauses the scheduling of all regions of the target table. For example, if a table has 1024 regions, the scheduling of all of them will be paused until the import finishes, even though only 4 lightning threads are importing at the same time.
What I think could be improved is that the number of regions whose scheduling is paused should equal the number of import threads, which is 4 in the above example, instead of all 1024 regions. That may require moving the calls to PauseSchedulersByKeyRange and SplitAndScatterRegionInBatches into the workers inside writeAndIngestByRanges.
We want this feature to land in lightning 6.5. Please let me know if you have any concerns about this.
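The proposal with multiple workers can be sketched as a simulation: each worker pauses scheduling only for the range it is currently ingesting, so the number of paused ranges can never exceed the worker count. The worker pool and counters below are illustrative stand-ins, not lightning's actual code.

```go
package main

import (
	"fmt"
	"sync"
)

// keyRange stands in for one split region's key range.
type keyRange struct{ start, end string }

// importWithWorkers runs the proposed scheme and returns the maximum
// number of key ranges whose scheduling was paused at the same time.
func importWithWorkers(ranges []keyRange, workers int) int {
	var mu sync.Mutex
	paused, maxPaused := 0, 0
	jobs := make(chan keyRange)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for range jobs {
				mu.Lock()
				paused++ // PauseSchedulersByKeyRange for this range would go here
				if paused > maxPaused {
					maxPaused = paused
				}
				mu.Unlock()
				// write-and-ingest for this range would go here
				mu.Lock()
				paused-- // resume scheduling for this range
				mu.Unlock()
			}
		}()
	}
	for _, r := range ranges {
		jobs <- r
	}
	close(jobs)
	wg.Wait()
	return maxPaused
}

func main() {
	ranges := make([]keyRange, 1024)
	fmt.Println(importWithWorkers(ranges, 4) <= 4) // true
}
```

With 1024 regions and 4 workers, at most 4 ranges are paused at any moment, which is the bound this feature request asks for.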
Is your feature request related to a problem? Please describe:
The reason we want to reduce the number of regions with paused scheduling is that our TiDB cluster runs on a k8s cluster. In the k8s cluster we have a node-drainer process that periodically drains a k8s node and replaces it with a new one. During the draining, it tries to kill the TiKV pod and schedule it onto another node. But if scheduling is paused for some regions on that TiKV, TiKV cannot evict their leaders, which blocks the pod from being rescheduled to another node. If lightning takes too long to import data, this whole process is blocked for a long time. So we hope to reduce the number of regions with paused scheduling in order to reduce the time the node drainer is blocked.