
Reduce the number of regions getting paused scheduling during lightning physical backend import #51143

Open
guoshouyan opened this issue Feb 19, 2024 · 8 comments
Labels
type/feature-request Categorizes issue or PR as related to a new feature.

Comments

@guoshouyan
Contributor

Feature Request

When using the Lightning physical backend to import data, scheduling is paused for all regions of the target table. For example, if a table has 1024 regions, scheduling for all of them is paused until the import finishes, even though only 4 Lightning threads are importing at any given time.

What I think could be improved: the number of regions with paused scheduling should equal the number of import threads, which is 4 in the above example, rather than all 1024 regions. That may require moving the calls to PauseSchedulersByKeyRange and SplitAndScatterRegionInBatches into the worker inside writeAndIngestByRanges, roughly as sketched below.
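A rough sketch of the idea in Go (not Lightning's actual code; keyRange, pauseByKeyRange, and importRange are hypothetical stand-ins for the real key-range type, PauseSchedulersByKeyRange, and the write-and-ingest logic):

package importsketch

import "context"

type keyRange struct{ start, end []byte }

// pauseByKeyRange stands in for pdutil.PauseSchedulersByKeyRange: it asks PD
// to stop scheduling the given range and returns a function that resumes it.
func pauseByKeyRange(ctx context.Context, r keyRange) (resume func(), err error) {
	// ... add a "schedule=deny" label rule on [r.start, r.end) ...
	return func() { /* remove the label rule */ }, nil
}

// importRange stands in for the existing write-and-ingest step for one range.
func importRange(ctx context.Context, r keyRange) error { return nil }

// worker handles one range at a time, so at any moment only as many ranges
// have scheduling paused as there are concurrent import workers.
func worker(ctx context.Context, ranges <-chan keyRange) error {
	for r := range ranges {
		resume, err := pauseByKeyRange(ctx, r)
		if err != nil {
			return err
		}
		err = importRange(ctx, r)
		resume() // resume scheduling as soon as this range is ingested
		if err != nil {
			return err
		}
	}
	return nil
}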

We want this feature to land on Lightning 6.5. Please let me know if you have any concerns about this.

Is your feature request related to a problem? Please describe:

The reason we want to reduce the number of regions with paused scheduling is that our TiDB cluster runs on Kubernetes. In the Kubernetes cluster we have a node-drainer process that periodically drains a node and replaces it with a new one. During draining, it tries to kill the TiKV pod and reschedule it onto another node. But if some regions on that TiKV have scheduling paused, TiKV cannot evict their leaders, which blocks the pod from being rescheduled. If Lightning takes a long time to import data, this whole process is blocked for a long time. So we hope to reduce the number of regions with paused scheduling in order to reduce how long the node drainer is blocked.

Describe the feature you'd like:

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:

@guoshouyan guoshouyan added the type/feature-request Categorizes issue or PR as related to a new feature. label Feb 19, 2024
@lance6716
Contributor

lance6716 commented Feb 19, 2024

the number of regions with paused scheduling should equal the number of import threads, which is 4 in the above example

The problem is that the new regions are empty and will be merged after split-merge-interval. In your proposal this global setting would also need to be increased to at least the duration of the import task. It can be changed dynamically, but the specific value can only be decided by experience: https://docs.pingcap.com/tidb/stable/dynamic-config#modify-pd-configuration-dynamically.
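For example, per the linked doc on modifying PD configuration dynamically, raising split-merge-interval for the duration of the import and restoring it afterwards would look something like this (the 24h value is only illustrative, and 1h is PD's default):

SET CONFIG pd `schedule.split-merge-interval` = '24h';
-- after the import finishes, restore the default:
SET CONFIG pd `schedule.split-merge-interval` = '1h';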

But if some regions on that TiKV have scheduling paused, TiKV cannot evict their leaders, which blocks the pod from being rescheduled

If Lightning uses tikv-importer.pause-pd-scheduler-scope = "global", the pause-scheduling logic changes to:

if rc.cfg.TikvImporter.PausePDSchedulerScope == config.PausePDSchedulerScopeGlobal {
	logTask.Info("pause pd scheduler of global scope")
	restoreFn, err = rc.taskMgr.CheckAndPausePdSchedulers(ctx)
	if err != nil {
		return errors.Trace(err)
	}
}

And that pause logic should not block evicting leaders. Please be aware that this is global-level pausing and thus affects all regions.

Alternatively, we could extend PD to add a new kind of key-range pausing that allows evicting leaders.

@guoshouyan
Contributor Author

guoshouyan commented Feb 20, 2024

Thanks for the quick response, @lance6716!

The problem is that the new regions are empty and will be merged after split-merge-interval

Let's say I have 4 regions and 1 worker thread. Instead of splitting all 4 regions and stopping the scheduler for all of them, I split and stop the scheduler for 1 region at a time. Once I finish importing one region, I split and stop the scheduler for the next. Would that work? I wouldn't need to worry about empty regions getting merged, right?

And that pause logic should not block evicting leaders

Could you explain more about this? From the code it looks like it still pauses schedulers, so why would it not block leader eviction?

@lance6716
Contributor

Once I finish importing one region, I split and stop the scheduler for the next

It sounds OK in general, though we need to check the details later. But the main drawback of this design is that we wouldn't dare cherry-pick such a large code change to release branches like v6.5. I suggest we choose another workaround.

why would it not block leader eviction?

PauseSchedulersByKeyRange will add a "deny" rule on the key range

tidb/br/pkg/pdutil/pd.go

Lines 681 to 701 in 1fc92b3

func pauseSchedulerByKeyRangeWithTTL(
	ctx context.Context,
	pdHTTPCli pdhttp.Client,
	startKey, endKey []byte,
	ttl time.Duration,
) (<-chan struct{}, error) {
	rule := &pdhttp.LabelRule{
		ID: uuid.New().String(),
		Labels: []pdhttp.RegionLabel{{
			Key:   "schedule",
			Value: "deny",
			TTL:   ttl.String(),
		}},
		RuleType: "key-range",
		// Data should be a list of KeyRangeRule when rule type is key-range.
		// See https://github.com/tikv/pd/blob/783d060861cef37c38cbdcab9777fe95c17907fe/server/schedule/labeler/rules.go#L169.
		Data: []KeyRangeRule{{
			StartKeyHex: hex.EncodeToString(startKey),
			EndKeyHex:   hex.EncodeToString(endKey),
		}},
	}

And "global" option will remove schedulers

tidb/br/pkg/pdutil/pd.go

Lines 89 to 100 in 1fc92b3

// Schedulers represent region/leader schedulers which can impact on performance.
Schedulers = map[string]struct{}{
	"balance-leader-scheduler":     {},
	"balance-hot-region-scheduler": {},
	"balance-region-scheduler":     {},
	"shuffle-leader-scheduler":     {},
	"shuffle-region-scheduler":     {},
	"shuffle-hot-region-scheduler": {},
	"evict-slow-store-scheduler":   {},
}

So they have different behaviours; you can ask in the PD repo for more details.

@guoshouyan
Contributor Author

Hi @lance6716, going back to my first proposal, we have another idea: since PD only merges newly split regions after split-merge-interval, I could move only PauseSchedulersByKeyRange to the worker thread and leave SplitAndScatterRegionInBatches where it is. I need to make sure the import time is less than split-merge-interval so the regions are not merged before the import. What do you think? (Roughly as sketched below.)
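A condensed sketch of this variant, reusing the hypothetical stand-ins from the earlier sketch (splitAndScatterAll stands in for SplitAndScatterRegionInBatches; this is illustrative, not the actual Lightning code):

// splitAndScatterAll stands in for the existing SplitAndScatterRegionInBatches
// call, which stays where it is today (before importing starts).
func splitAndScatterAll(ctx context.Context, ranges []keyRange) error {
	// ... existing split-and-scatter logic, unchanged ...
	return nil
}

// importTable splits and scatters every range up front, then pauses scheduling
// only for the range currently being imported. All ranges must be ingested
// within split-merge-interval, or PD may merge the still-empty regions.
func importTable(ctx context.Context, ranges []keyRange) error {
	if err := splitAndScatterAll(ctx, ranges); err != nil {
		return err
	}
	for _, r := range ranges {
		resume, err := pauseByKeyRange(ctx, r) // pause only the range in action
		if err != nil {
			return err
		}
		err = importRange(ctx, r)
		resume()
		if err != nil {
			return err
		}
	}
	return nil
}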

@guoshouyan
Contributor Author

The above is a short-term solution that we want to land in 6.5. For the long-term solution we think a PD change may be needed, which we can follow up on later. Another question: do we ideally even need to stop leader eviction during a Lightning import? Maybe we only need to stop the region merge scheduler?

@lance6716
Contributor

move only PauseSchedulersByKeyRange to the worker thread and leave SplitAndScatterRegionInBatches where it is; I need to make sure the import time is less than split-merge-interval so the regions are not merged before the import

LGTM, but we still need tests to see if the behaviour is as expected. We can add an option like pause-pd-scheduler-scope = "region-in-action" or something similar. PTAL @D3Hunter

And for the long-term solution you can open another issue in the PD repo.

@rleungx
Member

rleungx commented Feb 27, 2024

The above is a short-term solution that we want to land in 6.5. For the long-term solution we think a PD change may be needed, which we can follow up on later. Another question: do we ideally even need to stop leader eviction during a Lightning import? Maybe we only need to stop the region merge scheduler?

We can later allow leader eviction while PauseSchedulersByKeyRange is in effect on the PD side. You can track the issue at tikv/pd#7853.

@guoshouyan
Contributor Author

guoshouyan commented Mar 21, 2024

move only PauseSchedulersByKeyRange to the worker thread and leave SplitAndScatterRegionInBatches where it is; I need to make sure the import time is less than split-merge-interval so the regions are not merged before the import

LGTM, but we still need tests to see if the behaviour is as expected. We can add an option like pause-pd-scheduler-scope = "region-in-action" or something similar. PTAL @D3Hunter

And for the long-term solution you can open another issue in the PD repo.

Hi @lance6716, I raised a draft PR for the short-term solution of adding an option like pause-pd-scheduler-scope. Please take a look. If it makes sense to you, I'll add more tests to it later.
