This repository has been archived by the owner on Jul 24, 2024. It is now read-only.

Pause schedulers instead of Remove schedulers #549

Closed
@kennytm

Description

Feature Request

Describe your feature request related problem:

Currently BR removes the balance-* and shuffle-* schedulers (#412) to preserve a static store/leader map, which speeds up backup and restore. However, if BR crashes, the removed schedulers are not restored automatically. Leaving these schedulers off for a long time severely reduces the cluster's performance, and the cause is very hard to debug.

Describe the feature you'd like:

Since tikv/pd#1942 (3.1.0-beta.2, 4.0.0-beta), PD has supported a "pause scheduler" API:

curl -X POST \
     http://127.0.0.1:2379/pd/api/v1/schedulers/balance-hot-region-scheduler \
     --data-binary '{"delay":300000000000}'

This will temporarily disable the balance-hot-region-scheduler for 300 seconds, after which the scheduler is turned back on. We can keep the scheduler disabled as long as we issue another Pause request before the expiry time.

This API is thus similar to GC-TTL and TiKV Import Mode, where a timer is needed to keep the setting alive. With the Pause API, the cluster will eventually return to normal even if BR crashes.

("Resume" is done by submitting a delay of 0)
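The keep-alive pattern described above could be sketched as follows. This is a minimal illustration, not BR's implementation: the function names and the PD_ADDR variable are hypothetical, and the nanosecond unit of "delay" is inferred from the example above (300000000000 for 300 seconds). Only the endpoint path and the body shape come from this issue.

```shell
#!/bin/sh
# Sketch only: function names and PD_ADDR are hypothetical; the endpoint path
# and the {"delay": <nanoseconds>} body are taken from the curl example above.
PD_ADDR="${PD_ADDR:-http://127.0.0.1:2379}"

# Build the request body for a pause of $1 seconds. The delay is assumed to be
# in nanoseconds (300000000000 == 300 s). A value of 0 means "resume".
pause_payload() {
    printf '{"delay":%s}' "$(( $1 * 1000000000 ))"
}

# Pause scheduler $1 for $2 seconds.
pause_scheduler() {
    curl -sS -X POST "$PD_ADDR/pd/api/v1/schedulers/$1" \
         --data-binary "$(pause_payload "$2")"
}

# Renew the pause well before it expires; here at one third of the TTL.
# Assumes a TTL of at least 3 seconds so the interval is never 0.
renew_interval() {
    echo "$(( $1 / 3 ))"
}

# Keep scheduler $1 paused with a TTL of $2 seconds until this process exits.
# If the process crashes, no renewal is sent and the pause lapses on its own.
keep_paused() {
    while :; do
        pause_scheduler "$1" "$2" || echo "pause of $1 failed, will retry" >&2
        sleep "$(renew_interval "$2")"
    done
}
```

One way to use it would be to run `keep_paused balance-hot-region-scheduler 300 &` for the duration of the backup or restore, then kill that background job and call `pause_scheduler balance-hot-region-scheduler 0` to resume immediately. Renewing at a third of the TTL is an arbitrary choice that tolerates a couple of failed requests before the pause expires.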

Disadvantage: The dynamic config changes are not reverted by the timeout. This may lead to even more subtle bugs.

Describe alternatives you've considered:

Teachability, Documentation, Adoption, Migration Strategy:
