-
Notifications
You must be signed in to change notification settings - Fork 5.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
resource_control: publish runaway management rfc #44745
Merged
Merged
Changes from 1 commit
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Next
Next commit
publish runaway management rfc
Signed-off-by: Connor1996 <zbk602423539@gmail.com>
- Loading branch information
commit e592fa6e0b69abb203467e43afde4b235d55e752
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,110 @@ | ||
# Runaway Queries Management | ||
|
||
- Author: [Connor1996](https://github.com/Connor1996) | ||
- Tracking Issue: | ||
- <https://github.com/pingcap/tidb/issues/43691> | ||
|
||
## Motivation | ||
|
||
Run-away queries are queries that consume more resources beyond user expectation. This could be caused by improper SQL statement, suboptimal plan. | ||
Runaway query can impact overall performance if they are not managed properly. We need to manage run-away queries effectively. Long-running operations should be identified and aborted. | ||
Currently, we already have the deadline mechanism pushed down to the TiKV layer that one coprocessor request would not execute in TiKV more than 60s by default. But a runaway query may not cost too much time on one single coprocessor request, thus the deadline mechanism can't help avoid run-away queries. In the meantime, deadlines can't be too small, otherwise, normal requests can be quickly aborted. | ||
|
||
## How to identify run-away queries? | ||
|
||
Runaway queries can adversely impact overall performance if they are not managed properly. Resource manager can take action when a query exceeds more than a specified amount of elapsed time. The elapsed time indicates the time of being processed, which excludes the waiting time. | ||
Differentiating run-away queries from queries that really need to perform a full table/index scan is hard. There is no absolute rule. So we just let users define the rule to identify run-away queries. They can twist it on their own needs. The criteria are only the execution time, at least at present. Maybe add more dimension later. | ||
TiKV would send back the scan detail in coprocessor responses. If the total elapsed time of the query exceeds the threshold, then it would be recognized as a run-away query(statement). | ||
|
||
## User Interface | ||
|
||
### Create/Alter runaway rule | ||
|
||
```sql | ||
CREATE/ALTER RESOURCE GROUP rg1 | ||
RU_PER_SEC = 100000 [ PRIORITY = (HIGH|MEDIUM|LOW) ] [BURSTABLE] | ||
[ QUERY_LIMIT = ( | ||
EXEC_ELAPSED = <#>, | ||
ACTION = (DRYRUN|COOLDOWN|KILL), | ||
[ WATCH = (EXACT|SIMILAR) DURATION <#> ]) | ||
]; | ||
``` | ||
|
||
Runaway rules are stored in resource group meta. | ||
|
||
### Show runaway rule | ||
|
||
```sql | ||
mysql> SELECT * FROM information_schema.resource_groups; | ||
+---------+------------+----------+-----------+-----------------------------------------------------------+ | ||
| NAME | RU_PER_SEC | PRIORITY | BURSTABLE | QUERY_LIMIT | | ||
+---------+------------+----------+-----------+-----------------------------------------------------------+ | ||
| default | 1000000 | MEDIUM | YES | NULL | | ||
| rg1 | 2000 | HIGH | YES | EXEC_ELAPSED=10s, ACTION=KILL, WATCH=EXACT, DURATION=10m | | ||
| rg2 | 1000 | HIGH | YES | EXEC_ELAPSED=10s, ACTION=KILL | | ||
+---------+------------+----------+-----------+-----------------------------------------------------------+ | ||
``` | ||
|
||
### Log | ||
|
||
These fields are included for each record: | ||
|
||
- TIME: sample time | ||
- RULE: "EXEC_ELAPSED" | ||
- ACTION: "DRYRUN"/"COOLDOWN"/"KILL" | ||
- MATCH: "RULE": identified by criteria, "WATCH": identified by matching queries | ||
- SQL_TEXT: text of the runaway sql | ||
- SQL_DIGEST: plan digest of the runaway sql | ||
|
||
#### Option1: persist in log file | ||
|
||
Runaway query records are stored in a dedicated, auto rotated, run-away log file, quite like slow log file. And the records can be also viewed by the admin table INFORMATION_SCHEMA.RUNAWAY_QUERIES which is a mapping to the run-away log file. | ||
|
||
#### Option2: persist in kv data (chosen) | ||
|
||
Print runaway query records in tidb log file, and also persist the records in KV data by the admin table `INFORMATION_SCHEMA.RUNAWAY_QUERIES`. The logs are persisted in batch with a flush interval config `runaway_queries_history_flush_interval`. Queries on the table `INFORMATION_SCHEMA.RUNAWAY_QUERIES` would return the results combining both in-memory and on-disk data. | ||
And an owner is elected among TiDBs to help clean the history records. The oldest record would be reserved for `runaway_queries_history_max_days`. | ||
|
||
## How to handle run-away queries? | ||
|
||
As you can see from the SQL interface, the actions are of three types: | ||
|
||
- DRYRUN: | ||
Only detect and record but do nothing | ||
|
||
- COOLDOWN: | ||
Once regarded as a run-away query, the later coprocessor requests will be deprioritized by setting requests context with the lowest priority. On TiKV side, the priority of the request would override the setting of the resource group. Note, we can't deprioritize for the already executing coprocessor requests for simplicity. As coprocessor paging is enabled by default, it won't affect too much. As for batch_coprocess which doesn't support paging, it would be a problem. But it's enabled by default, so let's ignore it currently. | ||
The override priority is passed in the resource control context of requests. | ||
|
||
```protobuf | ||
message ResourceControlContext { | ||
... | ||
// Override the priority of resource group for the request. | ||
uint64 override_priority = 3; | ||
} | ||
``` | ||
|
||
- KILL: | ||
The query is aborted with error killed as runaway query. The deadline of the request should be set with the minimum between default copr request deadline and the rest duration to reach runaway threshold to abort the TiKV requests as soon as possible. | ||
|
||
## How to quarantine run-away queries? | ||
|
||
Cancelling a runaway query is helpful to prevent wasting system resources, but if that problem query is run repeatedly, it could still result in a considerable amount of wasted resources. SQL quarantine solves this problem by quarantining cancelled SQL statements, so they can't be run multiple times. | ||
|
||
If `WATCH` clause is set, we will try to match the signature of the queries that are already marked as "runaway" in the coming N seconds. | ||
|
||
- `WATCH EXACT`: match statements by exact SQL text. | ||
- `WATCH SIMILAR`: match the statements by normalized SQL. | ||
- `DURATION <#>`: When the very first SQL query is treated as "runaway", we will mark matches SQL as "runaway" in the coming N seconds. | ||
|
||
For the later quarantined query, reject with the error `quarantined plan used` and logs in `mysql.RUNAWAY_QUERIES` as well. | ||
|
||
Meanwhile, we need to provide an admin table `mysql.QURANTINE_WATCH` to let users know which watches are now active for quarantining queries. | ||
|
||
These fields are included for each record: | ||
|
||
- START_TIME: The first time hit the watch rule | ||
- END_TIME: When the quarantine watch would finish | ||
- RULE: "EXACT"/"SIMILAR" | ||
- SQL_TEXT: text of the runaway sql | ||
- SQL_DIGEST: plan digest of the runaway sql |
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current implementation is different from this description
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
updated