*: global runaway watch by system table and impl exector for query watch #45465

Merged

Contributor

@CabinfeverB CabinfeverB commented Jul 19, 2023

What problem does this PR solve?

Issue Number: ref #43691

Problem Summary:

What is changed and how it works?

  1. Global runaway watch by system table

Create the tables mysql.tidb_runaway_watch and mysql.tidb_runaway_watch_done to persist runaway watches and their deletions. All TiDB servers read both tables periodically to sync runaway watches across the cluster.

  2. Implement the executor for query watch
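
The periodic sync in item 1 can be pictured with a minimal sketch. It is not the PR's code: `watchRecord`, `doneRecord`, `watchCache`, and `applySync` are hypothetical names, and the two slices stand in for rows read from mysql.tidb_runaway_watch and mysql.tidb_runaway_watch_done in one sync round.

```go
package main

import "sync"

// watchRecord stands in for a row of mysql.tidb_runaway_watch (hypothetical shape).
type watchRecord struct {
	ID  int64
	Key string // for example, resource group plus SQL or plan digest
}

// doneRecord stands in for a row of mysql.tidb_runaway_watch_done (hypothetical shape).
type doneRecord struct {
	WatchID int64
}

// watchCache is one server's in-memory view of the global watch list.
type watchCache struct {
	mu    sync.Mutex
	items map[int64]watchRecord
}

func newWatchCache() *watchCache {
	return &watchCache{items: make(map[int64]watchRecord)}
}

// applySync merges one round of rows read from both tables:
// new watches are added first, then watches marked as done are removed.
func (c *watchCache) applySync(watches []watchRecord, done []doneRecord) {
	c.mu.Lock()
	defer c.mu.Unlock()
	for _, w := range watches {
		c.items[w.ID] = w
	}
	for _, d := range done {
		delete(c.items, d.WatchID)
	}
}

// has reports whether a watch with the given ID is currently cached.
func (c *watchCache) has(id int64) bool {
	c.mu.Lock()
	defer c.mu.Unlock()
	_, ok := c.items[id]
	return ok
}
```

In such a design each server would call `applySync` from a ticker loop after reading both tables, so a watch added or removed through one server becomes visible on the others after the next round.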

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
    About runaway watch sync:
    Execute query on TiDB1, and then execute query on TiDB2
    TiDB1 returns error 8253 and TiDB2 returns error 8254, and mysql.tidb_runaway_watch has the record.
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@CabinfeverB CabinfeverB requested a review from a team as a code owner July 19, 2023 07:45
@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 19, 2023

tiprow bot commented Jul 19, 2023

Hi @CabinfeverB. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@CabinfeverB CabinfeverB marked this pull request as draft July 19, 2023 07:46
@ti-chi-bot ti-chi-bot bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 19, 2023
@CabinfeverB
Contributor Author

cc @nolouch @glorv @Connor1996

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2023
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot bot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Jul 26, 2023
@CabinfeverB CabinfeverB marked this pull request as ready for review July 26, 2023 14:25
@ti-chi-bot ti-chi-bot bot removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/needs-linked-issue labels Jul 26, 2023
@CabinfeverB CabinfeverB changed the title domain: global runaway watch by system table domain: global runaway watch by system table and impl exector for query watch Jul 26, 2023
@CabinfeverB CabinfeverB changed the title domain: global runaway watch by system table and impl exector for query watch *: global runaway watch by system table and impl exector for query watch Jul 26, 2023
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
session/bootstrap.go Outdated Show resolved Hide resolved
	rm.watchList.Set(key, record, ttl)
	rm.queryLock.Unlock()
} else {
	if rm.watchList.Get(key) == nil {
Member

Should this be inside the lock?

Contributor Author

This is only the first check. We generally expect that, in most cases, a watch will not be added for the same key repeatedly.
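
The pattern the author describes, a cheap first check followed by a second check under the lock, can be sketched as below. The names (`safeList`, `manager`, `addIfAbsent`) are hypothetical, and the sketch assumes the underlying list is itself safe for concurrent reads, which is what makes the unlocked first check legal.

```go
package main

import "sync"

// safeList is a concurrency-safe stand-in for the watch list cache.
type safeList struct {
	mu    sync.RWMutex
	items map[string]string
}

func (l *safeList) get(key string) (string, bool) {
	l.mu.RLock()
	defer l.mu.RUnlock()
	v, ok := l.items[key]
	return v, ok
}

func (l *safeList) set(key, val string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.items[key] = val
}

// manager serializes insertions with queryLock, as in the quoted fragment.
type manager struct {
	queryLock sync.Mutex
	list      *safeList
}

func newManager() *manager {
	return &manager{list: &safeList{items: make(map[string]string)}}
}

// addIfAbsent inserts val under key unless the key already exists.
// It returns true only for the call that actually inserted the value.
func (m *manager) addIfAbsent(key, val string) bool {
	// First check, without queryLock: the common case is that the key exists
	// or that nobody else is adding it, so this avoids most lock traffic.
	if _, ok := m.list.get(key); ok {
		return false
	}
	m.queryLock.Lock()
	defer m.queryLock.Unlock()
	// Second check, under queryLock: another goroutine may have inserted
	// the key between the first check and acquiring the lock.
	if _, ok := m.list.get(key); ok {
		return false
	}
	m.list.set(key, val)
	return true
}
```

Without the second check, two goroutines that both pass the first check would both insert, and the later one would silently overwrite the earlier record.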

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Member

@Connor1996 Connor1996 left a comment

LGTM


ti-chi-bot bot commented Jul 31, 2023

@Connor1996: adding LGTM is restricted to approvers and reviewers in OWNERS files.

In response to this:

LGTM

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


force := false
// The manual record replaces the old record.
force = record.Source == ManualSource || record.Source == rm.serverID
Contributor

What is the priority between manually added records and watched records?

Contributor Author

Manually added records have higher priority. I will change it.
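
One way to express "manually added records have higher priority" is a small replacement rule. The sketch below is an assumption for illustration only: `manualSource`, `record`, and `shouldReplace` are hypothetical names, and the rule for two automatically detected records (newer replaces older) is a guess rather than something stated in this thread.

```go
package main

// manualSource marks records added by hand (hypothetical constant for this sketch).
const manualSource = "manual"

// record carries only the field needed to decide priority.
type record struct {
	Source string // manualSource, or the ID of the server that detected the query
}

// shouldReplace reports whether incoming may overwrite existing in the watch list.
// Assumed rule: a manual record always wins, an automatically detected record
// never overwrites a manual one, and otherwise the incoming record replaces the old one.
func shouldReplace(existing *record, incoming record) bool {
	if existing == nil {
		return true
	}
	if incoming.Source == manualSource {
		return true
	}
	return existing.Source != manualSource
}
```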

	rm.addWatchList(record, ttl, force)
}

// RemoveWatch is used to remove watch items from system table.
Contributor

Misleading comment: this function only removes records from the in-memory cache.

Contributor Author

Well, it is ambiguous.

func (rm *RunawayManager) markRunaway(resourceGroupName, originalSQL, planDigest string, action string, matchType RunawayMatchType, now *time.Time) {
	source := rm.serverID
	if len(source) > 128 {
Contributor

Why do we need to truncate here?

Contributor Author

I think the server IP in k8s will be a domain name, so it can be very long. The field length is 128; do we need to change it to blob?

Contributor

You should set a field length that is long enough to store the data

Contributor Author

OK, let's set it to 512, the same as the analyze job.

	if r.ID > 0 {
		return r.ID, nil
	}
case err := <-do.runawaySyncer.doneChan:
Contributor

What if multiple add-watch operations happen at the same time? How can you ensure you get the right message from this chan?

Contributor Author

It can't be ensured. If the record keys are different and a caller gets the wrong message, it can't get its watch and will try to get it later. If the keys are the same, refer to https://github.com/pingcap/tidb/pull/45465/files#r1277100060

Contributor

Since you send a message to notifyChan only once, if one goroutine receives another goroutine's return message and drops it, the other goroutine is blocked. This is a more severe situation.
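
The hazard raised here comes from several callers waiting on one shared channel. A common way to avoid it, shown as a sketch with hypothetical names (`request`, `syncer`, `submit`), is to give every request its own reply channel so a result can only reach the caller that asked for it. This is not necessarily what the PR adopted.

```go
package main

// result is what the background loop sends back to a caller.
type result struct {
	ID  int64
	Err error
}

// request carries a private, buffered reply channel.
type request struct {
	key   string
	reply chan result
}

// syncer owns the background loop that processes requests one at a time.
type syncer struct {
	reqs chan request
}

func newSyncer() *syncer {
	s := &syncer{reqs: make(chan request)}
	go s.loop()
	return s
}

// loop assigns an increasing ID to each request and replies on that
// request's own channel, so replies cannot be consumed by another caller.
func (s *syncer) loop() {
	var nextID int64
	for req := range s.reqs {
		nextID++
		// The reply channel has capacity 1, so this send never blocks the loop.
		req.reply <- result{ID: nextID}
	}
}

// submit sends a request and waits only for its own reply.
func (s *syncer) submit(key string) (int64, error) {
	req := request{key: key, reply: make(chan result, 1)}
	s.reqs <- req
	res := <-req.reply
	return res.ID, res.Err
}

// close stops the background loop.
func (s *syncer) close() { close(s.reqs) }
```

Because each caller reads only from the channel it created, a message can be neither stolen nor dropped by a concurrent caller, and no caller is left blocked waiting for a reply that went elsewhere.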

}

func (do *Domain) AddRunawayWatch(record *resourcegroup.QuarantineRecord) (int64, error) {
	if err := do.handleRunawayWatch(record); err != nil {
Contributor

I don't get what you want to explain, but it's still clear that the current logic cannot ensure the procedure is atomic, so it can deliver a wrong result.

	return infoschema.ErrResourceGroupNotExists.GenWithStackByArgs(record.ResourceGroupName)
}
if record.Action == rmpb.RunawayAction_NoneAction {
	if rg.RunawaySettings == nil {
Contributor

What happens if RunawaySettings is not nil when the record is added and is then changed to none? What is the action of the record then?

Contributor Author

This watch will be useless but will not be removed from the watch list. See BeforeExecutor: if there is no matching action, it does nothing.

start_time TIMESTAMP NOT NULL,
end_time TIMESTAMP NOT NULL,
watch varchar(12) NOT NULL,
start_time datetime(6) NOT NULL,
Contributor

So what is the benefit of this change? I didn't see any other table use datetime to represent time.

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
if item == nil {
	return false, 0
}
return true, item.Action
Contributor

For a manually added item without an action, the action is none; in this case you should use the setting's action instead. BTW, if we let manual records reflect changes to the query limit's action, I would expect watch records to reflect them too.

Contributor Author

You can see L461-463.
Makes sense, I will update it.

Contributor Author

done
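
The fallback discussed in this thread (a record whose action is none takes the resource group's configured action, and does nothing if no setting exists) can be sketched as follows. The type and constant names are hypothetical stand-ins, not the real rmpb definitions, and the function is an illustration of the rule rather than the code that was merged.

```go
package main

// action is a stand-in for the runaway action enum (hypothetical values).
type action int

const (
	noneAction action = iota
	cooldownAction
	killAction
)

// runawaySettings is a stand-in for a resource group's runaway settings.
type runawaySettings struct {
	Action action
}

// effectiveAction resolves which action a matched watch record should apply.
// A record with its own action uses it. A record with noneAction falls back
// to the group's current setting. The boolean is false when there is nothing
// to apply, in which case the caller should do nothing.
func effectiveAction(recordAction action, settings *runawaySettings) (action, bool) {
	if recordAction != noneAction {
		return recordAction, true
	}
	if settings == nil || settings.Action == noneAction {
		return noneAction, false
	}
	return settings.Action, true
}
```

Resolving the action at match time rather than when the record is added means a later change to the group's setting is picked up automatically, which is the behavior the reviewer asks for.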

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot bot added needs-1-more-lgtm Indicates a PR needs 1 more LGTM. approved labels Jul 31, 2023
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@@ -2773,10 +2777,16 @@ func upgradeToVer171(s Session, ver int64) {
	if ver >= version171 {
		return
	}
	mustExecute(s, "ALTER TABLE mysql.tidb_runaway_queries CHANGE COLUMN `tidb_server` `tidb_server` varchar(512)")
Contributor Author

Because of TestUpgradeVersionForResumeJob, I try to pass it by adding two upgrade functions. It may not be a good approach; I just want to pass the test first.

failpoint.Inject("FastRunawayGC", func() {
	expiredDuration = time.Second * 1
})
expiredTime := time.Now().Add(-expiredDuration)
Contributor

Does the timezone offset also impact this?

Contributor Author

Currently the GC cleanup loop only affects tidb_runaway_queries, so it does not affect correctness.

Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
Signed-off-by: Cabinfever_B <cabinfeveroier@gmail.com>
@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Aug 1, 2023

ti-chi-bot bot commented Aug 1, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: Connor1996, glorv, qw4990

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment


ti-chi-bot bot commented Aug 1, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-07-31 15:15:35.61207922 +0000 UTC m=+112019.554427751: ☑️ agreed by qw4990.
  • 2023-08-01 04:39:35.271048607 +0000 UTC m=+160259.213397137: ☑️ agreed by glorv.

@CabinfeverB
Contributor Author

/retest-required


tiprow bot commented Aug 1, 2023

@CabinfeverB: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest-required

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot merged commit c46da07 into pingcap:master Aug 1, 2023
7 of 12 checks passed
@CabinfeverB CabinfeverB mentioned this pull request Aug 24, 2023
12 tasks
Labels
approved lgtm release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
6 participants