ddl: eliminate ingest step for add index with local engine (#47982) #48099

Merged


ti-chi-bot
Member

This is an automated cherry-pick of #47982

What problem does this PR solve?

Issue Number: close #47981

Problem Summary:

Previously, when tidb_enable_dist_task was enabled, the ingest step (step 3) was separate from the read-index step (step 1). As a result, if a TiDB instance crashed during the ingest step, the index data on its local disk was lost. Because the disttask framework does not support moving a step backward (for example, from step 3 back to step 1), the lost index data was never re-scanned, which led to data inconsistency.
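
The failure mode described above can be illustrated with a minimal Go sketch. It is a toy model and not TiDB code: the `node` type, `crash`, and `oldFlow` are hypothetical names introduced only for illustration, and the "steps" are reduced to plain slice operations.

```go
package main

import "fmt"

// node is a hypothetical TiDB instance that keeps scanned index data on local disk.
type node struct {
	localData []int // index entries produced by the read-index step
}

// crash simulates a TiDB restart: the locally stored index data is gone.
func (n *node) crash() { n.localData = nil }

// oldFlow models the pre-fix pipeline (an assumption-based simplification):
// the read-index step finishes for all rows, then a separate ingest step
// uploads whatever is still on local disk. Steps only move forward, so a
// crash during ingest never triggers a re-scan.
func oldFlow(rows []int, crashDuringIngest bool) []int {
	n := &node{}

	// Read-index step: scan rows; subtasks are considered done at this point.
	n.localData = append(n.localData, rows...)

	// Ingest step: a crash here wipes the local data before it is uploaded.
	if crashDuringIngest {
		n.crash()
	}
	var ingested []int
	ingested = append(ingested, n.localData...)
	return ingested
}

func main() {
	rows := []int{1, 2, 3}
	fmt.Println(len(oldFlow(rows, false))) // every row has an index entry
	fmt.Println(len(oldFlow(rows, true)))  // index entries are missing
}
```

In the crash case the table rows still exist but their index entries do not, which is the kind of mismatch an index consistency check reports.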

What is changed and how it works?

Merge the ingest step into the read-index step by calling Flush each time a subtask finishes.

Thus, a subtask is not marked as succeeded if its ingest fails. When the lease expires, replaceDeadNodesIfAny redistributes these still-running subtasks to another TiDB instance.
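
A minimal Go sketch of this idea follows. It is not the code in this PR: `localEngine`, `flush`, `runSubtask`, and the state constants are hypothetical stand-ins, and the retry in `main` only imitates a subtask being picked up again after redistribution.

```go
package main

import (
	"errors"
	"fmt"
)

// subtaskState is a hypothetical stand-in for the framework's subtask status.
type subtaskState string

const (
	stateRunning   subtaskState = "running"
	stateSucceeded subtaskState = "succeeded"
)

var errIngest = errors.New("ingest failed")

// localEngine is a hypothetical model of index data buffered on local disk.
type localEngine struct {
	buffered []int // scanned index entries not yet ingested
	ingested []int // entries written to the storage layer
	failNext bool  // simulates a crash or ingest failure on the next flush
}

// flush ingests everything buffered locally. On failure the local data is
// dropped, as it would be after a crash, and nothing is ingested.
func (e *localEngine) flush() error {
	if e.failNext {
		e.failNext = false
		e.buffered = nil
		return errIngest
	}
	e.ingested = append(e.ingested, e.buffered...)
	e.buffered = nil
	return nil
}

// runSubtask scans its rows and flushes before reporting success, so a
// subtask can only reach stateSucceeded once its data has been ingested.
func runSubtask(e *localEngine, rows []int) (subtaskState, error) {
	e.buffered = append(e.buffered, rows...)
	if err := e.flush(); err != nil {
		return stateRunning, err
	}
	return stateSucceeded, nil
}

func main() {
	e := &localEngine{failNext: true}
	rows := []int{1, 2, 3}

	state, err := runSubtask(e, rows)
	fmt.Println(state, err)

	// The subtask never left the running state, so it is executed again
	// (in the real system, possibly on another TiDB instance) and re-scans its rows.
	state, err = runSubtask(e, rows)
	fmt.Println(state, err, e.ingested)
}
```

Because the scan and the ingest now succeed or fail together within one subtask, a retry always re-scans the rows whose index data was lost.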

Check List

Tests

  • Unit test
  • Integration test

Before:

Running Suite: ddl Suite
========================
Random Seed: 1698295051
Will run 1 of 4 specs

[2023/10/26 12:37:37.494 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:40.736 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:43.988 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:47.244 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:50.520 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:53.767 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:37:57.027 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:00.713 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:04.186 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:38:07.424 +08:00] [INFO] [disttask_test.go:103] ["log found"] [log="[\"[2023/10/26 12:38:03.578 +08:00] [Info] [backend.go:364] [\\\"import start\\\"] [engineTag=<import-and-reset>] [engineUUID=818fb05b-4542-5beb-907a-21cb49c99be8] [retryCnt=0]\\n\"]"]
[2023/10/26 12:38:07.424 +08:00] [INFO] [disttask_test.go:124] ["inject fault"] [chaosParams="{\"name\":\"\",\"faultType\":\"kill\",\"selector\":\"tidb(ddl-owner)\",\"selectorPolicy\":\"\",\"faultDuration\":60000000000,\"Spec\":null,\"SelectorPeersList\":null,\"Pitr\":null,\"TiCDC\":null,\"checkConfig\":{\"balanceCheck\":null,\"raftLogLagCheck\":null,\"raftLogGcCheck\":null},\"repeatExecTimes\":0}"]
[2023/10/26 12:38:07.665 +08:00] [INFO] [db.go:103] ["ADMIN SHOW DDL"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [opts.go:34] ["Chaos opts: {map[type:kill] [map[selectorPeers:[tc-tidb-0]]] 1m0s parallelly  0s}"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [run.go:81] ["tcType: *k8s.TiDBCluster"]
[2023/10/26 12:38:07.724 +08:00] [INFO] [chaos.go:297] ["init chaos"] [selector:="{\"selectorPeers\":[\"tc-tidb-0\"]}"] ["fault type:"=kill]
[2023/10/26 12:38:07.975 +08:00] [INFO] [chaos.go:203] ["fault will last for"] [duration=1m0s]
[2023/10/26 12:38:07.975 +08:00] [INFO] [chaos.go:64] ["Run chaos"] [name=kill] [selectors="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [selectorsRetainPolicy(selectors)="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="ContainerKillExperimentSpec{Scheduler: <nil>}"]
[mysql] 2023/10/26 12:38:08 packets.go:37: unexpected EOF
[2023/10/26 12:39:08.042 +08:00] [INFO] [chaos.go:216] ["chaosDo finish since fault duration reaches"]
[2023/10/26 12:39:08.042 +08:00] [INFO] [chaos.go:88] ["Clean chaos"] [name=kill] [chaosId="ns=testbed-tangenta-test-g27gc,kind=container-kill,name=container-kill-dxiyyvan,spec=&k8s.ChaosIdentifier{Namespace:\"testbed-tangenta-test-g27gc\", Name:\"container-kill-dxiyyvan\", Spec:ContainerKillExperimentSpec{Scheduler: <nil>}}"]
STEP: Start One Test
STEP: End One Test
• Failure [113.580 seconds]
disttask-add-index
/home/tangenta/endless/pkg/util/dsl.go:29
  run add index test
  /home/tangenta/endless/testcase/ddl/disttask_test.go:57
    fail on ingest #fail_on_ingest# [It]
    /home/tangenta/endless/pkg/util/dsl.go:61

    Expected
        <*mysql.MySQLError | 0xc000a8db00>: {
            Number: 8223,
            SQLState: [72, 89, 48, 48, 48],
            Message: "data inconsistency in table: sbtest1, index: idx, handle: 428304, index-values:\"\" != record-values:\"handle: 428304, values: [KindString 92149430868-57178916270-87020426646-90156921857-46807764443-77432155857-65114616205-78384108897-94777493229-87970275195]\"",
        }
    to be nil

After:

Running Suite: ddl Suite
========================
Random Seed: 1698295332
Will run 1 of 4 specs

[2023/10/26 12:42:18.608 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:21.859 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:25.121 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:28.386 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:31.774 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:35.029 +08:00] [INFO] [disttask_test.go:107] ["no match log, keep polling..."]
[2023/10/26 12:42:38.284 +08:00] [INFO] [disttask_test.go:103] ["log found"] [log="[\"[2023/10/26 12:42:36.659 +08:00] [Info] [backend.go:364] [\\\"import start\\\"] [engineTag=<import-and-reset>] [engineUUID=462b4eef-7a5c-5d2f-b4d3-35fd1b503f75] [retryCnt=0]\\n\"]"]
[2023/10/26 12:42:38.284 +08:00] [INFO] [disttask_test.go:124] ["inject fault"] [chaosParams="{\"name\":\"\",\"faultType\":\"kill\",\"selector\":\"tidb(ddl-owner)\",\"selectorPolicy\":\"\",\"faultDuration\":60000000000,\"Spec\":null,\"SelectorPeersList\":null,\"Pitr\":null,\"TiCDC\":null,\"checkConfig\":{\"balanceCheck\":null,\"raftLogLagCheck\":null,\"raftLogGcCheck\":null},\"repeatExecTimes\":0}"]
[2023/10/26 12:42:38.532 +08:00] [INFO] [db.go:103] ["ADMIN SHOW DDL"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [opts.go:34] ["Chaos opts: {map[type:kill] [map[selectorPeers:[tc-tidb-0]]] 1m0s parallelly  0s}"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [run.go:81] ["tcType: *k8s.TiDBCluster"]
[2023/10/26 12:42:38.588 +08:00] [INFO] [chaos.go:297] ["init chaos"] [selector:="{\"selectorPeers\":[\"tc-tidb-0\"]}"] ["fault type:"=kill]
[2023/10/26 12:42:38.864 +08:00] [INFO] [chaos.go:203] ["fault will last for"] [duration=1m0s]
[2023/10/26 12:42:38.864 +08:00] [INFO] [chaos.go:64] ["Run chaos"] [name=kill] [selectors="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [selectorsRetainPolicy(selectors)="[testbed-tangenta-test-g27gc/tc-tidb-0]"] [targetSelectors="[nil]"] [targetSelectorsRetainPolicy(targetSelectors)="[nil]"] [experimentSpec="ContainerKillExperimentSpec{Scheduler: <nil>}"]
[mysql] 2023/10/26 12:42:39 packets.go:37: unexpected EOF
[2023/10/26 12:43:38.937 +08:00] [INFO] [chaos.go:216] ["chaosDo finish since fault duration reaches"]
[2023/10/26 12:43:38.937 +08:00] [INFO] [chaos.go:88] ["Clean chaos"] [name=kill] [chaosId="ns=testbed-tangenta-test-g27gc,kind=container-kill,name=container-kill-mwvzyuvu,spec=&k8s.ChaosIdentifier{Namespace:\"testbed-tangenta-test-g27gc\", Name:\"container-kill-mwvzyuvu\", Spec:ContainerKillExperimentSpec{Scheduler: <nil>}}"]
• [SLOW TEST:95.758 seconds]
disttask-add-index
/home/tangenta/endless/pkg/util/dsl.go:29
  run add index test
  /home/tangenta/endless/testcase/ddl/disttask_test.go:57
    fail on ingest #fail_on_ingest#
    /home/tangenta/endless/pkg/util/dsl.go:61
------------------------------
SSS
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot ti-chi-bot added priority/P0 The issue has P0 priority. release-note-none Denotes a PR that doesn't merit a release note. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. type/cherry-pick-for-release-7.5 This PR is cherry-picked to release-7.5 from a source PR. labels Oct 30, 2023
@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Oct 30, 2023
@ti-chi-bot ti-chi-bot added the cherry-pick-approved Cherry pick PR approved by release team. label Oct 31, 2023

ti-chi-bot bot commented Oct 31, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ywqzzy, zimulala

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added approved lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Oct 31, 2023

ti-chi-bot bot commented Oct 31, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-10-30 09:55:32.535483178 +0000 UTC m=+2860530.122593323: ☑️ agreed by ywqzzy.
  • 2023-10-31 07:43:11.067626145 +0000 UTC m=+2938988.654736275: ☑️ agreed by zimulala.

@ti-chi-bot ti-chi-bot bot merged commit 1702710 into pingcap:release-7.5 Oct 31, 2023
7 checks passed
4 participants