disttask: fix subtask finished immediately and mark success when encountering network partition #48660

Merged: 10 commits into pingcap:master, Nov 17, 2023

Contributor

@ywqzzy ywqzzy commented Nov 17, 2023

What problem does this PR solve?

Issue Number: close #48649 ref #46258

Problem Summary:
See the comments in the linked issue (#48649).

What is changed and how it works?

  1. Change the taskTable interface so that its methods take a ctx.
  2. When updating subtasks, check their exec_id.

Most of the changes are test edits. The main logic changes are only in task_table.go and scheduler.go.
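
The exec_id check in item 2 can be sketched as a conditional update: a write only succeeds if the subtask is still assigned to the executor that issues it. This is a minimal in-memory sketch, not the code in task_table.go; `memTable`, `UpdateSubtaskState`, `ErrSubtaskNotOwned`, and the node names are all hypothetical, and the reassignment scenario is an assumption made for illustration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// ErrSubtaskNotOwned is a hypothetical error for this sketch; it is not taken from the PR.
var ErrSubtaskNotOwned = errors.New("subtask is not owned by this executor")

// subtask is a simplified stand-in for a row in the subtask table.
type subtask struct {
	execID string
	state  string
}

// memTable is a hypothetical in-memory task table used only for illustration.
type memTable struct {
	mu       sync.Mutex
	subtasks map[int64]*subtask
}

// UpdateSubtaskState changes the state only if the subtask is still assigned
// to execID. It also refuses to run when ctx is already done.
func (t *memTable) UpdateSubtaskState(ctx context.Context, execID string, id int64, state string) error {
	if err := ctx.Err(); err != nil {
		return err
	}
	t.mu.Lock()
	defer t.mu.Unlock()
	st, ok := t.subtasks[id]
	if !ok || st.execID != execID {
		return ErrSubtaskNotOwned
	}
	st.state = state
	return nil
}

// demo assumes subtask 1 was reassigned from "tidb-1" to "tidb-2",
// then lets both nodes try to mark it as succeeded.
func demo() (staleErr, ownerErr error, finalState string) {
	tbl := &memTable{subtasks: map[int64]*subtask{
		1: {execID: "tidb-2", state: "running"},
	}}
	ctx := context.Background()
	staleErr = tbl.UpdateSubtaskState(ctx, "tidb-1", 1, "succeed")
	ownerErr = tbl.UpdateSubtaskState(ctx, "tidb-2", 1, "succeed")
	return staleErr, ownerErr, tbl.subtasks[1].state
}

func main() {
	staleErr, ownerErr, state := demo()
	fmt.Println("stale executor:", staleErr)
	fmt.Println("current owner:", ownerErr)
	fmt.Println("final state:", state)
}
```

In a SQL-backed table the same idea would typically be a condition on exec_id in the UPDATE statement, so that a node that no longer owns the subtask cannot overwrite its state.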

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No need to test
    • I checked and no code files have been changed.

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Documentation

  • Affects user behaviors
  • Contains syntax changes
  • Contains variable changes
  • Contains experimental features
  • Changes MySQL compatibility

Release note

Please refer to Release Notes Language Style Guide to write a quality release note.

None

@ti-chi-bot ti-chi-bot bot added do-not-merge/needs-linked-issue do-not-merge/needs-tests-checked release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Nov 17, 2023

tiprow bot commented Nov 17, 2023

Hi @ywqzzy. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ywqzzy ywqzzy changed the title disttask: fix subtask finished immediately and mark success when encountering network partition [WIP]disttask: fix subtask finished immediately and mark success when encountering network partition Nov 17, 2023
@ti-chi-bot ti-chi-bot bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 17, 2023

codecov bot commented Nov 17, 2023

Codecov Report

Merging #48660 (9ba6bf5) into master (657f0d9) will increase coverage by 1.5745%.
Report is 5 commits behind head on master.
The diff coverage is 50.9090%.

Additional details and impacted files
@@               Coverage Diff                @@
##             master     #48660        +/-   ##
================================================
+ Coverage   71.0824%   72.6570%   +1.5745%     
================================================
  Files          1365       1389        +24     
  Lines        404163     410899      +6736     
================================================
+ Hits         287289     298547     +11258     
+ Misses        96925      93463      -3462     
+ Partials      19949      18889      -1060     
Flag          Coverage Δ
integration   43.4927% <5.0980%> (?)
unit          71.0844% <61.7363%> (+0.0019%) ⬆️

Flags with carried forward coverage won't be shown.

Components    Coverage Δ
dumpling      53.9874% <ø> (ø)
parser        ∅ <ø> (∅)
br            48.7936% <ø> (-4.2962%) ⬇️

@@ -216,15 +218,19 @@ func (s *BaseScheduler) run(ctx context.Context, task *proto.Task) (resErr error
 		if err := s.getError(); err != nil {
 			break
 		}
 		if ctx.Err() != nil {
Contributor

runCtx is the ctx that startCancelCheck cancels.

Contributor Author

fixed
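
The point raised in this thread rests on a general property of Go contexts: cancelling a derived context does not cancel its parent, so a check on the parent can miss the cancellation. A minimal sketch, assuming runCtx is derived from ctx (the names mirror the review comment; `checkBoth` is hypothetical and the scheduler code itself is not reproduced):

```go
package main

import (
	"context"
	"fmt"
)

// checkBoth derives runCtx from ctx, cancels runCtx, and reports which of the
// two contexts observes the cancellation.
func checkBoth() (parentCancelled, runCancelled bool) {
	ctx := context.Background()
	runCtx, cancel := context.WithCancel(ctx)
	cancel() // stands in for what a cancel-check goroutine would do
	return ctx.Err() != nil, runCtx.Err() != nil
}

func main() {
	parentCancelled, runCancelled := checkBoth()
	fmt.Printf("ctx cancelled: %v, runCtx cancelled: %v\n", parentCancelled, runCancelled)
}
```

Only the derived context reports an error here, which is why a loop that should stop on this cancellation has to test runCtx rather than ctx.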

@@ -387,7 +393,7 @@ func (s *BaseScheduler) Rollback(ctx context.Context, task *proto.Task) error {

 	// We should cancel all subtasks before rolling back
 	for {
-		subtask, err := s.taskTable.GetFirstSubtaskInStates(s.id, task.ID, task.Step,
+		subtask, err := s.taskTable.GetFirstSubtaskInStates(ctx, s.id, task.ID, task.Step,
Contributor

use rollbackCtx?

Contributor Author

use rollbackCtx?

Not necessary? The ctx is only used right before rollbackCtx. But it's OK to change it.

@@ -610,27 +615,29 @@ func (s *BaseScheduler) markSubTaskCanceledOrFailed(ctx context.Context, subtask
 		err := errors.Cause(err)
 		if ctx.Err() != nil && context.Cause(ctx) == ErrCancelSubtask {
 			logutil.Logger(s.logCtx).Warn("subtask canceled", zap.Error(err))
 			s.updateSubtaskStateAndError(subtask, proto.TaskStateCanceled, nil)
 			updateCtx := util.WithInternalSourceType(context.Background(), kv.InternalDistTask)
Contributor

@D3Hunter D3Hunter Nov 17, 2023

TiDB might not be able to shut down gracefully if we block here.
A deadline might not work in all cases; maybe we can use the parent ctx. OK for now.

Contributor Author

fixed
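
The trade-off discussed in this thread can be sketched as follows: once ctx has been cancelled, a write that uses ctx fails immediately, so the final state has to be recorded with a separate context, and bounding that context with a timeout keeps the write from blocking indefinitely. This is a sketch under stated assumptions, not the code that was merged: `recordState`, `markCanceled`, and the one-second timeout are hypothetical, and `errCancelSubtask` only stands in for the `ErrCancelSubtask` cause visible in the diff.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errCancelSubtask stands in for the ErrCancelSubtask cause seen in the diff.
var errCancelSubtask = errors.New("cancel subtask")

// recordState is a hypothetical storage call that fails once its context is done.
func recordState(ctx context.Context, state string) error {
	if err := ctx.Err(); err != nil {
		return fmt.Errorf("record %q: %w", state, err)
	}
	return nil
}

// markCanceled tries the write twice for comparison: first with the cancelled
// ctx, then with a fresh context that is bounded by timeout.
func markCanceled(ctx context.Context, timeout time.Duration) (withCancelled, withFresh error) {
	if ctx.Err() == nil || !errors.Is(context.Cause(ctx), errCancelSubtask) {
		return nil, nil
	}
	withCancelled = recordState(ctx, "canceled")

	updateCtx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()
	withFresh = recordState(updateCtx, "canceled")
	return withCancelled, withFresh
}

// demoCancel cancels a context with a cause and reports whether the write
// failed with the cancelled ctx and succeeded with the fresh one.
func demoCancel() (cancelledFailed, freshSucceeded bool) {
	ctx, cancel := context.WithCancelCause(context.Background())
	cancel(errCancelSubtask)
	a, b := markCanceled(ctx, time.Second)
	return a != nil, b == nil
}

func main() {
	cancelledFailed, freshSucceeded := demoCancel()
	fmt.Println("write with cancelled ctx failed:", cancelledFailed)
	fmt.Println("write with fresh ctx succeeded:", freshSucceeded)
}
```

A context built from context.Background() is detached from the caller, so it does not observe a process-level shutdown signal; the timeout limits how long such a write can hold up shutdown, which matches the reviewer's "OK for now" caveat.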

Contributor

@D3Hunter D3Hunter left a comment

rest lgtm

@ti-chi-bot ti-chi-bot bot added the needs-1-more-lgtm Indicates a PR needs 1 more LGTM. label Nov 17, 2023
@D3Hunter
Contributor

plz fix comments before merge

@ywqzzy ywqzzy changed the title [WIP]disttask: fix subtask finished immediately and mark success when encountering network partition disttask: fix subtask finished immediately and mark success when encountering network partition Nov 17, 2023
@ti-chi-bot ti-chi-bot bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 17, 2023
Contributor

@GMHDBJD GMHDBJD left a comment

LGTM

@ti-chi-bot ti-chi-bot bot added lgtm and removed needs-1-more-lgtm Indicates a PR needs 1 more LGTM. labels Nov 17, 2023

ti-chi-bot bot commented Nov 17, 2023

[LGTM Timeline notifier]

Timeline:

  • 2023-11-17 09:12:51 UTC: ☑️ agreed by D3Hunter.
  • 2023-11-17 10:25:42 UTC: ☑️ agreed by GMHDBJD.

@D3Hunter
Contributor

ci failed

    job_test.go:612: 
        	Error Trace:	tests/realtikvtest/importintotest/job_test.go:612
        	            				src/runtime/asm_amd64.s:1650
        	Error:      	Target error should be in err chain:
        	            	expected: "context canceled"
        	            	in chain: "unexpected no source type context, if you see this error, the `RequestSourceTypeKey` is missing in your context"
        	Test:       	TestLoadRemote/TestKillBeforeFinish

@ywqzzy
Contributor Author

ywqzzy commented Nov 17, 2023

ci failed

    job_test.go:612: 
        	Error Trace:	tests/realtikvtest/importintotest/job_test.go:612
        	            				src/runtime/asm_amd64.s:1650
        	Error:      	Target error should be in err chain:
        	            	expected: "context canceled"
        	            	in chain: "unexpected no source type context, if you see this error, the `RequestSourceTypeKey` is missing in your context"
        	Test:       	TestLoadRemote/TestKillBeforeFinish

fixed
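
The error message in the quoted CI failure says that a value was missing from the context. One way that can happen, shown here as a sketch only, is building a new context from context.Background() without re-attaching a value that downstream code requires; the missing-value error then appears where the test expected "context canceled". `sourceTypeKey`, `withSourceType`, and `execInternal` are hypothetical stand-ins and are not TiDB's API.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// sourceTypeKey is a hypothetical context key standing in for the
// RequestSourceTypeKey named in the CI error message.
type sourceTypeKey struct{}

var errNoSourceType = errors.New("no source type in context")

// withSourceType attaches a source type to ctx (hypothetical helper).
func withSourceType(ctx context.Context, source string) context.Context {
	return context.WithValue(ctx, sourceTypeKey{}, source)
}

// execInternal is a hypothetical internal call that requires a source type
// and otherwise just reports the state of the context.
func execInternal(ctx context.Context) error {
	if _, ok := ctx.Value(sourceTypeKey{}).(string); !ok {
		return errNoSourceType
	}
	return ctx.Err()
}

// demoSource runs execInternal with a bare context, a tagged context, and a
// tagged context that has been cancelled.
func demoSource() (bareMissing, taggedOK, cancelledSeen bool) {
	bareMissing = errors.Is(execInternal(context.Background()), errNoSourceType)

	tagged := withSourceType(context.Background(), "disttask")
	taggedOK = execInternal(tagged) == nil

	cancelledCtx, cancel := context.WithCancel(tagged)
	cancel()
	cancelledSeen = errors.Is(execInternal(cancelledCtx), context.Canceled)
	return bareMissing, taggedOK, cancelledSeen
}

func main() {
	bareMissing, taggedOK, cancelledSeen := demoSource()
	fmt.Println("bare context rejected:", bareMissing)
	fmt.Println("tagged context accepted:", taggedOK)
	fmt.Println("cancelled tagged context reports cancellation:", cancelledSeen)
}
```

Values attached with context.WithValue are inherited by derived contexts, so a context derived from the tagged one still carries the source type and surfaces the cancellation error instead.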

@ywqzzy
Contributor Author

ywqzzy commented Nov 17, 2023

Working on the CI failure.

@ywqzzy
Contributor Author

ywqzzy commented Nov 17, 2023

/lgtm


ti-chi-bot bot commented Nov 17, 2023

@ywqzzy: you cannot LGTM your own PR.

In response to this:

/lgtm

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.


ti-chi-bot bot commented Nov 17, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: D3Hunter, GMHDBJD, ywqzzy

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@ti-chi-bot ti-chi-bot bot added the approved label Nov 17, 2023
@ti-chi-bot ti-chi-bot bot merged commit 844ba42 into pingcap:master Nov 17, 2023
15 checks passed
@ti-chi-bot
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #48688.

ti-chi-bot pushed a commit to ti-chi-bot/tidb that referenced this pull request Nov 17, 2023
ti-chi-bot bot pushed a commit that referenced this pull request Nov 17, 2023
…untering network partition (#48660) (#48688)

ref #46258, close pingcap/tidb#48649
Labels
approved lgtm needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. release-note-none Denotes a PR that doesn't merit a release note. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
4 participants