
restore: split & scatter regions concurrently #1363

Closed
wants to merge 27 commits

Conversation

YuJuncen
Collaborator

@YuJuncen YuJuncen commented Jul 19, 2021

What problem does this PR solve?

Previously, when restoring many small tables, the batcher would often send small batches due to its so-called AutoCommit feature. This keeps the split & scatter & restore workers more active.

But frequently sending small batches isn't free. The split step is costly and I/O-bound even for small batches: for example, splitting 60 ranges takes about 3s, while restoring those ranges typically takes only 1s. The restore worker is therefore idle most of the time, and the overall restore slows down.

[Image: BR pipeline-2]

What is changed and how it works?

Instead of using a single split worker, this PR allows multiple restore batches to be split concurrently.
We added two hidden flags, --batch-flush-interval and --pd-concurrency: the former for tuning the batcher's behavior, the latter for tweaking the concurrent split.
Also, more logs were added so that table creation speed and the download and ingest time costs can be observed in the logs.
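
For reference, a minimal sketch of how two such hidden flags could be registered with spf13/pflag (the defaults and descriptions below are illustrative guesses, not the PR's actual values):

package restoreflags

import (
	"time"

	"github.com/spf13/pflag"
)

// defineRestoreFlags registers the two tuning knobs and hides them from help output.
func defineRestoreFlags(flags *pflag.FlagSet) {
	flags.Duration("batch-flush-interval", 16*time.Second,
		"how long the batcher waits before flushing a (possibly small) batch")
	flags.Uint("pd-concurrency", 4,
		"how many split & scatter requests may run against PD concurrently")
	// Hidden: intended for tuning and debugging only.
	_ = flags.MarkHidden("batch-flush-interval")
	_ = flags.MarkHidden("pd-concurrency")
}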

Check List

Tests

  • Integration test
  • Manual test (add detailed scripts or steps below)
    An internal test shows that, in a 190 GB workload with 6000 tables, this PR speeds up restoration: the original version takes over 2 hours to restore, while this version takes about 30 minutes, which is nearly equal to the time spent creating the tables (see figure below).

[Image: internal test result]

Release note

  • Restoring many small tables is now faster.

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

@YuJuncen YuJuncen changed the title restore: concurrently split restore: split & scatter regions concurrently Jul 19, 2021
@YuJuncen
Collaborator Author

/build

@ti-chi-bot ti-chi-bot added size/L and removed size/XL labels Jul 23, 2021
files := r.result.Files()
// There has been a worker in the `RestoreFiles` procedure.
// Spawning a raw goroutine won't make too many requests to TiKV.
go b.client.RestoreFilesAndThen(ctx, files, r.result.RewriteRules, b.updateCh, func(e error) {
Collaborator


How about using errgroup.WithContext?
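
For illustration, a minimal sketch of the suggested errgroup.WithContext pattern (restoreBatch and restoreAll are hypothetical stand-ins, not the PR's actual functions):

package restoresketch

import (
	"context"

	"golang.org/x/sync/errgroup"
)

// restoreBatch stands in for splitting the ranges of one batch and then
// downloading & ingesting its files.
func restoreBatch(ctx context.Context, batch int) error {
	// ... split, download, ingest ...
	return nil
}

// restoreAll runs all batches under one errgroup: the first error cancels the
// shared context, and Wait() returns that error to the caller.
func restoreAll(ctx context.Context, batches []int) error {
	eg, ectx := errgroup.WithContext(ctx)
	for _, b := range batches {
		b := b // capture the loop variable (needed before Go 1.22)
		eg.Go(func() error {
			return restoreBatch(ectx, b)
		})
	}
	return eg.Wait()
}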

updateCh glue.Progress,
) error {
ch := make(chan error)
go SplitRangesAndThen(ctx, client, ranges, rewriteRules, updateCh, func(err error) {
Collaborator


Why not

var ret error
SplitRangesAndThen(ctx, client, ranges, rewriteRules, updateCh, func(err error) {
     ret = err
})
return ret

Collaborator Author


I don't know why, but this ruins the whole procedure.

Thankfully, after using errgroup, we don't need this change anymore. 😄
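
One plausible explanation (assuming SplitRangesAndThen hands its callback to another goroutine) is that the enclosing function returns before the callback ever writes ret, so the error is silently dropped and the write races with the return. A minimal sketch of that hazard, with hypothetical names:

package splitsketch

import "time"

// splitAndThen stands in for an "AndThen"-style API that schedules the
// callback asynchronously instead of calling it inline.
func splitAndThen(next func(error)) {
	go func() {
		time.Sleep(10 * time.Millisecond) // pretend to split regions
		next(nil)
	}()
}

// splitRanges shows why capturing the error in a closure is unsafe here:
// ret is returned before the callback has had a chance to set it.
func splitRanges() error {
	var ret error
	splitAndThen(func(err error) {
		ret = err // may run after splitRanges has already returned
	})
	return ret // always nil; any real error would be lost
}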

}
wg.(*sync.WaitGroup).Wait()
Collaborator


Why not use a single global WaitGroup?

Collaborator Author


We create a WaitGroup for each table, so once a table has been fully restored (i.e. its WaitGroup is done), we can send it to checksumming immediately.
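
A minimal sketch of this per-table WaitGroup pattern (names and types are illustrative, not the PR's actual code):

package tablesketch

import "sync"

type table struct {
	name string
	wg   sync.WaitGroup
}

// restoreTables gives every table its own WaitGroup; a watcher goroutine
// forwards the table to checksumming as soon as its own batches finish,
// without waiting for any other table.
func restoreTables(tables []*table, batchesPerTable int, checksumCh chan<- *table) {
	for _, t := range tables {
		t.wg.Add(batchesPerTable)
		for i := 0; i < batchesPerTable; i++ {
			go func(t *table) {
				defer t.wg.Done()
				// ... download & ingest one batch of this table's files ...
			}(t)
		}
		go func(t *table) {
			t.wg.Wait()
			checksumCh <- t
		}(t)
	}
}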

@kennytm
Collaborator

kennytm commented Aug 25, 2021

Closing this as it has been moved to pingcap/tidb#27034.
