Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

store/tikv:Ignore error and do gc anyway #5797

Merged
merged 18 commits into from
Feb 7, 2018

Conversation

wentaoxu
Copy link
Contributor

@wentaoxu wentaoxu commented Feb 5, 2018

as title says

@wentaoxu wentaoxu added the DNM label Feb 5, 2018
@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 5, 2018

PTAL @disksing @zhangjinpeng1987

regions++
} else {
errRegions++
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is errRegions need ++ when err != nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

break
}
}
log.Infof("[gc worker] %s finish gc, safePoint: %v, regions: %v, ignored regions: %v, cost time: %s", identifier, safePoint, regions, errRegions, time.Since(startTime))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can output this line of log at regular intervals (1min for example) during gc, so that we can know the GC progress.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@@ -148,7 +148,7 @@ const (
getMaxBackoff = 20000
prewriteMaxBackoff = 20000
cleanupMaxBackoff = 20000
GcMaxBackoff = 100000
GcOneRegionMaxBackoff = 20000
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why change this value?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think gc command can definitely returns in 20 secs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will change it for one request

return nil
}

func doGCForOneRegion(ctx goctx.Context, store tikv.Storage, safePoint uint64, identifier string, startKey []byte, endKey []byte) error {
Copy link
Contributor

@zhangjinpeng87 zhangjinpeng87 Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

use loc instead of startKey and endKey. By the way, you don't need endKey indeed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will use loc, for endKey, see below

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func doGCForOneRegion(ctx goctx.Context, store tikv.Storage, safePoint uint64, identifier string, region RegionVerID) (RegionError, error)

key = loc.EndKey
if len(key) == 0 {
if len(key) == 0 || bytes.Compare(key, endKey) >= 0 {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think L596-L599 should exist.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I want the func doGCForOneRegion used to cleanup one region, if we remove this condition, the func is confused.


var key []byte
bo := tikv.NewBackoffer(tikv.GcOneRegionMaxBackoff, goctx.Background())
key := startKey
for {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the loop.

return nil
}

func doGCForOneRegion(ctx goctx.Context, store tikv.Storage, safePoint uint64, identifier string, startKey []byte, endKey []byte) error {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

func doGCForOneRegion(ctx goctx.Context, store tikv.Storage, safePoint uint64, identifier string, region RegionVerID) (RegionError, error)

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 6, 2018

return errors.Trace(err)
}
errRegions++
} else if regionErr != nil {
Copy link
Contributor

@zhangjinpeng87 zhangjinpeng87 Feb 6, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if regionErr != nil {
backoff
continue
}
if err != nil {
} else {
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is better to check err first, because if regionErr and err both exist, err should have higher priority.

ticker := time.NewTicker(gcJobLogTickInterval)
defer ticker.Stop()

bo := tikv.NewBackoffer(tikv.GcLocalRegionMaxBackoff, goctx.Background())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does LocalRegionMaxBackoff mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will remove it

gcHistogram.WithLabelValues("do_gc").Observe(time.Since(startTime).Seconds())
return nil
}

func doGCForOneRegion(store tikv.Storage, safePoint uint64, loc *tikv.KeyLocation) (*errorpb.Error, error) {
req := &tikvrpc.Request{
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems you only need RegionVerID.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

if gcResp.GetError() != nil {
return errors.Errorf("unexpected gc error: %s", gcResp.GetError())
} else {
regions++
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err != nil {
err++
}
if regionError != nil {
backoff
continue
}
if err == nil {
++}

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 6, 2018

PTAL @zhangjinpeng1987

@wentaoxu wentaoxu changed the title [DNM]Ignore error and do gc anyway Ignore error and do gc anyway Feb 6, 2018
@wentaoxu wentaoxu removed the DNM label Feb 6, 2018
@@ -515,63 +516,95 @@ func doGC(ctx goctx.Context, store tikv.Storage, safePoint uint64, identifier st
// Sleep to wait for all other tidb instances update their safepoint cache.
time.Sleep(gcSafePointCacheInterval)

log.Infof("[gc worker] %s start gc, safePoint: %v.", identifier, safePoint)
startTime := time.Now()
regions := 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/regions/successRegions
s/errRegions/failedRegions

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 6, 2018

/run-all-tests

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 6, 2018

/run-all-tests

@ngaut
Copy link
Member

ngaut commented Feb 6, 2018

Please add tests.


bo := tikv.NewBackoffer(tikv.GcOneRegionMaxBackoff, ctx)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initialize it in following for loop seems better.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is hard to do because It should not be initialized when we meet region error.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it.

gcHistogram.WithLabelValues("do_gc").Observe(time.Since(startTime).Seconds())
return nil
}

// these two errors should not return together, for more, see the func 'doGC'
func doGCForOneRegion(bo *tikv.Backoffer, store tikv.Storage, req *tikvrpc.Request, region tikv.RegionVerID) (*errorpb.Error, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems we should move the construction of req into this function?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for the completion, it should be move in here, but if we have 1M regions, this cost very large extra memory.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cost is acceptable. If i pass an request not gc, the behavior is not consistent with the function name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, I will change it

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 6, 2018

PTAL @zhangjinpeng1987

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 6, 2018

@ngaut gc is hard to add test because of package dependency, I will refine the code and make it easier after this pr.

@ngaut
Copy link
Member

ngaut commented Feb 6, 2018

Please use gofail to simulate error.

@wentaoxu wentaoxu changed the title Ignore error and do gc anyway store/tikv:Ignore error and do gc anyway Feb 6, 2018
@disksing
Copy link
Contributor

disksing commented Feb 6, 2018

LGTM.

@disksing disksing added the status/LGT1 Indicates that a PR has LGTM 1. label Feb 6, 2018
@zhangjinpeng87
Copy link
Contributor

LGTM

@zhangjinpeng87 zhangjinpeng87 added the status/LGT2 Indicates that a PR has LGTM 2. label Feb 6, 2018
log.Infof("[gc worker] %s start gc, safePoint: %v.", identifier, safePoint)
startTime := time.Now()
regions := 0
successRegions := 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to add some metrics.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 7, 2018

@ngaut ok, I will do it

@ngaut ngaut removed the status/LGT1 Indicates that a PR has LGTM 1. label Feb 7, 2018
@@ -222,7 +224,7 @@ func (w *GCWorker) leaderTick(ctx goctx.Context) error {
}

w.gcIsRunning = true
log.Infof("[gc worker] %s starts GC job, safePoint: %v", w.uuid, safePoint)
log.Infof("[gc worker] %s starts the whole job, safePoint: %v", w.uuid, safePoint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

??

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I change the log because it is too similar to the log in the func doGC.

@wentaoxu
Copy link
Contributor Author

wentaoxu commented Feb 7, 2018

PTAL @ngaut

Copy link
Member

@ngaut ngaut left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ngaut ngaut merged commit c6046b2 into pingcap:master Feb 7, 2018
wentaoxu added a commit to wentaoxu/tidb that referenced this pull request Feb 7, 2018
* ignore error when gc, and continue do gc next region
coocood pushed a commit that referenced this pull request Feb 7, 2018
* ignore error when gc, and continue do gc next region
@sre-bot sre-bot added the contribution This PR is from a community contributor. label Dec 18, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
contribution This PR is from a community contributor. status/LGT2 Indicates that a PR has LGTM 2.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants