Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BR restore txn report ErrRestoreInvalidRange error, 'startKey > endKey, endKey 0000000000000000f7', when restore region's endKey is "" #56228

Closed
SonglinLife opened this issue Sep 23, 2024 · 10 comments · Fixed by #56344
Labels
component/br This issue is related to BR of TiDB. severity/moderate type/bug The issue is confirmed as a bug.

Comments

@SonglinLife
Copy link
Contributor

SonglinLife commented Sep 23, 2024

Bug Report

Please answer these questions before submitting your issue. Thanks!

1. Minimal reproduce step (Required)

  1. TiKV has region endKey is ""
  2. backup txn (sucess)
  3. restore txn (failed)

2. What did you expect to see? (Required)

restore txn always success.

3. What did you see instead (Required)

restore txn failed with error, report startKey > endKey, endKey was 0000000000000000f7. Due to tikv encode rule, empty byte slice will encode as 0000000000000000f7. And also I checked my tikv cluster, it did have a region, which endKey was "".

Same Issue also seen inRestore txn kv fails and reports ErrRestoreInvalidRange #52574 , although this issue was resolved by pr 80d4dec.

but I find the br function SplitKeysAndScatter in br/pkg/restore/split/client.go
https://github.com/pingcap/tidb/blob/master/br/pkg/restore/split/client.go#L536-L566 also encode the lastKey without check the lastKey is empty slices. and then it call PaginateScanRegion, which throw the error.

I guess It was some issue like #52574

report error:

[2024/09/23 13:33:05.816 +08:00] [ERROR] [main.go:38] ["br failed"] [error="startKey > endKey, startKey: 63616665fd448403ff0800000000000000ff000000002aff0000fd, endkey: 0000000000000000f7: [BR:Common:ErrInvalidRange]invalid restore range"] [errorVerbose="[BR:Common:ErrInvalidRange]invalid restore range\nstartKey > endKey, startKey: 63616665fd448403ff0800000000000000ff000000002aff0000fd, endkey: 0000000000000000f7\ngithub.com/pingcap/tidb/br/pkg/restore/split.PaginateScanRegion\n\t/workspace/source/tidb/br/pkg/restore/split/split.go:100\ngithub.com/pingcap/tidb/br/pkg/restore/split.(*pdClient).SplitKeysAndScatter.func1\n\t/workspace/source/tidb/br/pkg/restore/split/client.go:566\ngithub.com/pingcap/tidb/br/pkg/utils.WithRetryReturnLastErr\n\t/workspace/source/tidb/br/pkg/utils/retry.go:94\ngithub.com/pingcap/tidb/br/pkg/restore/split.(*pdClient).SplitKeysAndScatter\n\t/workspace/source/tidb/br/pkg/restore/split/client.go:558\ngithub.com/pingcap/tidb/br/pkg/restore/internal/snap_split.(*RegionSplitter).executeSplitByKeys\n\t/workspace/source/tidb/br/pkg/restore/internal/snap_split/split.go:89\ngithub.com/pingcap/tidb/br/pkg/restore/internal/snap_split.(*RegionSplitter).executeSplitByRanges\n\t/workspace/source/tidb/br/pkg/restore/internal/snap_split/split.go:72\ngithub.com/pingcap/tidb/br/pkg/restore/internal/snap_split.(*RegionSplitter).ExecuteSplit\n\t/workspace/source/tidb/br/pkg/restore/internal/snap_split/split.go:48\ngithub.com/pingcap/tidb/br/pkg/restore/snap_client.(*SnapClient).SplitPoints\n\t/workspace/source/tidb/br/pkg/restore/snap_client/tikv_sender.go:352\ngithub.com/pingcap/tidb/br/pkg/task.RunRestoreTxn\n\t/workspace/source/tidb/br/pkg/task/restore_txn.go:91\nmain.runRestoreTxnCommand\n\t/workspace/source/tidb/br/cmd/br/restore.go:136\nmain.newTxnRestoreCommand.func1\n\t/workspace/source/tidb/br/cmd/br/restore.go:235\ngithub.com/spf13/cobra.(*Command).execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:983\ngithub.com/spf13/cobra.(*Command).ExecuteC\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1115\ngithub.com/spf13/cobra.(*Command).Execute\n\t/root/go/pkg/mod/github.com/spf13/cobra@v1.8.0/command.go:1039\nmain.main\n\t/workspace/source/tidb/br/cmd/br/main.go:36\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1650"] [stack="main.main\n\t/workspace/source/tidb/br/cmd/br/main.go:38\nruntime.main\n\t/usr/local/go/src/runtime/proc.go:267"]

4. What is your TiDB version? (Required)

Only Tikv and pd (v5.0.6)

br(v8.4.0-nigthly)

@lance6716
Copy link
Contributor

Only Tikv and pd (v5.0.6)

Please use LTS version

@lance6716
Copy link
Contributor

lance6716 commented Sep 23, 2024

I guess the problem is caused by

err = client.SplitPoints(ctx, getEndKeys(ranges), updateCh, false)

Here sortedSplitKeys is assigned from getEndKeys(ranges). End keys may not be a key to split region, especially it's "".

@lance6716
Copy link
Contributor

PTAL @Leavrth

@SonglinLife
Copy link
Contributor Author

Please use LTS version

thanks for your suggestion, We will upgrade as soon as we can.

Here sortedSplitKeys is assigned from getEndKeys(ranges). End keys may not be a key to split region, especially it's "".

I commented out this line and things work fine for me, it successfully restored the txn data. But I am not know the potential negative impact of removing the split code. Could there be any adverse effects?

@Leavrth
Copy link
Contributor

Leavrth commented Sep 23, 2024

I commented out this line and things work fine for me, it successfully restored the txn data. But I am not know the potential negative impact of removing the split code. Could there be any adverse effects?

This line will split regions to avoid region's data size too large. So if this line is commented, the region won't be split, and the data will be restored into one region. If the restored data size is not large, you can wait until regions are split automatically.

It's better to skip the empty EndKey in the function getEndKeys

func getEndKeys(ranges []rtree.RangeStats) [][]byte {
endKeys := make([][]byte, 0, len(ranges))
for _, rg := range ranges {
endKeys = append(endKeys, rg.EndKey)
}
return endKeys
}

func getEndKeys(ranges []rtree.RangeStats) [][]byte { 
 	endKeys := make([][]byte, 0, len(ranges)) 
 	for _, rg := range ranges { 
                 if len(rg.EndKey) == 0 {
                 	continue
                 }
 		endKeys = append(endKeys, rg.EndKey) 
 	} 
 	return endKeys 
 } 

@lance6716
Copy link
Contributor

Hi @SonglinLife do you have time to fix this problem?

@SonglinLife
Copy link
Contributor Author

Really sorry for the late reply.

Yes, I do want to resolve this bug. Does it only need to ignore the last key? Recently, I dug into the BR project and read the code, and it is hard for me to understand it.

I also see some other bugs in the BR project, like it retries backup but doesn't reset the progress bar.

loop.ProgressCallBack()

It really confused me at the beginning because I saw the backup progress bar reach 100% but it didn't stop the backup(it start a new round).

And I also struggle to figure out why the BR backup restarts rounds infinitely (starting 5 rounds). Can you give me some hints? I found in the BR code that it checks if there is an incomplete range. If none, then the main loop will stop.

if len(inCompleteRanges) == 0 {
// all range backuped
logutil.CL(ctx).Info("This round finished all backup ranges", zap.Uint64("round", round))
handleCancel()
mainCancel()
return nil

before start backup, It get range information by listdb.

dbs, err := m.ListDatabases()
if err != nil {
return nil, nil, nil, errors.Trace(err)
}
for _, dbInfo := range dbs {

But I use the TiKV and PD only, not with TiDB. So the first incomplete range is <"", "">. And the BR tree data structure will fill the incomplete range when TiKV backs up a region successfully. So if the first successful backup region is <a,b>, then the incomplete range will be <"", a> and <b, "">.

If TiKV have some gap between two adjacent regions, like <a, b> and <e,f>, the BR will think <c,e> as an incomplete region, and never stop the main loop.

I am a totally new user of TiDB, and it is really hard without your guys help. I do really want to improve the BR tools.

@lance6716
Copy link
Contributor

Yes, I do want to resolve this bug. Does it only need to ignore the last key? Recently, I dug into the BR project and read the code, and it is hard for me to understand it.

Only ignore the empty key (zero-length key). BR wants to split regions based on the ranges boundaries, but if a range use max key as end key, we can't split on it.

For rest problem, please open separate issues for them.

@SonglinLife
Copy link
Contributor Author

thinks for your reply, before I open a new issue I will read code file to understand more detail. In practice, we force br backup txn stop at round 2 and retore txn, it work fine on a prod tikv cluster. but there must be some thing unusual. yes, lets discuss in another issue.

and I also open a new request, for this issue base on this discussion.

@jebter jebter added the component/br This issue is related to BR of TiDB. label Sep 27, 2024
@Leavrth
Copy link
Contributor

Leavrth commented Sep 29, 2024

Yes, I do want to resolve this bug. Does it only need to ignore the last key? Recently, I dug into the BR project and read the code, and it is hard for me to understand it.

Yes, we use each range's end key as the new regions' boundary.

I also see some other bugs in the BR project, like it retries backup but doesn't reset the progress bar.

Actually, it retry based on the result of backed up ranges. In each round, it filter out the complete ranges.

iter := loop.GlobalProgressTree.Iter()
inCompleteRanges = iter.GetIncompleteRanges()
if len(inCompleteRanges) == 0 {
// all range backuped
logutil.CL(ctx).Info("This round finished all backup ranges", zap.Uint64("round", round))
handleCancel()
mainCancel()
return nil
}

It really confused me at the beginning because I saw the backup progress bar reach 100% but it didn't stop the backup(it start a new round).

The progress bar is imprecise -- approximate number of regions.

And I also struggle to figure out why the BR backup restarts rounds infinitely (starting 5 rounds). Can you give me some hints? I found in the BR code that it checks if there is an incomplete range. If none, then the main loop will stop.

Some incomplete ranges are still not backed up. Maybe there is a lock or other reasons.

If TiKV have some gap between two adjacent regions, like <a, b> and <e,f>, the BR will think <c,e> as an incomplete region, and never stop the main loop.

The region can be generated by split or merge command:

  1. split command: the region [a, f) is split by the split key e. So the new regions are the region [a, e) and the region [e, f).
  2. merge command: the region [a, e) and the region [e, f) is merged into the region [a, f).

Therefore, there must be the region [b, e)(or its sub regions such as [b, c) and [c, e]) between the region [a, b) and the region [e, f).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/br This issue is related to BR of TiDB. severity/moderate type/bug The issue is confirmed as a bug.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants