-
Notifications
You must be signed in to change notification settings - Fork 955
Add support for Atomic Slot Migration to CLI #2755
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: unstable
Are you sure you want to change the base?
Conversation
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## unstable #2755 +/- ##
============================================
- Coverage 72.48% 72.42% -0.06%
============================================
Files 128 128
Lines 70485 70710 +225
============================================
+ Hits 51088 51215 +127
- Misses 19397 19495 +98
🚀 New features to boost your workflow:
|
zuiderkwast
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM in general
src/valkey-cli.c
Outdated
| if (opts & CLUSTER_MANAGER_CMD_FLAG_USE_ATOMIC_SLOT_MIGRATION) { | ||
| /* Now that the migration is done, print all the #'s */ | ||
| printf("#"); | ||
| continue; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hehe, this is an atomic progress bar, completing atomically in one step. 😄
Would it make sense to track the progress in clusterManagerMoveSlotRangesASM and print some progress indicator based on the syncslots states or something? Maybe later? We can ignore it for now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To maintain full compatibility - the progress should show a "." for each slot that is moved.
We could do something like map the migration state (e.g. snapshot, replicating, failing over) to a percentage (e.g. 0%, 33%, 66% respectively) then multiply by the migration slot count. But it would be semantically different than before, since you might see 100 "."s, but then have the migration fail at the last step and end up with no slots migrated.
I would prefer that we go with a new text UI entirely rather than working on the previous one. It would be nice if we had a general purpose CLI progress-bar (something that looks like https://cli.r-lib.org/reference/cli_progress_bar.html#basic-usage or similar) that we could use for this and other long running ops. But yeah, I would say lets tackle this problem separately.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. I'm OK with all the dots appearing atomically.
I don't think we strictly need to preserve the exact output with these dots though. Another option is to just skip the dots.
These text UIs are not that elaborate. I assume simplicity and no dependencies were prioritized.
enjoy-binbin
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
overall LGTM.
src/valkey-cli.c
Outdated
| fflush(stdout); | ||
| sdsfree(to_print); | ||
| } | ||
| int print_dots = (opts & CLUSTER_MANAGER_OPT_VERBOSE), option_cold = (opts & CLUSTER_MANAGER_OPT_COLD), success = 1, in_progress = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does CLUSTER_MANAGER_OPT_COLD do in ASM? do we actually use cold in ASM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I thought cold was == dryrun, but it looks like it is actually just migrating the keys without moving the slots. There is no such concept for ASM, so I will make this option do nothing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
COLD is used in the valkey-cli --cluster fix mode, which can handle cases like a slot has two owners, a slot has zero owners, an aborted slot migration has left slots in multiple nodes, etc.
Maybe it's better to error out if someone tries to combine --cluster fix with --cluster-use-atomic-slot-migration?
|
@murphyjacob4 will you have some time to close on these details and get this merged? |
Yeah, apologies I lost track of this PR. Let me work on the feedback |
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
2287bef to
2339313
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good. It's very near. Only question is regarding COLD to avoid breaking some feature later when we make ASM enabled by default (in 10.0).
| while ((ln = listNext(&li)) != NULL) { | ||
| clusterManagerReshardTableItem *item = ln->value; | ||
| char *err; | ||
| if (opts & CLUSTER_MANAGER_OPT_USE_ATOMIC_SLOT_MIGRATION) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If COLD is used in this code path (by --cluster fix) it seems safer to skip ASM in this case.
| if (opts & CLUSTER_MANAGER_OPT_USE_ATOMIC_SLOT_MIGRATION) { | |
| if ((opts & CLUSTER_MANAGER_OPT_USE_ATOMIC_SLOT_MIGRATION) && | |
| !(opts & CLUSTER_MANAGER_OPT_COLD)) { |
Atomic + cold = Cold fusion. (It doesn't work.)
Adds a new option
--cluster-use-atomic-slot-migration. This will apply to both--cluster reshardand--cluster rebalancecommands.We could do some more optimizations here, but for now we batch all the slot ranges for one (source, target) pair and send them off as one
CLUSTER MIGRATESLOTSrequest. We then wait for this request to finish through pollingCLUSTER GETSLOTMIGRATIONSonce every 100ms. We parseCLUSTER GETSLOTMIGRATIONSand look for the most recent migration affecting the requested slot range, then check if it is in progress, failed, cancelled, or successful. If there is a failure or cancellation, we give this error to the user.Fixes #2504