Skip to content

Conversation

@murphyjacob4
Copy link
Contributor

@murphyjacob4 murphyjacob4 commented Oct 20, 2025

Adds a new option --cluster-use-atomic-slot-migration. This will apply to both --cluster reshard and --cluster rebalance commands.

We could do some more optimizations here, but for now we batch all the slot ranges for one (source, target) pair and send them off as one CLUSTER MIGRATESLOTS request. We then wait for this request to finish through polling CLUSTER GETSLOTMIGRATIONS once every 100ms. We parse CLUSTER GETSLOTMIGRATIONS and look for the most recent migration affecting the requested slot range, then check if it is in progress, failed, cancelled, or successful. If there is a failure or cancellation, we give this error to the user.

Fixes #2504

@codecov
Copy link

codecov bot commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 73.07692% with 70 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.42%. Comparing base (faac14a) to head (2339313).

Files with missing lines Patch % Lines
src/valkey-cli.c 73.07% 70 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #2755      +/-   ##
============================================
- Coverage     72.48%   72.42%   -0.06%     
============================================
  Files           128      128              
  Lines         70485    70710     +225     
============================================
+ Hits          51088    51215     +127     
- Misses        19397    19495      +98     
Files with missing lines Coverage Δ
src/valkey-cli.c 56.79% <73.07%> (+0.58%) ⬆️

... and 16 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM in general

src/valkey-cli.c Outdated
Comment on lines 7645 to 7649
if (opts & CLUSTER_MANAGER_CMD_FLAG_USE_ATOMIC_SLOT_MIGRATION) {
/* Now that the migration is done, print all the #'s */
printf("#");
continue;
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hehe, this is an atomic progress bar, completing atomically in one step. 😄

Would it make sense to track the progress in clusterManagerMoveSlotRangesASM and print some progress indicator based on the syncslots states or something? Maybe later? We can ignore it for now.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To maintain full compatibility - the progress should show a "." for each slot that is moved.

We could do something like map the migration state (e.g. snapshot, replicating, failing over) to a percentage (e.g. 0%, 33%, 66% respectively) then multiply by the migration slot count. But it would be semantically different than before, since you might see 100 "."s, but then have the migration fail at the last step and end up with no slots migrated.

I would prefer that we go with a new text UI entirely rather than working on the previous one. It would be nice if we had a general purpose CLI progress-bar (something that looks like https://cli.r-lib.org/reference/cli_progress_bar.html#basic-usage or similar) that we could use for this and other long running ops. But yeah, I would say lets tackle this problem separately.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I'm OK with all the dots appearing atomically.

I don't think we strictly need to preserve the exact output with these dots though. Another option is to just skip the dots.

These text UIs are not that elaborate. I assume simplicity and no dependencies were prioritized.

Copy link
Member

@enjoy-binbin enjoy-binbin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall LGTM.

src/valkey-cli.c Outdated
fflush(stdout);
sdsfree(to_print);
}
int print_dots = (opts & CLUSTER_MANAGER_OPT_VERBOSE), option_cold = (opts & CLUSTER_MANAGER_OPT_COLD), success = 1, in_progress = 0;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does CLUSTER_MANAGER_OPT_COLD do in ASM? do we actually use cold in ASM?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought cold was == dryrun, but it looks like it is actually just migrating the keys without moving the slots. There is no such concept for ASM, so I will make this option do nothing.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

COLD is used in the valkey-cli --cluster fix mode, which can handle cases like a slot has two owners, a slot has zero owners, an aborted slot migration has left slots in multiple nodes, etc.

Maybe it's better to error out if someone tries to combine --cluster fix with --cluster-use-atomic-slot-migration?

@enjoy-binbin enjoy-binbin added the release-notes This issue should get a line item in the release notes label Nov 10, 2025
@zuiderkwast zuiderkwast added the to-be-merged Almost ready to merge label Nov 26, 2025
@zuiderkwast
Copy link
Contributor

@murphyjacob4 will you have some time to close on these details and get this merged?

@murphyjacob4
Copy link
Contributor Author

will you have some time to close on these details and get this merged?

Yeah, apologies I lost track of this PR. Let me work on the feedback

Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Signed-off-by: Jacob Murphy <jkmurphy@google.com>
Copy link
Contributor

@zuiderkwast zuiderkwast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good. It's very near. Only question is regarding COLD to avoid breaking some feature later when we make ASM enabled by default (in 10.0).

while ((ln = listNext(&li)) != NULL) {
clusterManagerReshardTableItem *item = ln->value;
char *err;
if (opts & CLUSTER_MANAGER_OPT_USE_ATOMIC_SLOT_MIGRATION) {
Copy link
Contributor

@zuiderkwast zuiderkwast Nov 27, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If COLD is used in this code path (by --cluster fix) it seems safer to skip ASM in this case.

Suggested change
if (opts & CLUSTER_MANAGER_OPT_USE_ATOMIC_SLOT_MIGRATION) {
if ((opts & CLUSTER_MANAGER_OPT_USE_ATOMIC_SLOT_MIGRATION) &&
!(opts & CLUSTER_MANAGER_OPT_COLD)) {

Atomic + cold = Cold fusion. (It doesn't work.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

release-notes This issue should get a line item in the release notes to-be-merged Almost ready to merge

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

[NEW] Use atomic slot migration in valkey-cli

3 participants