Describe the bug
While running a 2-shard cluster with 1 replica per shard, I'm executing a scale-up action by bringing a new shard online, adding it to the cluster, and then running valkey-cli --cluster rebalance ... --cluster-use-empty-masters to evenly re-assign slots. Sometimes this process works without any issue, but on multiple occasions the cluster has been left with one or more slots stuck in either an importing or migrating state.
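For context, the scale-up uses the standard cluster tooling, roughly as below; the <...> values are placeholders for my actual addresses and node IDs:

```
# Add the new (empty) primary to the existing cluster
valkey-cli --cluster add-node <new-primary-ip>:6379 127.0.0.1:6379

# Add its replica, pointing at the new primary's node ID
valkey-cli --cluster add-node <new-replica-ip>:6379 127.0.0.1:6379 \
  --cluster-slave --cluster-master-id <new-primary-node-id>

# Rebalance, allowing slots to be assigned to the empty primary
valkey-cli --cluster rebalance 127.0.0.1:6379 --cluster-use-empty-masters
```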
Example --cluster check output after the rebalance command completed successfully:
127.0.0.1:6379 (1f47aff9...) -> 634702 keys | 5462 slots | 1 replicas.
100.64.172.55:6379 (cfe5e107...) -> 633472 keys | 5461 slots | 1 replicas.
100.64.92.212:6379 (f90d51f1...) -> 634590 keys | 5461 slots | 1 replicas.
[OK] 1902764 keys in 3 primaries.
116.14 keys per slot on average.
>>> Performing Cluster Check (using node 127.0.0.1:6379)
M: 1f47aff94857bb03d7932b01a53fddbfcb44da35 127.0.0.1:6379
slots:[0-5461] (5462 slots) master
1 additional replica(s)
S: 86c71d57486a1b3d7b26b2c71c4538c304e67432 100.64.3.150:6379
slots: (0 slots) slave
replicates 1f47aff94857bb03d7932b01a53fddbfcb44da35
M: cfe5e107fce3bca937f4c26e96fd8576b2db5c11 100.64.172.55:6379
slots:[5462-8191],[10923-13653] (5461 slots) master
1 additional replica(s)
S: 537c00be765db05a7c00b03fdbc149b2ffd728b0 100.64.64.34:6379
slots: (0 slots) slave
replicates f90d51f138c86ee0bada66afc89e2916da8ab849
M: f90d51f138c86ee0bada66afc89e2916da8ab849 100.64.92.212:6379
slots:[8192-10922],[13654-16383] (5461 slots) master
1 additional replica(s)
S: ba740172376b6be343d28079741a183d83d2fc5b 100.64.145.169:6379
slots: (0 slots) slave
replicates cfe5e107fce3bca937f4c26e96fd8576b2db5c11
[OK] All nodes agree about slots configuration.
>>> Check for open slots...
[WARNING] Node 100.64.3.150:6379 has slots in importing state 52.
[WARNING] The following slots are open: 52.
>>> Check slots coverage...
[OK] All 16384 slots covered.
In this case, I ran a cluster setslot 52 stable against the primary for replica 86c71d57486a1b3d7b26b2c71c4538c304e67432 and the open slot warning disappeared.
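For anyone else who ends up in this state, the cleanup was simply the following (the slot number and address are specific to this incident):

```
# Clear the stuck importing state for slot 52; run against the primary
# of the node that reported the open slot
valkey-cli -h 127.0.0.1 -p 6379 CLUSTER SETSLOT 52 STABLE
```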
To reproduce
This does not reproduce for me every single time, and so far I have only been able to reproduce it while the cluster is under a moderate amount of load. Testing in a local, empty cluster does not appear to yield the same failure scenario. The rough steps are:
- Get a cluster set up with 2 shards, 1 replica per shard. No persistence.
- Apply write load to the leaders and read load to the replicas
- Bring a new shard online and ensure it's added to the cluster (no slots assigned initially)
- Run a valkey-cli --cluster rebalance ... --cluster-use-empty-masters command to kick off the rebalance
- After the rebalance is complete, run a valkey-cli --cluster check 127.0.0.1 6379 command to look for open slots (see the sketch after this list)
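A minimal sketch of the last two steps, assuming the cluster is reachable at 127.0.0.1:6379 (the grep is just a quick way of spotting the warning lines in the check output):

```
# Kick off the rebalance, letting the new empty primary receive slots
valkey-cli --cluster rebalance 127.0.0.1:6379 --cluster-use-empty-masters

# After it reports completion, look for slots left in importing/migrating state
valkey-cli --cluster check 127.0.0.1 6379 | grep -E 'importing state|migrating state' \
  && echo "open slot(s) left behind"
```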
Expected behavior
The rebalance should complete successfully without leaving any slots open.
Additional information
While digging through the logs of the new primary that was brought online and used during the rebalance, I found the following, which might be a contributing factor:
1:M 06 Sep 2024 18:55:32.847 * Importing slot 52 from node f90d51f138c86ee0bada66afc89e2916da8ab849 ()
1:M 06 Sep 2024 18:55:32.869 * Starting BGSAVE for SYNC with target: replicas sockets using: normal sync
1:M 06 Sep 2024 18:55:32.870 * Background RDB transfer started by pid 47 to pipe through parent process
1:M 06 Sep 2024 18:55:32.877 * Assigning slot 52 to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef
1:M 06 Sep 2024 18:55:32.885 * Importing slot 53 from node f90d51f138c86ee0bada66afc89e2916da8ab849 ()
The logs above are from the primary node that was brought online. The replica for this primary had the following logs at the same time:
1:S 06 Sep 2024 18:55:32.837 * Slot 51 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.869 * Full resync from primary: fb06637f3494b2fe80b9395374bf28e165c8f470:10756256
1:S 06 Sep 2024 18:55:32.878 * Slot 52 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.912 * Slot 53 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.955 * Slot 54 is migrated from node f90d51f138c86ee0bada66afc89e2916da8ab849 () in shard 1d2a9e8aa5b4937c47c48a96aa294da2f77a68bd to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef.
1:S 06 Sep 2024 18:55:32.961 * PRIMARY <-> REPLICA sync: Flushing old data
1:S 06 Sep 2024 18:55:32.961 * PRIMARY <-> REPLICA sync: Loading DB in memory
1:S 06 Sep 2024 18:55:32.967 * Loading RDB produced by Valkey version 7.9.240
1:S 06 Sep 2024 18:55:32.967 * RDB age 0 seconds
1:S 06 Sep 2024 18:55:32.967 * RDB memory usage when created 29.26 Mb
1:S 06 Sep 2024 18:55:33.006 * Done loading RDB, keys loaded: 6041, keys expired: 0.
1:S 06 Sep 2024 18:55:33.006 * PRIMARY <-> REPLICA sync: Finished with success
1:S 06 Sep 2024 18:55:33.011 * Assigning slot 55 to node 1f47aff94857bb03d7932b01a53fddbfcb44da35 () in shard aa8cd1bcc72ff94c0fb1cfb0d9c597cae22786ef
Perhaps I just need to wait until the replica is fully in sync before running a rebalance command? Or is this something that Valkey should be able to handle automatically?
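If waiting is the expected answer, the workaround I have in mind is along these lines; it assumes INFO replication on the new replica still exposes master_link_status, and the addresses are placeholders:

```
# Hold off on the rebalance until the new replica reports a healthy
# link to its primary
until valkey-cli -h <new-replica-ip> -p 6379 INFO replication | grep -q 'master_link_status:up'; do
  sleep 1
done

valkey-cli --cluster rebalance 127.0.0.1:6379 --cluster-use-empty-masters
```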