Use deterministic deployment names to encode shard topology #81

Merged
bjosv merged 5 commits into valkey-io:main from ysqyang:shard-labels
Feb 16, 2026
Conversation

@ysqyang (Contributor) commented Feb 9, 2026:

Context

#55. This is a precursor to the resharding PR: #60

What changed

  • Deployments are now deterministically named (<cluster>-<N>-<M>, e.g. mycluster-0-0) instead of using GenerateName with random suffixes.
  • Each Deployment and its pod template are stamped with valkey.io/shard-index and valkey.io/node-index labels at creation time. By convention, node 0 is the initial primary and nodes 1+ are replicas.
  • addValkeyNode reads the pod's node-index label (via podRoleAndShard) to determine whether a pending node should become a primary (assign slots) or a replica (CLUSTER REPLICATE to the matching shard's primary).
  • A new pickPendingNode helper ensures primaries are processed before replicas, since CLUSTER REPLICATE requires the primary to already have slots.
  • Slot-assignment and replication logic are extracted into dedicated assignSlotsToNewPrimary and replicateToShardPrimary functions.

Why

The old code guessed node roles by counting existing shards — if fewer shards existed than desired, the next pending node became a primary; otherwise it became a replica. This heuristic was fragile: it could misassign roles when pods started out of order or when gossip hadn't yet propagated topology changes. Labels make the intended role explicit at Deployment creation time, removing all guesswork from the reconciler.

Deterministic names make it easy to correlate pods, deployments, and Valkey nodes during debugging (kubectl get pods immediately shows which shard and position each pod belongs to), and make upsertDeployments idempotent — it simply tries to create each name and ignores AlreadyExists.

Testing

[Screenshot 2026-02-13 at 9:19:04 AM]

@ysqyang force-pushed the shard-labels branch 2 times, most recently from 22275a6 to b16818e on February 9, 2026 at 18:51
Deployments and their pod templates are now stamped with
valkey.io/shard-index and valkey.io/role labels when created by
upsertDeployments. addValkeyNode reads these labels to determine
whether a pending node should become a primary (assign slots) or a
replica (CLUSTER REPLICATE to the matching shard's primary), removing
the heuristic that previously guessed roles by counting existing
shards.

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang marked this pull request as ready for review February 10, 2026 00:36
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang changed the title from "label pods with shard-index and role at creation time" to "Add shard index and node index to deployment/pod names and labels" on Feb 11, 2026
@ysqyang changed the title from "Add shard index and node index to deployment/pod names and labels" to "Use deterministic deployment names to encode shard topology" on Feb 11, 2026
// <cluster>-shard<N>-<M> e.g. "mycluster-shard0-0", "mycluster-shard1-2"
//
// By convention, node 0 is the initial primary and nodes 1, 2, … are replicas.
// The name deliberately avoids "primary"/"replica" because failover can swap
Collaborator:
I also think it's good that we avoid primary/replica in the name.
We only have to keep in mind, as stated here, that after a failure the primary can be at index 2 (of 0-2).
If the user wants to scale down replicas, either the replica at index 0 or index 1 needs to go, i.e. there can be missing indexes and the index doesn't say that much anymore.

In StatefulSets, indexes are used since it's always the last index/replica that is removed, but in Deployments, where any replica can be removed (depending on other factors), they have chosen to use a unique name/hash instead.
I hesitate a bit to use "index" instead of a random suffix, but I understand the reason for using it.

Collaborator:

> We only have to keep in mind, as stated here, that after a failure the primary can be at index 2 (of 0-2). If the user wants to scale down replicas, either the replica at index 0 or index 1 needs to go, i.e. there can be missing indexes and the index doesn't say that much anymore.

In this scenario, I would expect a failover to either index 0 or 1, and index 2 is removed.

Collaborator:
Failure-handling logic is going to be addressed in a future PR.

// logical node in the Valkey cluster. The Deployment name encodes the shard
// and node index (e.g. "mycluster-shard0-0", "mycluster-shard1-2"). Two
// labels (valkey.io/shard-index and valkey.io/node-index) provide selector
// uniqueness so no two Deployments fight over the same Pod.
Collaborator:
Not sure about the "fight" comment here; how do two Deployments fight over a pod?

Collaborator:
I believe it means that a Deployment selector will always have a 1:1 mapping to a pod (singleton Deployments).

Contributor (Author):
Without valkey.io/shard-index and valkey.io/node-index, all deployments have overlapping selectors, which means each Deployment's controller will independently count the matching Pods and try to reconcile to its desired replica count. So if Deployment A (replicas=1) and Deployment B (replicas=1) both select the same 2 Pods, each thinks "I want 1 but see 2" and tries to delete one. Then the other sees 0 and creates one. They'd thrash back and forth.

return "", -1
}
if nodeIndex == 0 {
return RolePrimary, shardIndex
Collaborator:
Is this function only expected to be used on pending nodes not yet added?
I guess so, since this is only correct until there is a failover.
We could also label deployments with their expected role, i.e. not using an index but some label describing the expected cluster role.

Collaborator:
> We could also label deployments with their expected role, i.e. not using an index but some label describing the expected cluster role.

I was thinking along this line in the comment below. We may want to query for pods where a failback is due, and so querying for desired-replication-role vs current-replication-role may be useful.

#81 (comment)

Contributor (Author):
This is only used on pending nodes for now. After a node has been bootstrapped (slots assigned, replicas attached), it moves out of PendingNodes into state.Shards and this function is never called for it again.

Collaborator:
After a pod failure involving a primary Valkey node, the newly created pod will also be a pending node.
Since a pod with nodeIndex=1 might have taken over that role in this shard, nodeIndex=0 doesn't mean that it should become a primary, i.e. in most failure cases the name will not be deterministic.
I guess we also need to check the current topology before adding this node.

// 2. Ensure the ConfigMap with valkey.conf and health-check scripts exists
// (upsertConfigMap).
// 3. Ensure one Deployment per (shard, role) pair exists, each named
// deterministically (e.g. mycluster-shard0-primary) and labelled with
Collaborator:
mycluster-shard0-primary here is probably from an old iteration, but I'm in favor of using a label that could be updated after a failover. These need to be updated in the reconciliation loop.
If we go with indexes, are we aiming to force the primary to be placed at index 0, i.e. fail back to index 0 as soon as possible?

Contributor (Author):
Thanks for catching this - this comment needs to be updated. We're no longer putting roles (primary vs replica) in deployments' or pods' names or labels, just shard indexes and node indexes. That said, I’m not planning to add controller logic to force the primary to be placed at index 0, as the benefits seem minimal relative to the added complexity. Users are expected to know that node-index = 0 does not always mean "primary".

Comment on lines 106 to 111
return parseRoleAndShard(pods.Items[idx].Name)
}

// parseRoleAndShard extracts the role and shard index from a pod name like
// "mycluster-shard2-0-abc12-xyz". Node index 0 → primary, 1+ → replica.
func parseRoleAndShard(podName string) (string, int) {
Collaborator:
Since we already have the pod from the calling parent, couldn't we extract the labels on the pod to get the role and shard-index, instead of having to parse the pod name?

@bjosv (Collaborator) commented Feb 13, 2026:
I tested the PR and I think the naming is good. I'm onboard with that.
One problem I see when testing the PR is the expectation that the primary is always on index 0.
When I kill the pod with index 0 (initially the primary), the replica takes over as primary. When the operator creates a new deployment, the newly created pod is added as a primary instead of becoming a replica.
Before this change it was added as a replica. Is this expected?

NAMESPACE                NAME                                                     READY   STATUS    RESTARTS   AGE
default                  pod/valkeycluster-sample-shard0-0-57cb5dc6b5-7qmlb       2/2     Running   0          28s
default                  pod/valkeycluster-sample-shard0-1-7484b74f7d-f2spj       2/2     Running   0          6m9s
default                  pod/valkeycluster-sample-shard1-0-775d7dfbbb-tn5jg       2/2     Running   0          11m
...
---- valkeycluster-sample-shard0-0-57cb5dc6b5-7qmlb  10.244.0.14  cluster_state:ok
201ade93737726ee50e97288d88405c9e3a6c0a8 10.244.0.12:6379@16379 master - 0 1770975863088 7 connected 0-5460
c705d86c46105658988c6780564709c49c76d576 10.244.0.9:6379@16379 slave 835d2d404f685b6afa8c3bdfa5698a8528dd385a 0 1770975862281 1 connected
ea9a2e0efe345ec2ba15adcf210e64491f4780ba 10.244.0.11:6379@16379 master - 0 1770975863088 0 connected 10922-16383
835d2d404f685b6afa8c3bdfa5698a8528dd385a 10.244.0.8:6379@16379 master - 0 1770975862686 1 connected 5461-10921
824c398d0eec928d570d824215ebe96e3811d2d8 10.244.0.14:6379@16379 myself,master - 0 0 8 connected
f493e8b32e40848a17fd20702a8811b322ea4e6b 10.244.0.10:6379@16379 slave ea9a2e0efe345ec2ba15adcf210e64491f4780ba 0 1770975862281 0 connected

Unfortunately, the part of the testcase that tests deleting a primary was temporarily disabled in #54 , so that's why we don't see it in our e2e tests.

@jdheyburn (Collaborator):
My preference is that we remove "shard" from the pod name. We don't have "pod_index<N>"; that much is learned already. I think simplifying down to valkeycluster-sample-0-0 would be sufficient.

@jdheyburn (Collaborator):
> I tested the PR and I think the naming is good. […] When the operator creates a new deployment the newly created pod is added as a primary instead of becoming a replica. Before this change it was added as a replica. Is this expected? […]

@bjosv Could we fix this in a follow up PR?

@bjosv (Collaborator) commented Feb 13, 2026:

> @bjosv Could we fix this in a follow up PR?

Sure.

We should update the top headline of the PR (it becomes the commit message).
I believe the following text is from a previous iteration: "addValkeyNode reads these labels to determine..".

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang (Contributor, Author) commented Feb 13, 2026:

> I tested the PR and I think the naming is good. […] When I kill the pod with index 0 (initially the primary) the replica is taking over as primary. When the operator creates a new deployment the newly created pod is added as a primary instead of becoming a replica. Before this change it was added as replica. Is this expected? […]
  1. We've established that node-index = 0 does not necessarily mean primary; users are expected to be aware of the consequences of failover.
  2. This implementation does not fully support failover, which I believe needs a dedicated PR (or PR stack). The current PR is already big as is.

@bjosv (Collaborator) left a review:

LGTM!
Support of primary failure handling is to be re-added in separate PR.

@bjosv merged commit 02eedaf into valkey-io:main on Feb 16, 2026
4 checks passed