Use deterministic deployment names to encode shard topology #81

Merged
bjosv merged 5 commits into valkey-io:main from ysqyang:shard-labels
Feb 16, 2026
Conversation

@ysqyang (Contributor) commented Feb 9, 2026:

Context

#55. This is a precursor to the resharding PR: #60

What changed

  • Deployments are now deterministically named (<cluster>-<N>-<M>, e.g. mycluster-0-0) instead of using GenerateName with random suffixes.
  • Each Deployment and its pod template are stamped with valkey.io/shard-index and valkey.io/node-index labels at creation time. By convention, node 0 is the initial primary and nodes 1+ are replicas.
  • addValkeyNode reads the pod's node-index label (via podRoleAndShard) to determine whether a pending node should become a primary (assign slots) or a replica (CLUSTER REPLICATE to the matching shard's primary).
  • A new pickPendingNode helper ensures primaries are processed before replicas, since CLUSTER REPLICATE requires the primary to already have slots.
  • Slot-assignment and replication logic are extracted into dedicated assignSlotsToNewPrimary and replicateToShardPrimary functions.

Why

The old code guessed node roles by counting existing shards — if fewer shards existed than desired, the next pending node became a primary; otherwise it became a replica. This heuristic was fragile: it could misassign roles when pods started out of order or when gossip hadn't yet propagated topology changes. Labels make the intended role explicit at Deployment creation time, removing all guesswork from the reconciler.

Deterministic names make it easy to correlate pods, deployments, and Valkey nodes during debugging (kubectl get pods immediately shows which shard and position each pod belongs to), and make upsertDeployments idempotent — it simply tries to create each name and ignores AlreadyExists.

Testing

[Screenshot 2026-02-13 at 9:19:04 AM]

@ysqyang force-pushed the shard-labels branch 2 times, most recently from 22275a6 to b16818e on February 9, 2026 at 18:51
Deployments and their pod templates are now stamped with
valkey.io/shard-index and valkey.io/role labels when created by
upsertDeployments. addValkeyNode reads these labels to determine
whether a pending node should become a primary (assign slots) or a
replica (CLUSTER REPLICATE to the matching shard's primary), removing
the heuristic that previously guessed roles by counting existing
shards.

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang marked this pull request as ready for review February 10, 2026 00:36
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang changed the title from "label pods with shard-index and role at creation time" to "Add shard index and node index to deployment/pod names and labels" on Feb 11, 2026
@ysqyang changed the title from "Add shard index and node index to deployment/pod names and labels" to "Use deterministic deployment names to encode shard topology" on Feb 11, 2026
// <cluster>-shard<N>-<M> e.g. "mycluster-shard0-0", "mycluster-shard1-2"
//
// By convention, node 0 is the initial primary and nodes 1, 2, … are replicas.
// The name deliberately avoids "primary"/"replica" because failover can swap
Collaborator:
I also think it's good that we avoid primary/replica in the name.
We only have to keep in mind, as stated here, that after a failure the primary can be at index 2 (of 0-2).
If the user wants to scale down replicas, either the replica at index 0 or index 1 needs to go, i.e. there can be missing indexes and the index doesn't say that much anymore.

In StatefulSets, indexes are used since it's always the last index/replica that is removed, but in Deployments, where any replica can be removed (depending on other factors), they have chosen to use a unique name/hash instead.
I hesitate a bit to use "index" instead of a random suffix, but I understand the reason for using it.

Collaborator:

> We only have to keep in mind, as stated here, that after a failure the primary can be at index 2 (of 0-2). If the user wants to scale down replicas, either the replica at index 0 or index 1 needs to go, i.e. there can be missing indexes and the index doesn't say that much anymore.

In this scenario, I would expect a failover to either index 0 or 1, and index 2 is removed.

Collaborator:
Failure-handling logic is going to be addressed in a future PR.

// logical node in the Valkey cluster. The Deployment name encodes the shard
// and node index (e.g. "mycluster-shard0-0", "mycluster-shard1-2"). Two
// labels (valkey.io/shard-index and valkey.io/node-index) provide selector
// uniqueness so no two Deployments fight over the same Pod.
Collaborator:
Not sure about the "fight" comment here; how do two Deployments fight over a pod?

Collaborator:
I believe it means that a Deployment selector will always have a 1:1 mapping to a pod (singleton Deployments).

Contributor (Author):
Without valkey.io/shard-index and valkey.io/node-index, all deployments have overlapping selectors, which means each Deployment's controller will independently count the matching Pods and try to reconcile to its desired replica count. So if Deployment A (replicas=1) and Deployment B (replicas=1) both select the same 2 Pods, each thinks "I want 1 but see 2" and tries to delete one. Then the other sees 0 and creates one. They'd thrash back and forth.

return "", -1
}
if nodeIndex == 0 {
return RolePrimary, shardIndex
Collaborator:
Is this function only expected to be used on pending nodes not yet added?
I guess so, since this is only correct until there is a failover.
We could also label deployments with their expected role, i.e. not using an index but some label describing the expected cluster role.

Collaborator:
> We could also label deployments with their expected role, i.e. not using an index but some label describing the expected cluster role.

I was thinking along this line in the comment below. We may want to query for pods where a failback is due, and so querying for desired-replication-role vs current-replication-role may be useful.

#81 (comment)

Contributor (Author):
This is only used on pending nodes for now. After a node has been bootstrapped (slots assigned, replicas attached), it moves out of PendingNodes into state.Shards and this function is never called for it again.

Collaborator:
After a pod failure involving a primary Valkey node, the newly created pod will also be a pending node.
Since a pod with nodeIndex=1 might have taken over that role in this shard, nodeIndex=0 doesn't mean that it should become a primary, i.e. in most failure cases the name will not be deterministic.
I guess we also need to check the current topology before adding this node.

// 2. Ensure the ConfigMap with valkey.conf and health-check scripts exists
// (upsertConfigMap).
// 3. Ensure one Deployment per (shard, role) pair exists, each named
// deterministically (e.g. mycluster-shard0-primary) and labelled with
Collaborator:
mycluster-shard0-primary here is probably from an old iteration, but I'm in favor of using a label that could be updated after a failover. These need to be updated in the reconciliation loop.
If we go with indexes, are we aiming to force the primary to be placed at index 0, i.e. fail back to index 0 as soon as possible?

Contributor (Author):
Thanks for catching this - this comment needs to be updated. We're no longer putting roles (primary vs replica) in deployments' or pods' names or labels, just shard indexes and node indexes. That said, I’m not planning to add controller logic to force the primary to be placed at index 0, as the benefits seem minimal relative to the added complexity. Users are expected to know that node-index = 0 does not always mean "primary".

Comment on lines 106 to 111
return parseRoleAndShard(pods.Items[idx].Name)
}

// parseRoleAndShard extracts the role and shard index from a pod name like
// "mycluster-shard2-0-abc12-xyz". Node index 0 → primary, 1+ → replica.
func parseRoleAndShard(podName string) (string, int) {
Collaborator:
Since we already have the pod from the calling parent, couldn't we extract the labels on the pod to get the role and shard-index, instead of having to parse the pod name?

@bjosv (Collaborator) commented Feb 13, 2026:
I tested the PR and I think the naming is good. I'm onboard with that.
One problem I see when testing the PR is the expectation that the primary is always on index 0.
When I kill the pod with index 0 (initially the primary), the replica takes over as primary. When the operator creates a new deployment, the newly created pod is added as a primary instead of becoming a replica.
Before this change it was added as a replica. Is this expected?

NAMESPACE                NAME                                                     READY   STATUS    RESTARTS   AGE
default                  pod/valkeycluster-sample-shard0-0-57cb5dc6b5-7qmlb       2/2     Running   0          28s
default                  pod/valkeycluster-sample-shard0-1-7484b74f7d-f2spj       2/2     Running   0          6m9s
default                  pod/valkeycluster-sample-shard1-0-775d7dfbbb-tn5jg       2/2     Running   0          11m
...
---- valkeycluster-sample-shard0-0-57cb5dc6b5-7qmlb  10.244.0.14  cluster_state:ok
201ade93737726ee50e97288d88405c9e3a6c0a8 10.244.0.12:6379@16379 master - 0 1770975863088 7 connected 0-5460
c705d86c46105658988c6780564709c49c76d576 10.244.0.9:6379@16379 slave 835d2d404f685b6afa8c3bdfa5698a8528dd385a 0 1770975862281 1 connected
ea9a2e0efe345ec2ba15adcf210e64491f4780ba 10.244.0.11:6379@16379 master - 0 1770975863088 0 connected 10922-16383
835d2d404f685b6afa8c3bdfa5698a8528dd385a 10.244.0.8:6379@16379 master - 0 1770975862686 1 connected 5461-10921
824c398d0eec928d570d824215ebe96e3811d2d8 10.244.0.14:6379@16379 myself,master - 0 0 8 connected
f493e8b32e40848a17fd20702a8811b322ea4e6b 10.244.0.10:6379@16379 slave ea9a2e0efe345ec2ba15adcf210e64491f4780ba 0 1770975862281 0 connected

Unfortunately, the part of the testcase that tests deleting a primary was temporarily disabled in #54 , so that's why we don't see it in our e2e tests.

@jdheyburn (Collaborator):
My preference is that we remove "shard" from the pod name. We don't have "pod_index<N>"; that much is learned already. I think simplifying down to valkeycluster-sample-0-0 would be sufficient.

@jdheyburn (Collaborator):
> I tested the PR and I think the naming is good. […] When the operator creates a new deployment the newly created pod is added as a primary instead of becoming a replica. Before this change it was added as a replica. Is this expected? […]

@bjosv Could we fix this in a follow up PR?

@bjosv (Collaborator) commented Feb 13, 2026:

> @bjosv Could we fix this in a follow up PR?

Sure.

We should update the top headline of the PR (it becomes the commit message).
I believe the following text is from a previous iteration: "addValkeyNode reads these labels to determine..".

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
@ysqyang (Contributor, Author) commented Feb 13, 2026:

> I tested the PR and I think the naming is good. […] When I kill the pod with index 0 (initially the primary) the replica is taking over as primary. When the operator creates a new deployment the newly created pod is added as a primary instead of becoming a replica. Before this change it was added as replica. Is this expected? […]
  1. We've established that node-index = 0 does not necessarily mean primary; users are expected to be aware of the consequences of failover.
  2. This implementation does not fully support failover, which I believe needs a dedicated PR (or PR stack). The current PR is already big as is.

@bjosv (Collaborator) left a review:

LGTM!
Support of primary failure handling is to be re-added in separate PR.

@bjosv merged commit 02eedaf into valkey-io:main on Feb 16, 2026
4 checks passed