Use deterministic deployment names to encode shard topology #81

bjosv merged 5 commits into valkey-io:main from
Conversation
Deployments and their pod templates are now stamped with valkey.io/shard-index and valkey.io/role labels when created by upsertDeployments. addValkeyNode reads these labels to determine whether a pending node should become a primary (assign slots) or a replica (CLUSTER REPLICATE to the matching shard's primary), removing the heuristic that previously guessed roles by counting existing shards.

Signed-off-by: yang.qiu <yang.qiu@reddit.com>
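The label-driven role decision described in the commit message can be sketched roughly as follows. The helper name, constants, and plain-map label access are simplified assumptions for illustration, not the actual controller code:

```go
package main

import "fmt"

const (
	RolePrimary = "primary"
	RoleReplica = "replica"
)

// desiredRole derives a pending node's intended role purely from the
// labels stamped at Deployment creation time, instead of guessing by
// counting existing shards. Hypothetical sketch, not the PR's code.
func desiredRole(labels map[string]string) string {
	if labels["valkey.io/node-index"] == "0" {
		return RolePrimary // node 0 is the initial primary by convention
	}
	return RoleReplica // nodes 1+ replicate to the shard's primary
}

func main() {
	fmt.Println(desiredRole(map[string]string{"valkey.io/node-index": "0"})) // prints: primary
	fmt.Println(desiredRole(map[string]string{"valkey.io/node-index": "2"})) // prints: replica
}
```

Because the decision depends only on labels set once at creation, it stays stable even when pods start out of order.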
```go
// <cluster>-shard<N>-<M> e.g. "mycluster-shard0-0", "mycluster-shard1-2"
//
// By convention, node 0 is the initial primary and nodes 1, 2, … are replicas.
// The name deliberately avoids "primary"/"replica" because failover can swap
```
I also think it's good that we avoid primary/replica in the name.
We only have to keep in mind, as stated here, that after a failure the primary can be at index 2 (of 0-2).
If the user wants to scale down replicas, either the replica at index 0 or index 1 needs to go, i.e. there can be missing indexes and the index doesn't say that much anymore.
In StatefulSets indexes are used since it's always the last index/replica that is removed, but in Deployments, where any replica can be removed (depending on other factors), they have chosen to use a unique name/hash instead.
I hesitate a bit to use "index" instead of a random suffix, but I understand the reason for using it.
> We only have to keep in mind, as stated here, that after a failure the primary can be at index 2 (of 0-2).
> If the user wants to scale down replicas, either the replica at index 0 or index 1 needs to go, i.e. there can be missing indexes and the index doesn't say that much anymore.

In this scenario, I would expect a failover to either index 0 or 1, and index 2 is removed.
Failover logic is going to be addressed in a future PR.
```go
// logical node in the Valkey cluster. The Deployment name encodes the shard
// and node index (e.g. "mycluster-shard0-0", "mycluster-shard1-2"). Two
// labels (valkey.io/shard-index and valkey.io/node-index) provide selector
// uniqueness so no two Deployments fight over the same Pod.
```
Not sure about the "fight" comment here; how do two Deployments fight over a pod?
I believe it means that a Deployment selector will always have a 1:1 mapping to a pod (singleton Deployments).
Without valkey.io/shard-index and valkey.io/node-index, all deployments have overlapping selectors, which means each Deployment's controller will independently count the matching Pods and try to reconcile to its desired replica count. So if Deployment A (replicas=1) and Deployment B (replicas=1) both select the same 2 Pods, each thinks "I want 1 but see 2" and tries to delete one. Then the other sees 0 and creates one. They'd thrash back and forth.
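A minimal sketch of why the two labels give each singleton Deployment a disjoint selector (the helper name is hypothetical, and the maps stand in for Kubernetes label selectors):

```go
package main

import "fmt"

// uniqueSelector builds the label set that makes each Deployment's selector
// match exactly one logical node. The label keys are the ones added in this
// PR; the helper itself is illustrative, not the controller's code.
func uniqueSelector(shard, node int) map[string]string {
	return map[string]string{
		"valkey.io/shard-index": fmt.Sprintf("%d", shard),
		"valkey.io/node-index":  fmt.Sprintf("%d", node),
	}
}

func main() {
	a := uniqueSelector(0, 0)
	b := uniqueSelector(0, 1)
	// Same shard, different node-index: the selectors no longer overlap,
	// so neither Deployment's controller counts the other's Pod.
	fmt.Println(a["valkey.io/node-index"] != b["valkey.io/node-index"]) // prints: true
}
```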
```go
		return "", -1
	}
	if nodeIndex == 0 {
		return RolePrimary, shardIndex
```
Is this function only expected to be used on pending nodes not yet added?
I guess so, since this is only correct until there is a failover.
We could also label deployments with their expected role, i.e. not using an index but some label describing the expected cluster role.
> We could also label deployments with their expected role, i.e. not using an index but some label describing the expected cluster role.
I was thinking along this line in the below comment. We may want to query for pods where a failback is due, and so querying for desired-replication-role vs current-replication-role may be useful.
This is only used on pending nodes for now. After a node has been bootstrapped (slots assigned, replicas attached), it moves out of PendingNodes into state.Shards and this function is never called for it again.
After a pod failure involving a primary Valkey node, the newly created pod will also be a pending node.
Since a pod with nodeIndex=1 might have taken on that role in this shard, nodeIndex=0 doesn't mean that it should become a primary, i.e. in most failure cases the name will not be deterministic.
I guess we also need to check the current topology before adding this node.
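One way the suggested topology check could look, as a hedged sketch (the `Shard` type and function name are invented for illustration, not code from the PR): before trusting the node index, ask whether the shard already has an acting primary.

```go
package main

import "fmt"

// Shard models just enough topology for the sketch: whether some node in
// the shard already holds slots (i.e. an acting primary exists).
type Shard struct{ HasPrimary bool }

// roleForPendingNode decides the role for a freshly created pod, consulting
// the live topology rather than trusting the node index alone. If a failover
// already promoted another node, a replacement pod with node-index 0 must
// join as a replica.
func roleForPendingNode(nodeIndex int, shard Shard) string {
	if nodeIndex == 0 && !shard.HasPrimary {
		return "primary"
	}
	return "replica"
}

func main() {
	fmt.Println(roleForPendingNode(0, Shard{HasPrimary: false})) // fresh shard: primary
	fmt.Println(roleForPendingNode(0, Shard{HasPrimary: true}))  // post-failover replacement: replica
}
```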
```go
// 2. Ensure the ConfigMap with valkey.conf and health-check scripts exists
//    (upsertConfigMap).
// 3. Ensure one Deployment per (shard, role) pair exists, each named
//    deterministically (e.g. mycluster-shard0-primary) and labelled with
```
mycluster-shard0-primary here is probably from an old iteration, but I'm in favor of using a label that could be updated after a failover. These need to be updated in the reconciliation loop.
If we go for using an index, are we aiming to force the primary to be placed at index 0, i.e. fail back to index 0 as soon as possible?
Thanks for catching this - this comment needs to be updated. We're no longer putting roles (primary vs replica) in deployments' or pods' names or labels, just shard indexes and node indexes. That said, I’m not planning to add controller logic to force the primary to be placed at index 0, as the benefits seem minimal relative to the added complexity. Users are expected to know that node-index = 0 does not always mean "primary".
internal/controller/utils.go
```go
	return parseRoleAndShard(pods.Items[idx].Name)
}

// parseRoleAndShard extracts the role and shard index from a pod name like
// "mycluster-shard2-0-abc12-xyz". Node index 0 → primary, 1+ → replica.
func parseRoleAndShard(podName string) (string, int) {
```
Since we already have the pod from the calling parent, couldn't we extract the labels on the pod to get the role and shard-index, instead of having to parse the pod name?
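A rough sketch of the label-based alternative the reviewer suggests, assuming the labels stamped by upsertDeployments are present on the pod (the function name and error handling are hypothetical):

```go
package main

import (
	"fmt"
	"strconv"
)

// podRoleAndShardFromLabels reads the valkey.io labels already present on
// the pod instead of parsing its name. Sketch of the suggestion, not the
// repository's actual helper.
func podRoleAndShardFromLabels(labels map[string]string) (string, int, error) {
	shard, err := strconv.Atoi(labels["valkey.io/shard-index"])
	if err != nil {
		return "", -1, fmt.Errorf("bad shard-index label: %w", err)
	}
	node, err := strconv.Atoi(labels["valkey.io/node-index"])
	if err != nil {
		return "", -1, fmt.Errorf("bad node-index label: %w", err)
	}
	if node == 0 {
		return "primary", shard, nil
	}
	return "replica", shard, nil
}

func main() {
	role, shard, _ := podRoleAndShardFromLabels(map[string]string{
		"valkey.io/shard-index": "2",
		"valkey.io/node-index":  "0",
	})
	fmt.Println(role, shard) // prints: primary 2
}
```

Reading labels avoids re-deriving structure from a string that also carries a random ReplicaSet/pod suffix.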
I tested the PR and I think the naming is good. I'm on board with that. Unfortunately, the part of the testcase that tests deleting a primary was temporarily disabled in #54, so that's why we don't see it in our e2e tests.

My preference is that we remove

@bjosv Could we fix this in a follow up PR?

Sure. We should update the top headline for the PR (that becomes the commit message).
bjosv left a comment:

LGTM!
Support of primary failure handling is to be re-added in a separate PR.
Context
#55. This is a precursor to the resharding PR: #60
What changed

- Deployments are named deterministically (`<cluster>-<N>-<M>`, e.g. `mycluster-0-0`) instead of using `GenerateName` with random suffixes.
- Deployments and their pod templates are stamped with `valkey.io/shard-index` and `valkey.io/node-index` labels at creation time. By convention, node 0 is the initial primary and nodes 1+ are replicas.
- `addValkeyNode` reads the pod's `node-index` label (via `podRoleAndShard`) to determine whether a pending node should become a primary (assign slots) or a replica (`CLUSTER REPLICATE` to the matching shard's primary).
- A `pickPendingNode` helper ensures primaries are processed before replicas, since `CLUSTER REPLICATE` requires the primary to already have slots.
- New `assignSlotsToNewPrimary` and `replicateToShardPrimary` functions.
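The primaries-before-replicas ordering of `pickPendingNode` could look roughly like this; the type and sorting approach are assumptions for illustration, not the PR's actual implementation:

```go
package main

import (
	"fmt"
	"sort"
)

// PendingNode is a simplified stand-in for the controller's pending-node type.
type PendingNode struct {
	Name      string
	NodeIndex int
}

// pickPendingNode sketches the ordering guarantee: node index 0 (the
// would-be primary) sorts first, so CLUSTER REPLICATE always finds a
// primary that already owns slots. Note this sorts the slice in place.
func pickPendingNode(pending []PendingNode) PendingNode {
	sort.Slice(pending, func(i, j int) bool {
		return pending[i].NodeIndex < pending[j].NodeIndex
	})
	return pending[0]
}

func main() {
	pending := []PendingNode{
		{Name: "mycluster-0-2", NodeIndex: 2},
		{Name: "mycluster-0-0", NodeIndex: 0},
	}
	fmt.Println(pickPendingNode(pending).Name) // prints: mycluster-0-0
}
```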
The old code guessed node roles by counting existing shards — if fewer shards existed than desired, the next pending node became a primary; otherwise it became a replica. This heuristic was fragile: it could misassign roles when pods started out of order or when gossip hadn't yet propagated topology changes. Labels make the intended role explicit at Deployment creation time, removing all guesswork from the reconciler.
Deterministic names make it easy to correlate pods, deployments, and Valkey nodes during debugging (`kubectl get pods` immediately shows which shard and position each pod belongs to), and make `upsertDeployments` idempotent: it simply tries to create each name and ignores `AlreadyExists`.

Testing
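As a quick sanity check of the naming and idempotency behavior described above, a self-contained sketch (the helpers are invented, the map stands in for the API server, and the name follows the `<cluster>-shard<N>-<M>` form used in the diff's code comments):

```go
package main

import "fmt"

// deploymentName builds a deterministic Deployment name for a logical node.
// This follows the "<cluster>-shard<N>-<M>" form from the code comments;
// the helper itself is hypothetical.
func deploymentName(cluster string, shard, node int) string {
	return fmt.Sprintf("%s-shard%d-%d", cluster, shard, node)
}

// ensureDeployment simulates the idempotent upsert: try to create the name
// and treat "already exists" as success, analogous to ignoring an
// AlreadyExists error from the Kubernetes API.
func ensureDeployment(store map[string]bool, name string) {
	if store[name] {
		return // already exists: nothing to do
	}
	store[name] = true
}

func main() {
	store := map[string]bool{}
	name := deploymentName("mycluster", 0, 0)
	ensureDeployment(store, name)
	ensureDeployment(store, name) // second call is a no-op
	fmt.Println(name, len(store)) // prints: mycluster-shard0-0 1
}
```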