
(rfc): Design Proposal for additional CRDs #63

Open

jdheyburn wants to merge 14 commits into valkey-io:main from jdheyburn:jdheyburn/rfc/crd-design

Conversation

@jdheyburn (Collaborator) commented Jan 23, 2026:

Would love to get as many opinions on this.

Please leave comments on the PR where appropriate.

I don't anticipate merging this in its current form; rather, once we have agreed on a design for each CRD, they can be merged individually (or discussed in Discussions).

There are still some CRD design guidelines that I need to enforce:

  • Nested sub-problems
  • Avoid bool fields
  • Use +listType=map
  • Conditions use metav1.Condition
  • observedGeneration
  • Enum extensibility documented
  • No cross-namespace refs by default
  • Sentence test for naming
  • Status uses adjectives/past-tense
  • Glossary defined
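
To make a few of these concrete (conditions via `metav1.Condition`, `observedGeneration`, and past-tense reason naming), a hypothetical status block might look like the following. The field values are illustrative only, not the final schema:

```yaml
# Hypothetical ValkeyCluster status; illustrates the guidelines above,
# not a proposed schema.
status:
  observedGeneration: 3
  conditions:
    - type: Ready
      status: "True"
      reason: ShardsReconciled        # past-tense, adjective-style reason
      message: "All shards healthy, all slots assigned"
      lastTransitionTime: "2026-01-23T00:00:00Z"
      observedGeneration: 3           # condition tracks the generation it observed
```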

The guidelines:

Here's a mock diagram of the CRD relationships. ValkeyCluster is similar, but it only provisions ValkeyNode.


  ┌─────────────────────────────────────────────────────────────────────────────┐
  │                              ValkeyPool                                     │
  │                              (cache)                                        │
  │  ┌─────────────────────────────────────────────────────────────────────┐   │
  │  │ spec:                                                                │   │
  │  │   shards: 4                                                          │   │
  │  │   replicasPerShard: 2                                                │   │
  │  │   sentinel:                                                          │   │
  │  │     enabled: true ─────────────────────────────────┐                 │   │
  │  │   template: ...                                    │                 │   │
  │  └────────────────────────────────────────────────────│─────────────────┘   │
  └───────────────────────────────────────────────────────│─────────────────────┘
                            │                             │
                            │ creates                     │ creates
                            ▼                             ▼
  ┌─────────────────────────────────────────┐   ┌─────────────────────────┐
  │              Valkey (cache-0)           │   │    ValkeySentinel       │
  │  ┌───────────────────────────────────┐  │   │    (cache-sentinel)     │
  │  │ spec:                             │  │   │  ┌───────────────────┐  │
  │  │   replicas: 2                     │  │   │  │ spec:             │  │
  │  │   failover:                       │  │   │  │   replicas: 3     │  │
  │  │     mode: sentinel                │  │   │  └───────────────────┘  │
  │  │     sentinel:                     │  │   │                         │
  │  │       ref: ───────────────────────│──│───│───▶ (referenced)        │
  │  │         name: cache-sentinel      │  │   │                         │
  │  │       config:                     │  │   │  Creates 3 Sentinel     │
  │  │         quorum: 2                 │  │   │  pods that monitor      │
  │  │         downAfterMillis: 30000    │  │   │  all Valkey instances   │
  │  └───────────────────────────────────┘  │   │  referencing it         │
  │                   │                      │   └─────────────────────────┘
  │                   │ creates              │               ▲
  │                   ▼                      │               │
  │  ┌───────────────────────────────────┐  │               │
  │  │ ValkeyNode (cache-0-primary)      │  │               │
  │  │ ValkeyNode (cache-0-replica-0)    │  │               │
  │  │ ValkeyNode (cache-0-replica-1)    │  │               │
  │  └───────────────────────────────────┘  │               │
  └─────────────────────────────────────────┘               │
                                                            │
  ┌─────────────────────────────────────────┐               │
  │              Valkey (cache-1)           │               │
  │  ┌───────────────────────────────────┐  │               │
  │  │ spec.failover.sentinel.ref: ──────│──│───────────────┘
  │  │   name: cache-sentinel            │  │
  │  └───────────────────────────────────┘  │
  │                   │                      │
  │                   ▼                      │
  │  ┌───────────────────────────────────┐  │
  │  │ ValkeyNode (cache-1-primary)      │  │
  │  │ ValkeyNode (cache-1-replica-0)    │  │
  │  │ ValkeyNode (cache-1-replica-1)    │  │
  │  └───────────────────────────────────┘  │
  └─────────────────────────────────────────┘

    ... (cache-2, cache-3 similar)

@jdheyburn added the labels documentation (Improvements or additions to documentation) and question (Further information is requested) on Jan 23, 2026.

### Design principles

TODO comment to ask if there are any others that should be included
jdheyburn (Collaborator, Author):

Anything else I might've missed?


### Replication and Failover Semantics

TODO these semantics needs to be finalised
jdheyburn (Collaborator, Author):

We'll need another RFC to discuss how we want replication, failover, etc. to be managed. I have a great understanding of how we would do this for Sentinel, but I might need some help when it is operator-managed.

Signed-off-by: Joseph Heyburn <jdheyburn@gmail.com>

@jdheyburn force-pushed the jdheyburn/rfc/crd-design branch from 9fae52d to 1406af5 (January 23, 2026 14:17)
@jdheyburn force-pushed the jdheyburn/rfc/crd-design branch from d43e773 to d641add (January 24, 2026 18:27)
@jdheyburn force-pushed the jdheyburn/rfc/crd-design branch from 7cbb7e6 to 99f5047 (January 26, 2026 21:26)
@ysqyang (Contributor) commented Jan 29, 2026:

Are we allowing users to create ValkeyNode directly? I see that it's listed as an "internal" CRD in the doc but just wanted to clarify.

@jdheyburn (Collaborator, Author) replied:

@ysqyang I would not expect users to create them; they are managed by the Operator. Users can query ValkeyNode via the CLI to get its status. For example:

```
kubectl get valkeynode --selector valkey.io/valkeycluster=mycluster
NAME               ROLE
valkey-master-0    master
valkey-replica-0   replica
```

jdheyburn and others added 2 commits February 4, 2026 17:35
Documents the approved design for implementing ValkeyNode CRD with:
- StatefulSet-based singleton workloads
- Headless Service for stable DNS identity
- Minimal operational status fields
- Unit + integration testing strategy

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Prevents worktree contents from being tracked.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
type: ClusterIP
annotations: {}

# Valkey-specific configuration (only when type=valkey)
Reviewer (Collaborator):

Not sure what type=valkey means here?

jdheyburn (Collaborator, Author) replied:

I am removing it. In a previous iteration I had Sentinel config here too, but I then decided to keep ValkeyNode exclusively for a Valkey server.


# Valkey-specific configuration (only when type=valkey)
valkeyConfig:
# Cluster configuration (presence-based)
Reviewer (Collaborator):

Shouldn't this be used for standalone mode as well as in cluster mode, i.e. to set a config with some options that a user wants?

jdheyburn (Collaborator, Author) replied:

valkeyConfig (or config) is defined in standalone mode too. This comment refers to the cluster section below it.

# Cluster configuration (presence-based)
# If set, node runs in cluster mode with assigned slots
# Omit entirely for standalone/replicated mode
# cluster:
Reviewer (Collaborator):

Slots are provided to the Valkey Cluster instance once the controller has decided that this specific pod should be a primary; the controller then connects to the valkey process via a client and sends commands to add slots.
If we want to use configs as an extra step, it makes it a bit more complex, I think. Something needs to give commands to the instance.

jdheyburn (Collaborator, Author) replied:

In either case, the slots allocated to a ValkeyNode would need to be defined somewhere in the spec; the ValkeyNode controller can then issue the commands to the node. Does that sound about right?

Perhaps this section can be reworked to:

```yaml
# saved to ConfigMap, mounted to pod at valkey.conf
# also 
config:
  key: value

# if defined, this is in cluster mode, and issues the commands to the cluster server
cluster:
  slots: 0-10000
```
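
If the `slots` field does end up as a range string like `0-10000`, the controller would need to parse it before issuing commands. A minimal sketch of that parsing, assuming a simple `start-end` format (`parseSlots` is a hypothetical helper, not part of the proposed API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSlots parses a "start-end" hash-slot range such as "0-10000",
// validating it against the 0..16383 cluster slot space.
func parseSlots(s string) (int, int, error) {
	parts := strings.SplitN(s, "-", 2)
	if len(parts) != 2 {
		return 0, 0, fmt.Errorf("invalid slot range %q", s)
	}
	start, err := strconv.Atoi(parts[0])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid start slot in %q: %w", s, err)
	}
	end, err := strconv.Atoi(parts[1])
	if err != nil {
		return 0, 0, fmt.Errorf("invalid end slot in %q: %w", s, err)
	}
	if start < 0 || end > 16383 || start > end {
		return 0, 0, fmt.Errorf("slot range %q out of bounds", s)
	}
	return start, end, nil
}

func main() {
	start, end, err := parseSlots("0-10000")
	fmt.Println(start, end, err)
}
```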

message: "All shards healthy, all slots assigned"

totalShards: 3
readyShards: 3
Reviewer (Collaborator):

It's tough for a cluster user to get a grip on the fact that we are using shard from the Valkey domain, but replicas is from the K8s domain and not Valkey... the primary-replica concept is so deeply rooted.

jdheyburn (Collaborator, Author) replied:

I think it is a grey area, and one where I am happy to go with the consensus. We could have replicasPerShard or podsPerShard. For now, I wanted to reuse the terminology from the K8s domain.

**ValkeyNode naming:** `<cluster-name>-<shard-index>-<replica-index>`

Examples:
- `mycluster-0-0`: Shard 0, replica index 0 (primary)
Reviewer (Collaborator):

I think we should use a short hash here instead of indexes; at the least, we shouldn't give an example where replica index 0 is always a primary. The primary will move when there is a failure.

jdheyburn (Collaborator, Author) replied:

I think we need a discussion about how the pods are named; everyone has their own opinion on it.


**Formula:**
```go
slotsPerShard := 16384 / shards
```
Reviewer (Collaborator):

Just a feeling, but maybe we could skip code in this doc to tighten it; it's a big document.

jdheyburn (Collaborator, Author) replied:

Very fair. I'll work on condensing it down so that it focuses on CRD spec.
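
For reference, the `slotsPerShard` formula discussed above can be sketched end to end. How the leftover `16384 % shards` slots are handled (given one each to the first shards here) is an assumption for illustration, not settled design:

```go
package main

import "fmt"

// slotRanges splits the 16384 cluster hash slots across shards:
// each shard gets 16384 / shards slots, and the first 16384 % shards
// shards absorb one extra slot each so every slot is assigned.
func slotRanges(shards int) [][2]int {
	ranges := make([][2]int, shards)
	slotsPerShard := 16384 / shards
	remainder := 16384 % shards
	start := 0
	for i := 0; i < shards; i++ {
		size := slotsPerShard
		if i < remainder {
			size++ // first `remainder` shards take one extra slot
		}
		ranges[i] = [2]int{start, start + size - 1}
		start += size
	}
	return ranges
}

func main() {
	for i, r := range slotRanges(3) {
		fmt.Printf("shard %d: slots %d-%d\n", i, r[0], r[1])
	}
}
```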

3 participants