# Pool-based Rack Awareness

Rack-awareness in Strimzi Kafka can be achieved without broker access to cluster-level Kubernetes APIs
by assigning one node pool per rack/availability zone.

## Current situation

In Kubernetes clusters spanning multiple availability zones, Kafka can tolerate the loss of an
entire zone by ensuring partition replicas are spread across brokers in multiple zones.

This is achieved by configuring the [`broker.rack`](https://kafka.apache.org/documentation/#brokerconfigs_broker.rack)
property in Kafka to enable rack-aware replication assignment, where a rack is analogous to an availability zone.

Strimzi currently provides rack-awareness through an init container that queries the
Kubernetes API for the value of a specified topology label on the Kubernetes node where the
broker is running.
This requires the broker pods to have access to the Kubernetes API and requires cluster-scoped RBAC.
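
For reference, the existing rack-awareness configuration only sets a topology key on the `Kafka` CR; a minimal fragment (other `Kafka` fields omitted) looks like this:

```yaml
spec:
  kafka:
    rack:
      # The init container looks up this label on the broker's Kubernetes node
      # and the value is used for the Kafka broker.rack property
      topologyKey: topology.kubernetes.io/zone
```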

## Motivation

For users interested in a heightened security posture, the requirements of the current rack-awareness
implementation are prohibitive.
Many users may require adherence to the [separation of duty security principle](https://csrc.nist.gov/glossary/term/separation_of_duty)
under which application pods processing user data should not have access to the Kubernetes API.
All usage of the Kubernetes API must then be delegated to the operator.

Comment on lines +23 to +25

Member:
Sorry, but there is nothing like that being said in the link.

Author:

The linked definition from NIST discusses separation of duty at a higher level, not specific to Kubernetes.

In our specific Kubernetes case, we are separating two different roles for two different entities:

* Operators which act on the Kubernetes API and therefore have associated RBAC
* Operands which process/manage data and do not have access to the Kubernetes API

If you feel the link is misleading, I can remove it.

The underlying principle is one IBM has discussed with many enterprise clients in highly regulated industries (e.g. financial services, telecommunications, etc.). I'm not sure if the specific requirements are publicly documented by those companies.

Member:

I do think it is absolutely misleading. And I do not think it is any principle Strimzi aims to follow.

Also, keep in mind that the broker also reads Kubernetes Secrets, for example. So even if you want to follow your interpretation of this rule, this proposal won't help you much.

Author:

Given there are broader concerns as you note with limiting broker access to the Kubernetes API, I've opened https://github.com/orgs/strimzi/discussions/12086 to focus on that. I can remove this as a motivation from the proposal and focus on the motivation of removing cluster-scoped access.

Member:

Well, in general it is always better to start with a discussion to see if there is actual interest rather than opening a proposal out of the blue. Keep in mind that the motivation might make sense to you. But it also needs to make sense to the community and the people working on the projects day to day in order to justify the effort it brings.

Author:

I apologize for misunderstanding the process here -- I thought the proposal would be the avenue to start the discussion.

I'll try to improve the proposal to make the motivation easier to understand for others. I also want to note that I am happy to help implement the necessary changes. I do recognize though there is still a cost on the maintainers to review PRs and ensure continued functionality going forward.

Member:

It is one of the ways to start the discussion, but a pretty lengthy one. Starting a GitHub discussion, opening an issue where you describe the feature you are interested in, asking on Slack, or raising it on the community call often makes it much faster and easier to get some quick feedback. (But don't get me wrong - it is not like you broke any special rules or did something wrong. It just might not be the most efficient way.)


Furthermore, many Kubernetes cluster administrators may restrict access to cluster-scoped Kubernetes
resources to ensure an application and the user managing it are contained within a limited set of namespaces.
Today, Strimzi requires access to cluster-scoped Kubernetes resources only for rack-awareness and NodePort
listener configuration.

Implementing the proposed method for pool-based rack awareness removes these potentially prohibitive
requirements while maintaining a simple deployment model.

## Proposal

The `Kafka` CR will be updated to include an `idType` field that specifies the type of rack ID
to be configured.

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  annotations:
    strimzi.io/node-pools: enabled
  name: my-cluster
spec:
  kafka:
    rack:
      idType: pool-name
```

Member:
If there were some reasonable justification for changing this, then it would make much more sense to configure it in a separate field than to hardcode it to the node pool name, which is very limited.

Author:

Would this involve moving the API change to the KafkaNodePool CR and allow specifying an arbitrary rack ID for each pool?

Member:

I'm not sure what you mean by that, sorry. But the node pool name is clearly not the right determinant as there would be many reasons for multiple pools in a single zone.


The field will be an optional enumeration type with the following values:

* `node-label` (default)
  * This rack ID type maintains the existing behavior where rack IDs are configured using a node label
  * The node label is determined using the existing `topologyKey` field
* `pool-name`
  * This rack ID type will use the node pool name as the rack ID
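
For illustration, the default `node-label` type spelled out explicitly might look like the following sketch; `idType` is the field proposed here, so this syntax does not exist yet and the values are examples:

```yaml
spec:
  kafka:
    rack:
      # Proposed field; node-label would be the default and preserves today's behavior
      idType: node-label
      topologyKey: topology.kubernetes.io/zone
```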

When using the `pool-name` rack ID type, users must configure pod affinity and anti-affinity in the
pod template of each node pool to ensure:

* Brokers within a pool are scheduled in the same availability zone
* Brokers in different pools are scheduled in different availability zones

This affinity would be configured in each `KafkaNodePool`, for example:

```yaml
apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaNodePool
metadata:
  name: my-zone0-pool
  labels:
    strimzi.io/cluster: my-cluster
spec:
  template:
    pod:
      affinity:
        podAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 50
              podAffinityTerm:
                labelSelector:
                  matchLabels:
                    strimzi.io/cluster: my-cluster
                    strimzi.io/pool-name: my-zone0-pool
                topologyKey: topology.kubernetes.io/zone
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 90
              podAffinityTerm:
                labelSelector:
                  matchExpressions:
                    - key: strimzi.io/cluster
                      operator: In
                      values:
                        - my-cluster
                    - key: strimzi.io/pool-name
                      operator: NotIn
                      values:
                        - my-zone0-pool
                topologyKey: topology.kubernetes.io/zone
```

## Affected/not affected projects

Only the [strimzi-kafka-operator](https://github.com/strimzi/strimzi-kafka-operator/) would be affected:

* Changes to the `Rack` API type
  * New optional `idType` field
  * Change `topologyKey` field from required to optional
* Changes to the cluster operator to:
  * Only create the ClusterRoleBinding when using the `node-label` rack ID type or when using a NodePort listener
  * Modify the `KafkaBrokerConfigurationBuilder` to use the pool name as the rack ID when the rack ID type is `pool-name`
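
As an illustration of the `KafkaBrokerConfigurationBuilder` change above, a broker in a pool named `my-zone0-pool` would have its rack derived from the pool name; a sketch of the resulting broker property (proposed behavior, not what Strimzi generates today):

```properties
broker.rack=my-zone0-pool
```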

## Compatibility

This proposal maintains CRD compatibility by introducing a new, optional field.
All existing configurations would continue to be valid and maintain their existing behavior.

## Rejected alternatives

Member:
Please keep in mind that there are also some new Kubernetes features coming to the downward API as discussed in strimzi/strimzi-kafka-operator#11504. If nothing else, that should be mentioned here. But we likely might want to wait to see how it turns out.


* Continue with existing rack awareness support using Kubernetes API access in a broker init container
  * This alternative was rejected due to the heightened security requirements of some users, as outlined in the [Motivation](#motivation)
* Use node topology labels via the downward API as suggested in [strimzi-kafka-operator#11504](https://github.com/strimzi/strimzi-kafka-operator/issues/11504)
  * This alternative was rejected due to a reliance on a Kubernetes feature that is only in alpha as of 1.33