Added proposal for auto-rebalance on imbalanced cluster feature in operator #161
Conversation
I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.
I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.
For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on the full mode being set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on that if they need it. If users do need that then we could provide implementations of the various notifiers which extend our notifier rather than the CC one.
I think this is going in the right direction. But I think it needs to go a bit deeper:
- We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing and most of the anomalies are not really anomalies.
- If I read this proposal right, you want to focus on when the cluster is out-of-balance. That is a great start. But perhaps that should not be called `mode: full`? Calling it `full` seems confusing - does it mean that `full` includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle the broken disks or brokers. That might create additional modes probably. So maybe `mode: rebalance` or `mode: skew` or something like that would make more sense?
@scholzj good to know that you like the track we are on now :-) Regarding the "full" related naming, we were just reusing the underlying mode naming for the …
I do not think this works here. The …
| This proposal is about adding support for auto-rebalancing the Kafka cluster in case it gets imbalanced due to some issues like unevenly distributed replicas or overloaded brokers e.t.c.
| When enabled, the Strimzi operator should automatically resolve these issues detected by the Anomaly Detector Manager by running KafkaRebalance via Cruise Control using the KafkaRebalance resource.
| Anomalies are detected by Cruise Control using the anomaly detector manager (see section [ Anomaly Detector Manager](./106-auto-rebalance-on-imbalanced-clusters.md#anomaly-detector-manager) below for a detailed description).
It's a repetition of the above sentence. Maybe you can delete it and add the link to the anomaly detector manager to the previous sentence.
+1
+1
| ## Motivation
|
| Currently, any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.
How is a user notified about anomalies currently? What are you referring to?
Yeah, this isn't enabled by default. The user could configure notification but most (I assume) don't.
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
| It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
|
| ### Introduction to Self Healing
Suggested change:
- ### Introduction to Self Healing
+ ### Introduction to Self Healing in Cruise Control
You need a short intro to why this section is here: "In order to set the context for how we plan to automatically fix unbalanced Kafka clusters, the sections below go over how Cruise Control's anomaly detection and self-healing features work..."
|
| The above flow diagram depicts the self-healing process in Cruise Control.
| The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
| The configured notifiers provides alerts to the users about the detected anomaly and also returns the action that needs to be taken on the anomaly i.e. whether to fix it, ignore it or delay it.
The alert mechanism isn't out of the box. A notifier can have its own logic without generating any alerts. Even just triggering the fix without notifying anyone what's happening. So there is no assumption that a "configured" notifier provides alerts. I think this sentence should say that the notifier makes the decision about the action to take. Then CC provides some notifiers which are able to alert the user in several ways (MS Teams, Slack, etc.).
| If the users really want to have their own way of dealing with the imbalanced clusters then they can just disable auto-rebalance in `skew` mode and use their own notifier.
|
| #### What happens if some unfixable goal violation happens
| In case, there is an unfixable goal violation then the notifier would simply ignore that anomaly and prompt the user about the unfixable violation in the auto-rebalancing status section.
Still need an example here for a better understanding of how this is prompted to the user.
Having Prometheus metrics for such cases might be a reasonable default way?
| #### What happens if same anomaly is detected again while the auto-rebalance is happening
| Since the cluster operator has the knowledge regarding the detected violation, we will ignore the anomalies while the rebalancing is happening. In case the anomaly still exists after the rebalance, Cruise Control will detect it again and a new rebalance would be triggered
so it seems to assume that if a first anomaly is created, the notifier creates the corresponding ConfigMap and the CO takes care of running a rebalancing. While the rebalancing is running, CC detects other anomalies, so the notifier is creating a bunch of other ConfigMaps that the CO is ignoring. Finally, the rebalancing ends ... the CO will find all these ConfigMaps ... what's it going to do? This is where, if it takes care of them, we could:
- lose the priority of them (ConfigMaps don't have priority)
- the old anomalies could have been fixed by the previous rebalancing so it's useless handling them
I think the best option in this case would be to ignore the configmap and also delete it at the same time. I think I didn't mention it here, which is my mistake, but later in the flowchart I say that if anomalies are detected while a rebalance is happening, we will just ignore that configmap and delete it.
| * from **RebalanceOnScaleDown** to:
|   * **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
|   * **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
|   * **RebalanceOnAnomalyDetection**: if a configmap related to goal violation was detected. It will run once the queued scale down and scale up is completed
are we really sure that if a rebalance is running for scale up or scale down, after that we should take care of the anomaly? Is it possible that the anomaly was somehow fixed because of the auto-rebalancing due to scale up or down? My gut feeling is that we could avoid taking care of an anomaly, because if the problem is still in place it will be raised again by CC and then we'll deal with it. @tomncooper @scholzj wdyt?
I think this ties into your question above Paolo, what happens if a load of Anomaly CMs stack up while you are waiting for a scale up or scale down rebalance to finish?
Even if only one anomaly is detected and a CM created, it could be hours old by the time the scaling operation and rebalance is done. The add/remove-broker rebalances can apply goal fixes as well so they may well fix the original anomaly.
I think you need the concept of freshness for an anomaly. You could just blanket reject (delete) any anomalies detected during an ongoing rebalance.
I do not think I can really comment as I do not know how this really works in CC. I raised a similar point before with regards to imbalance that cannot be fixed (e.g. because one partition causing the imbalance is too big etc.). Will it be raised again and again? Do we need to somehow detect those and ignore them? Etc. So this is a bit similar. How do you know it was already resolved or not and will it be repeated or not. 🤷
@tomncooper @ppatierno you are correct, we should ignore and delete the configmap at the same time if a rebalance is happening. I think I didn't mention it here, which is my mistake, but later in the flowchart I show that if anomalies are detected while a rebalance is happening, we will just ignore that configmap and delete it. As for unfixable anomalies which can keep appearing, there is code present in the Cruise Control SelfHealingNotifier which I am going to utilize. That method checks if the rebalance can be performed on the goal violation or not. If the goal violation cannot be fixed, then we just ignore the anomaly and no configmap would be created in that case
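To illustrate the ignore-and-delete behaviour described here, a minimal sketch of how a reconciler could drop pending anomaly ConfigMaps during a rebalance (using the fabric8 client; the `strimzi.io/cluster` label and the `goal-violation` name prefix follow this discussion, everything else is an assumption, not the final design):

```java
import io.fabric8.kubernetes.client.KubernetesClient;

public class AnomalyConfigMapCleaner {

    // Sketch: while an auto-rebalance is in progress, newly created anomaly
    // ConfigMaps are ignored and deleted so that stale anomalies don't pile
    // up and get acted on hours later.
    public static void dropPendingAnomalies(KubernetesClient client,
                                            String namespace,
                                            String clusterName) {
        client.configMaps()
              .inNamespace(namespace)
              .withLabel("strimzi.io/cluster", clusterName) // assumed label
              .list().getItems().stream()
              .filter(cm -> cm.getMetadata().getName().startsWith("goal-violation"))
              .forEach(cm -> client.configMaps().resource(cm).delete());
    }
}
```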
Ok I did another pass. I have a few questions:
- How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace. I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
- You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
- It is not clear what you mean by scale up/down auto-rebalances being queued up? I assume you mean generated `KafkaRebalance` CRs? But it is not clear.
| finalizers:
| - strimzi.io/auto-rebalancing
| spec:
| mode: skew
So it will always be `full` and there will be no new mode like `imbalance` or `skew`, right?
| #### AnomalyDetectorNotifier
|
| Cruise Control provides the `AnomalyNotifier` interface, which has multiple abstract methods on what to do if certain anomalies are detected.
| Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
Suggested change:
- Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
+ Some of those methods are: `onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()`.
I guess you don't have to use etc. here if you are just naming some of them, but I'm not a native speaker :)
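To make the mechanism concrete, here is a rough sketch of what such a notifier could look like. It is only a sketch: the `onGoalViolation()` override and the `IGNORE` result follow the behaviour discussed in this thread, the ConfigMap helper is hypothetical, and the package names and exact signatures should be verified against the Cruise Control source:

```java
// Packages/signatures as best recalled from the Cruise Control source;
// verify them against the CC version in use.
import com.linkedin.cruisecontrol.detector.notifier.AnomalyNotificationResult;
import com.linkedin.kafka.cruisecontrol.detector.GoalViolations;
import com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier;

public class AnomalyDetectorNotifier extends SelfHealingNotifier {

    @Override
    public AnomalyNotificationResult onGoalViolation(GoalViolations goalViolations) {
        // Hypothetical helper: record the violation in a goal-violation
        // ConfigMap that the cluster operator watches for.
        createGoalViolationConfigMap(goalViolations);
        // Return IGNORE so Cruise Control itself never starts a self-healing
        // fix; the Strimzi operator triggers the rebalance instead.
        return AnomalyNotificationResult.ignore();
    }

    private void createGoalViolationConfigMap(GoalViolations goalViolations) {
        // Sketch only: write the anomaly id and violated goals into a
        // ConfigMap (see the naming sketch further down in this thread).
    }
}
```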
| # ...
| ```
|
| The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
Yeah should it follow the names of other things like `<cluster-name>-goal-violation-<anomalyID>`? I guess that will also be easier to find in case you would like to search all Namespaces for these kinds of ConfigMaps.
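As an illustration of that naming idea, a hedged sketch with the fabric8 builder; the `<cluster-name>-goal-violation-<anomalyID>` name follows the comment above, while the label and data keys are assumptions:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

public class GoalViolationConfigMaps {

    // Sketch: <cluster-name>-goal-violation-<anomalyId> naming plus a cluster
    // label, so anomaly ConfigMaps from multiple Kafka clusters in the same
    // namespace can be told apart. Label and data keys are illustrative.
    public static ConfigMap create(KubernetesClient client, String namespace,
                                   String clusterName, String anomalyId,
                                   String violatedGoals) {
        ConfigMap cm = new ConfigMapBuilder()
                .withNewMetadata()
                    .withName(clusterName + "-goal-violation-" + anomalyId)
                    .withNamespace(namespace)
                    .addToLabels("strimzi.io/cluster", clusterName)
                .endMetadata()
                .addToData("anomalyId", anomalyId)
                .addToData("violatedGoals", violatedGoals)
                .build();
        return client.configMaps().inNamespace(namespace).resource(cm).create();
    }
}
```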
|
| Users cannot configure the notifier if they are utilising the auto-rebalance on imbalanced cluster.
| This is because the operator is using our custom notifier for getting alerts about goal violations.
| If the users try to override the notifier while the `skew` mode is enabled, the auto-rebalance `skew` configuration then the operator would throw errors in the auto-rebalance status field
Maybe could you re-phrase it a bit - I'm confused a bit by "the auto-rebalance `skew` configuration then the operator would throw errors". What do you mean by that?
| finalizers:
| - strimzi.io/auto-rebalancing
| spec:
| mode: skew
I think I would get confused to see a different mode name, `full`, here. We would have to explain how that maps to the `skew` or `imbalance` mode we introduced.
Chiming in as an end user - glad to see this proposal! We have been debating internally if we want to have a cronjob to issue rebalances, this is a lot better. In particular the model of using CruiseControl's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows. I see discussion on how to represent the anomalies. Any solution here is fine for us, I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else. An area that could be more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete … Thanks for putting this together, excited to see this.
@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the …
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
| It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
Suggested change:
- With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
- It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
+ In smaller clusters, anomalies can still be fixed manually. But as clusters grow, doing this becomes time-consuming or even impractical. For Strimzi users, it would be highly valuable if such anomalies could be detected and fixed automatically.
| * Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).
|
| The detected anomalies are inserted into a priority queue where comparator is based upon the priority value and the detection time.
| The smaller the priority value and detected time is, the higher priority the anomaly type has.
Suggested change:
- The smaller the priority value and detected time is, the higher priority the anomaly type has.
+ An anomaly is considered more important if it has a lower priority value and shorter detection time.
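For clarity, a small self-contained example of a queue ordered the way the quoted sentence describes (the record is an illustrative stand-in, not Cruise Control's internal type):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class AnomalyQueueExample {

    // Stand-in for a detected anomaly: a lower priority value and an earlier
    // detection time mean it is polled (handled) first.
    record DetectedAnomaly(String type, int priorityValue, long detectedAtMs) {}

    public static void main(String[] args) {
        PriorityQueue<DetectedAnomaly> queue = new PriorityQueue<>(
                Comparator.comparingInt(DetectedAnomaly::priorityValue)
                          .thenComparingLong(DetectedAnomaly::detectedAtMs));
        queue.add(new DetectedAnomaly("GOAL_VIOLATION", 2, 1_000L));
        queue.add(new DetectedAnomaly("BROKER_FAILURE", 0, 2_000L));
        // Prints BROKER_FAILURE: the lower priority value wins despite the
        // later detection time.
        System.out.println(queue.poll().type());
    }
}
```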
| They can configure auto-rebalance to enable only for their specific case i.e. setting only `skew` mode or other scaling related modes.
| Once the auto-rebalance with `skew` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
| To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
| We will create our own custom notifier named `AnomalyDetectorNotifier` to do the same.
Yeah, that would make it more flexible for future changes, so +1 for naming it in a more generic way...
| The auto-rebalance configuration for the `spec.cruiseControl.autoRebalance.template` property in the `Kafka` custom resource is provided through a `KafkaRebalance` custom resource defined as a "template".
| That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
| When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
| This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
Suggested change:
- This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
+ This is not an actual rebalance request to get an optimization proposal; it is simply where the configuration for auto-rebalancing is defined.
| That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
| When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
| This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
| The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
Suggested change:
- The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
+ The user can specify rebalancing goals and configuration in the resource.
| ```
|
| The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
| Separate configmaps would be created for every goal violation such that on completion of the rebalance we can remove the particular configmap.
Guess the operator will remove the CM instead of users, right?
| A[KafkaClusterCreator] --creates--> B[KafkaCluster]
| B -- calls --> D[KafkaAutoRebalancingReconciler.reconcile]
| D -- check for configmap with goal-violation prefix --> E{if config map present?}
| D -- if rebalance in progress --> F[ignore new configmaps and delete them]
which CMs will be deleted in that case? My understanding is that KafkaAutoRebalancingReconciler will not create new ones. What about the CMs that are used for the current rebalancing? Should these be deleted by the rebalancing itself?
| This field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
| To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
| They don't require to set up all the modes and can enable the modes they require.
| They can configure auto-rebalance to enable only for their specific case i.e. setting only `imbalance` mode or other scaling related modes.
It seems like the first sentence of the above three summarizes this well enough, we could probably remove the bottom two sentences.
| To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
| We will create our own custom notifier named `StrimziCruiseControlNotifier` to do the same.
| This notifier's job will be to update the operator regarding the goal violations so that the operator can trigger a rebalance (see section [AnomalyDetectorNotifier](./106-auto-rebalance-on-imbalanced-clusters.md#anomalydetectornotifier)).
| With this proposal, we are only going to support auto-rebalance on imbalanced cluster.
I am not sure I understand, does this mean that the operator will only trigger a partition rebalance for goal violations that don't require manual intervention?
I assume so, that's the meaning imho.
Yes, that is correct
| Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
| It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
Suggested change:
- It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
+ It would be useful for users of Strimzi to be able to have these imbalanced clusters balanced automatically.
|
| The above flow diagram depicts the self-healing process in Cruise Control.
| The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
| The notifier then decides what action to take on the anomaly whether to fix it, ignore it or delay. Cruise Control provides various notifiers to alert the users about the detected anomaly in several ways like Slack, Alerta, MS Teams etc.
Two sentences are on the same line.
| It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies.
| Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
| The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
| Detector classes have different mechanisms to detect their corresponding anomalies.
Suggested change:
- Detector classes have different mechanisms to detect their corresponding anomalies.
+ Detector classes use different mechanisms to detect their corresponding anomalies.
| Furthermore, `MetricAnomalyDetector` use metrics and `GoalViolationDetector` uses the load distribution to detect their anomalies.
| The detected anomalies can be of various types:
| * Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
| * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
Suggested change:
- * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
+ * Topic Anomaly - When one or more topics in the cluster violate user-defined properties (e.g. some partitions are too large on disk).
| * Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
| * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
| * Broker Failure - This happens when a non-empty broker crashes or leaves a cluster for a long time.
| * Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
Suggested change:
- * Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
+ * Disk Failure - This failure happens when one of the non-empty disks fails (in a Kafka cluster with JBOD disks).
| ## Motivation
|
| Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
It's also worth noting that configuring a Kafka cluster to detect and report partition imbalances in the first place also requires manual effort. Currently, users must set up and tune the anomaly detection settings themselves. One likely benefit of implementing this feature is that it would provide sensible default configurations which would help get users started with detecting partition imbalances.
| Currently, in any such scenario these issues need to be fixed manually i.e. if the cluster is imbalanced then a user might instruct Cruise Control to move the partition replicas across the brokers in order to fix the imbalance using the `KafkaRebalance` custom resource.
|
| Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
| All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
Suggested change:
- All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
+ All the `self.healing` prefixed properties are currently disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
|
| This proposal allows the users to have their cluster balanced automatically whenever the cluster gets imbalanced due to overloaded broker, CPU usage etc.
| If we were to enable the self-healing ability of Cruise Control then, in response to detected anomalies, Cruise Control would issue partition reassignments without involving the Strimzi Cluster Operator.
| This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
Since the self-healing feature of Cruise Control isn't being used as part of this proposal, would the two sentences above be better suited in a "Rejected Alternatives" section at the end of the proposal?
I would leave it here, but at the same time agree with Kyle to put the "self-healing provided by CC" as a rejected alternative for the various reasons.
| To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will the responsibility of the Strimzi Cluster Operator.
| To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
| We will be using the anomaly detection classes related to goal violations that can be addressed by a partition rebalances but not other anomaly detection classes related to goal violations that would require manual intervention like disk or broker failures.
| TThe reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
Suggested change:
- TThe reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
+ The reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
I left some comments. Some are nits, some are questions, etc. I feel like it would be great to have more clarifications on:
- How are the anomalies removed from the CM
- How exactly do we prevent repeated imbalances which cannot be fixed.
| Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
| The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
| Detector classes have different mechanisms to detect their corresponding anomalies.
| For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API.
What is Kafka Metadata API?
Yeah, what is that? AFAICS it instantiates a Kafka Admin API based client https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/KafkaBrokerFailureDetector.java#L30
|
| Whenever anomalies are detected, Cruise Control provides the ability to notify the user regarding the detected anomalies using optional notifier classes.
| The notification sent by these classes increases the visibility of the operations that are taken by Cruise Control.
| The notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property.
Is it always only one notifier? Or can there be more of them?
It's just one.
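For reference, the property takes a single class name, along the lines of the sketch below (the Strimzi notifier class name here is hypothetical):

```java
import java.util.Properties;

public class NotifierConfigExample {

    public static Properties cruiseControlProps() {
        Properties props = new Properties();
        // Only one notifier can be plugged in at a time via this property;
        // the class name below is a hypothetical Strimzi-provided notifier.
        props.setProperty("anomaly.notifier.class",
                "io.strimzi.kafka.cruisecontrol.AnomalyDetectorNotifier");
        return props;
    }
}
```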
|
| The default `NoopNotifer` always sets the notifier action as `IGNORE`, which means that the detected anomaly will be silently ignored and no notification is sent to the user.
|
| Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like Slack Notifier, Alerta Notifier etc. for notifying users regarding the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to make notifiers more efficient as per the use case.
How does something like a Slack notifier work? Does it send a message to Slack and mark the anomaly as IGNORE?
It's related to the comment I made on line 57.
There is a difference between what the notifier says to the anomaly manager, so to fix or not the anomaly, and what the notifier notifies to the user.
The FIX, CHECK and IGNORE values are for the anomaly manager to understand if the anomaly should be fixed or not.
Then the notifier itself can send notification to the user or interact with the user in general in a different way.
The Slack notifier relies on the base notifier implementation to make the decision if the anomaly should be fixed or not. At the same time then it sends a message on Slack to the user. So to answer your question it doesn't always return IGNORE but always sends a Slack message with details about the anomaly and the action taken.
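A short sketch of the split described above, where the base notifier makes the fix/ignore decision and the subclass only adds the user-facing alert (the Slack helper is hypothetical; packages and signatures should be checked against the CC source):

```java
import com.linkedin.cruisecontrol.detector.notifier.AnomalyNotificationResult;
import com.linkedin.kafka.cruisecontrol.detector.GoalViolations;
import com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier;

public class SlackStyleNotifier extends SelfHealingNotifier {

    @Override
    public AnomalyNotificationResult onGoalViolation(GoalViolations violations) {
        // The decision (FIX / CHECK / IGNORE) comes from the base notifier...
        AnomalyNotificationResult result = super.onGoalViolation(violations);
        // ...while the subclass always alerts, whatever the decision was.
        sendSlackMessage("Goal violation detected, action: " + result);
        return result;
    }

    private void sendSlackMessage(String text) {
        // Hypothetical: post the text to a Slack webhook.
    }
}
```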
| Even under normal operation, it's common for Kafka clusters to encounter problems such as partition key skew leading to an uneven partition distribution, or hardware issues like disk failures, which can degrade overall cluster's health and performance.
| Currently, in any such scenario these issues need to be fixed manually i.e. if the cluster is imbalanced then a user might instruct Cruise Control to move the partition replicas across the brokers in order to fix the imbalance using the `KafkaRebalance` custom resource.
|
| Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
I guess if they can set the option they can set it to anything including a custom notifier? Or how does Strimzi prevent the use of custom notifier today?
There is no limitation today. You are right the users can also set their own notifier, to get notification the way they prefer. At the same time, today, they can't enable the self-healing (see below, self.healing fields are forbidden). So, long story short, today the users can leverage the anomaly detection and notifier but not using the self-healing (auto-fix anomalies).
| Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
| All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
So, what is the actual consequence of this? Users can use the anomaly detection and use for example a notifier which sends them a Slack message. But no self-healing is ever done?
Exactly. I guess I answered the same doubts in the previous question.
|
| #### What happens if an unfixable goal violation happens
|
| In case, there is an unfixable goal violation like `DiskDistributionUsage` goal is violated but even after rebalance we cannot fix it since the all the disks are already completely populated, in that case the notifier would simply ignore that anomaly. This is because Cruise Control provides a check to first see if the violated goal can be fixed or not by trying a dry run internally. If the violated goal is unfixable then that goal is ignored and will not be added to the ConfigMap but the user will be prompted about the unfixable violation in the status section of the Kafka CR.
Ehh, you will need to have it in the ConfigMap in order to add it to the status. So this needs more detail.
|
| ### Auto-rebalancing execution for `imbalance` mode
|
| ### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
Suggested change:
- ### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
+ #### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
| * **RebalanceOnScaleDown**: a rebalancing related to a scale down operation is running.
| * **RebalanceOnScaleUp**: a rebalancing related to a scale up operation is running.
|
| With the new `imbalance` mode, we will be introducing a new state to the FSM called `RebalanceOnAnomalyDetection`.
Should it be `RebalanceOnImbalance` instead if the type is `imbalance`?
+1 with this suggestion
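For orientation, the FSM states under discussion as a plain enum (illustrative only; the final name of the new state, `RebalanceOnAnomalyDetection` vs. `RebalanceOnImbalance`, is still open):

```java
// Illustrative only: states from the proposal's auto-rebalancing FSM.
enum KafkaAutoRebalanceState {
    Idle,                        // no auto-rebalancing running
    RebalanceOnScaleDown,        // rebalancing tied to a scale-down is running
    RebalanceOnScaleUp,          // rebalancing tied to a scale-up is running
    RebalanceOnAnomalyDetection  // proposed new state for the imbalance mode
}
```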
| If, during an ongoing auto-rebalancing, the `KafkaRebalance` custom resource is not there anymore on the next reconciliation, it could mean the user deleted it while the operator was stopped/crashed/not running.
| In this case, the FSM will assume it as `NotReady` so falling in the last case above.
|
| ## Affected/not affected projects
Should have a backwards compatibility section as well to clarify/summarize all the compatibility issues (the custom notifier I guess being the only one).
|
| ## Affected/not affected projects
|
| This change will affect the Strimzi cluster operator and a new repository named `strimzi-notifier` will be added under the Strimzi organisation.
+1 for a separate repository for the notifier. But that should likely already be detailed earlier in the proposal.
| @@ -0,0 +1,524 @@
| # Auto-rebalance on imbalanced clusters
|
| This proposal is for adding a support for auto-rebalancing a Kafka cluster when it gets imbalanced due to unevenly distributed replicas or overloaded brokers etc.
+1
| # Auto-rebalance on imbalanced clusters
|
| This proposal is for adding a support for auto-rebalancing a Kafka cluster when it gets imbalanced due to unevenly distributed replicas or overloaded brokers etc.
| When enabled, the Strimzi operator should automatically resolve these issues detected by the Anomaly Detector Manager within Cruise Control by using a corresponding KafkaRebalance custom resource (see section [ Anomaly Detector Manager](./106-auto-rebalance-on-imbalanced-clusters.md#anomaly-detector-manager) below for a detailed description).
+1
| The smaller the priority value is, the higher priority the anomaly type has.
|
| The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.
| If the action is `FIX`, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
I don't see it mentioned before, but this is true only if `self.healing` is enabled.
It's important to highlight because the goal of this proposal is to leverage the anomaly detection part only.
The Strimzi operator won't enable the self-healing by CC at all.
| ```
|
| If the users really want to have their own way of dealing with the imbalanced clusters then they can disable auto-rebalance in `imbalance` mode and use their own notifier.
| Another way for users to use their own notifier can be to extend our notifier and use our alert method i.e `super.alert()` first in their `alert()` method implementation.
Jakub is right. They can use our notifier as a base but then it won't be used together with our "imbalance" auto-rebalancing mechanism. So maybe it doesn't make much sense to advise they can extend our notifier.
|
| #### Metrics for tracking the rebalance requests
|
| If the users want to track when the auto-rebalance happened or not, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) about when the KafkaRebalance custom resources were visible/created. These metrics also cover the KafkaRebalances which were created automatically so the users can utilize them to understand when an auto-rebalance wa triggered in their cluster
Suggested change:
- If the users want to track when the auto-rebalance happened or not, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) about when the KafkaRebalance custom resources were visible/created. These metrics also cover the KafkaRebalances which were created automatically so the users can utilize them to understand when an auto-rebalance wa triggered in their cluster
+ If the users want to track when the auto-rebalance happened or not, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) about when the `KafkaRebalance` custom resources were visible/created. These metrics also cover the `KafkaRebalance`(s) which were created automatically so the users can utilize them to understand when an auto-rebalance wa triggered in their cluster
| * from **RebalanceOnScaleDown** to:
|   * **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
|   * **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
|   * **Idle**: if a scale down operation was requested, it was executed and completed successfully/failed or a full rebalance was asked due to an anomaly but since the scale-down rebalance is done, we can ignore the anomalies assuming they are fixed by the rebalance. In case, they are not fixed, Cruise Control will detect them again and a new rebalance would be requested.
why is the state about "imbalance" missing here?
|
| This state is set since the beginning when a `Kafka` custom resource is created with the `spec.cruiseControl.autoRebalance` field.
| It is also the end state of a previous successfully completed or failed auto-rebalancing.
| In case of successful completion, once the rebalance moves to `Ready` state, we will delete the KafkaRebalance and move the empty the `anomaly_list` then update the `auto-rebalance` state to `Idle`.
Suggested change:
- In case of successful completion, once the rebalance moves to `Ready` state, we will delete the KafkaRebalance and move the empty the `anomaly_list` then update the `auto-rebalance` state to `Idle`.
+ In case of successful completion, once the rebalance moves to `Ready` state, we will delete the `KafkaRebalance` and move the empty the `anomaly_list` then update the `auto-rebalance` state to `Idle`.
This PR aims to introduce the self-healing feature in Strimzi. This proposal contains all the comments and suggestions left on the old proposal #145. This proposal aims to utilize the `auto-rebalancing` feature of Strimzi to introduce the self-healing.