Added proposal for auto-rebalance on imbalanced cluster feature in operator #161
Conversation
I did a first pass. I think this is a better proposal which is more in line with how Strimzi currently uses CC.
I think you need more detail on the interaction with the current auto-rebalancing and also a clearer description of the FSM states and their transitions. I found it hard to follow the sequence you are proposing.
For the notifier, I actually think we should stop users from using custom notifiers (we could make it conditional on the full mode being set or not). As we are creating K8s resources in response to detected anomalies, users can create alerting based on that if they need it. If users do need that then we could provide implementations of the various notifiers which extend our notifier rather than the CC one.
I think this is going in the right direction. But I think it needs to go a bit deeper:
- We need to establish our own terminology and not take over the Cruise Control one. There is not really any self-healing and most of the anomalies are not really anomalies.
- If I read this proposal right, you want to focus on when the cluster is out-of-balance. That is a great start. But perhaps that should not be called `mode: full`? Calling it `full` seems confusing - does it mean that `full` includes scale-up / scale-down? Also, I guess in the future we would add some actual self-healing to handle the broken disks or brokers. That might create additional modes probably. So maybe `mode: rebalance` or `mode: skew` or something like that would make more sense?
@scholzj good to know that you like the track we are on now :-) Regarding the "full" related naming, we were just reusing the underlying mode naming for the …
I do not think this works here. The …
| This proposal is about adding support for auto-rebalancing the Kafka cluster in case it gets imbalanced due to some issues like unevenly distributed replicas or overloaded brokers e.t.c.
| When enabled, the Strimzi operator should automatically resolve these issues detected by the Anomaly Detector Manager by running KafkaRebalance via Cruise Control using the KafkaRebalance resource.
| Anomalies are detected by Cruise Control using the anomaly detector manager (see section [ Anomaly Detector Manager](./106-auto-rebalance-on-imbalanced-clusters.md#anomaly-detector-manager) below for a detailed description).
It's a repetition of the above sentence. Maybe you can delete it and add the link to the anomaly detector manager to the previous sentence.
+1
+1
| ## Motivation
|
| Currently, any anomaly that the user is notified about would need to be fixed manually by using the `KafkaRebalance` custom resource.
How is a user notified about anomalies currently? What are you referring to?
Yeah, this isn't enabled by default. The user could configure notification but most (I assume) don't.
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
| It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
|
| ### Introduction to Self Healing
Suggested change:
- ### Introduction to Self Healing
+ ### Introduction to Self Healing in Cruise Control
You need a short intro to why this section is here: "In order to set the context for how we plan to automatically fix unbalanced Kafka clusters, the sections below go over how Cruise Control's anomaly detection and self-healing features work..."
|
| The above flow diagram depicts the self-healing process in Cruise Control.
| The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
| The configured notifiers provides alerts to the users about the detected anomaly and also returns the action that needs to be taken on the anomaly i.e. whether to fix it, ignore it or delay it.
The alert mechanism isn't out of the box. A notifier can have its own logic without generating any alerts. Even just triggering the fix without notifying anyone what's happening. So there is no assumption that a "configured" notifier provides alerts. I think this sentence should say that the notifier makes the decision about the action to take. Then CC provides some notifiers which are able to alert the user in several ways (MS Teams, Slack, etc.).
| If the users really want to have their own way of dealing with the imbalanced clusters then they can just disable auto-rebalance in `skew` mode and use their own notifier.
|
| #### What happens if some unfixable goal violation happens
| In case, there is an unfixable goal violation then the notifier would simply ignore that anomaly and prompt the user about the unfixable violation in the auto-rebalancing status section.
Still need an example here for a better understanding of how this is prompted to the user.
Having Prometheus metrics for such cases might be a reasonable default way?
| #### What happens if same anomaly is detected again while the auto-rebalance is happening
| Since the cluster operator has the knowledge regarding the detected violation, we will ignore the anomalies while the rebalancing is happening. In case the anomaly still exists after the rebalance, Cruise Control will detect it again and a new rebalance would be triggered
so it seems to assume that if a first anomaly is created, the notifier creates the corresponding ConfigMap and the CO takes care of running a rebalancing. While the rebalancing is running, CC detects other anomalies, so the notifier is creating a bunch of other ConfigMaps that the CO is ignoring. Finally, the rebalancing ends ... the CO will find all these ConfigMaps ... what's it going to do? This is where, if it takes care of them, we could:
- lose the priority of them (ConfigMaps don't have priority)
- the old anomalies could have been fixed by the previous rebalancing so it's useless handling them
I think the best option in this case would be to ignore the configmap and also delete it at the same time. I think I didn't mention it here, which is my mistake, but later in the flowchart I say that if anomalies are detected while a rebalance is happening, we will just ignore that configmap and delete it.
| * from **RebalanceOnScaleDown** to:
|   * **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
|   * **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
|   * **RebalanceOnAnomalyDetection**: if a configmap related to goal violation was detected. It will run once the queued scale down and scale up is completed
are we really sure that if a rebalance is running for scale up or scale down, after that we should take care of the anomaly? Is it possible that the anomaly was somehow fixed because of the auto-rebalancing due to scale up or down? My gut feeling is that we could avoid taking care of an anomaly, because if the problem is still in place it will be raised again by CC and then we'll deal with it. @tomncooper @scholzj wdyt?
I think this ties into your question above Paolo, what happens if a load of Anomaly CMs stack up while you are waiting for a scale up or scale down rebalance to finish?
Even if only one anomaly is detected and a CM created, it could be hours old by the time the scaling operation and rebalance is done. The add/remove-broker rebalances can apply goal fixes as well so they may well fix the original anomaly.
I think you need the concept of freshness for an anomaly. You could just blanket reject (delete) any anomalies detected during an ongoing rebalance.
I do not think I can really comment as I do not know how this really works in CC. I raised a similar point before with regards to imbalance that cannot be fixed (e.g. because one partition causing the imbalance is too big etc.). Will it be raised again and again? Do we need to somehow detect those and ignore them? Etc. So this is a bit similar. How do you know it was already resolved or not and will it be repeated or not. 🤷
@tomncooper @ppatierno you are correct, we should ignore and delete the configmap at the same time if a rebalance is happening. I think I didn't mention it here, which is my mistake, but later in the flowchart I show that if anomalies are detected while a rebalance is happening, we will just ignore that configmap and delete it. As for unfixable anomalies which can keep appearing, there is code present in the Cruise Control SelfHealingNotifier which I am going to utilize. That method checks if the rebalance can be performed on the goal violation or not. If the goal violation cannot be fixed, then we just ignore the anomaly and no configmap would be created in that case
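To illustrate the ignore-and-delete behaviour described here, a minimal sketch of how a reconciler could drop pending anomaly ConfigMaps during a rebalance (using the fabric8 client; the `strimzi.io/cluster` label and the `goal-violation` name prefix follow this discussion, everything else is an assumption, not the final design):

```java
import io.fabric8.kubernetes.client.KubernetesClient;

public class AnomalyConfigMapCleaner {

    // Sketch: while an auto-rebalance is in progress, newly created anomaly
    // ConfigMaps are ignored and deleted so that stale anomalies don't pile
    // up and get acted on hours later.
    public static void dropPendingAnomalies(KubernetesClient client,
                                            String namespace,
                                            String clusterName) {
        client.configMaps()
              .inNamespace(namespace)
              .withLabel("strimzi.io/cluster", clusterName) // assumed label
              .list().getItems().stream()
              .filter(cm -> cm.getMetadata().getName().startsWith("goal-violation"))
              .forEach(cm -> client.configMaps().resource(cm).delete());
    }
}
```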
Ok I did another pass. I have a few questions:
- How are you going to distinguish anomaly CMs from different Kafka clusters in the same namespace. I know it is not recommended, but users do deploy multiple Kafka clusters in the same NS.
- You need to deal with GC'ing all these anomaly CMs in the case where a rebalance is ongoing. Do you delete them? Do you have some kind of timeout based on the detection interval?
- It is not clear what you mean by scale up/down auto-rebalances being queued up? I assume you mean generated `KafkaRebalance` CRs? But it is not clear.
| finalizers:
| - strimzi.io/auto-rebalancing
| spec:
| mode: skew
So it will always be `full` and there will be no new mode like `imbalance` or `skew`, right?
| #### AnomalyDetectorNotifier
|
| Cruise Control provides the `AnomalyNotifier` interface, which has multiple abstract methods on what to do if certain anomalies are detected.
| Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
Suggested change:
- Some of those methods are:`onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure`, `alert()` etc.
+ Some of those methods are: `onGoalViolation()`, `onBrokerFailure()`, `onDiskFailure()`, `alert()`.
I guess you don't have to use etc. here if you are just naming some of them, but I'm not a native speaker :)
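To make the mechanism concrete, here is a rough sketch of what such a notifier could look like. It is only a sketch: the `onGoalViolation()` override and the `IGNORE` result follow the behaviour discussed in this thread, the ConfigMap helper is hypothetical, and the package names and exact signatures should be verified against the Cruise Control source:

```java
// Packages/signatures as best recalled from the Cruise Control source;
// verify them against the CC version in use.
import com.linkedin.cruisecontrol.detector.notifier.AnomalyNotificationResult;
import com.linkedin.kafka.cruisecontrol.detector.GoalViolations;
import com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier;

public class AnomalyDetectorNotifier extends SelfHealingNotifier {

    @Override
    public AnomalyNotificationResult onGoalViolation(GoalViolations goalViolations) {
        // Hypothetical helper: record the violation in a goal-violation
        // ConfigMap that the cluster operator watches for.
        createGoalViolationConfigMap(goalViolations);
        // Return IGNORE so Cruise Control itself never starts a self-healing
        // fix; the Strimzi operator triggers the rebalance instead.
        return AnomalyNotificationResult.ignore();
    }

    private void createGoalViolationConfigMap(GoalViolations goalViolations) {
        // Sketch only: write the anomaly id and violated goals into a
        // ConfigMap (see the naming sketch further down in this thread).
    }
}
```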
| # ...
| ```
|
| The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
Yeah should it follow the names of other things like `<cluster-name>-goal-violation-<anomalyID>`? I guess that will also be easier to find in case you would like to search all Namespaces for these kinds of ConfigMaps.
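As an illustration of that naming idea, a hedged sketch with the fabric8 builder; the `<cluster-name>-goal-violation-<anomalyID>` name follows the comment above, while the label and data keys are assumptions:

```java
import io.fabric8.kubernetes.api.model.ConfigMap;
import io.fabric8.kubernetes.api.model.ConfigMapBuilder;
import io.fabric8.kubernetes.client.KubernetesClient;

public class GoalViolationConfigMaps {

    // Sketch: <cluster-name>-goal-violation-<anomalyId> naming plus a cluster
    // label, so anomaly ConfigMaps from multiple Kafka clusters in the same
    // namespace can be told apart. Label and data keys are illustrative.
    public static ConfigMap create(KubernetesClient client, String namespace,
                                   String clusterName, String anomalyId,
                                   String violatedGoals) {
        ConfigMap cm = new ConfigMapBuilder()
                .withNewMetadata()
                    .withName(clusterName + "-goal-violation-" + anomalyId)
                    .withNamespace(namespace)
                    .addToLabels("strimzi.io/cluster", clusterName)
                .endMetadata()
                .addToData("anomalyId", anomalyId)
                .addToData("violatedGoals", violatedGoals)
                .build();
        return client.configMaps().inNamespace(namespace).resource(cm).create();
    }
}
```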
|
| Users cannot configure the notifier if they are utilising the auto-rebalance on imbalanced cluster.
| This is because the operator is using our custom notifier for getting alerts about goal violations.
| If the users try to override the notifier while the `skew` mode is enabled, the auto-rebalance `skew` configuration then the operator would throw errors in the auto-rebalance status field
Maybe could you re-phrase it a bit - I'm confused a bit by "the auto-rebalance `skew` configuration then the operator would throw errors". What do you mean by that?
| finalizers:
| - strimzi.io/auto-rebalancing
| spec:
| mode: skew
I think I would get confused to see a different mode name, `full`, here. We would have to explain how that maps to the `skew` or `imbalance` mode we introduced.
Chiming in as an end user - glad to see this proposal! We have been debating internally if we want to have a cronjob to issue rebalances, this is a lot better. In particular the model of using CruiseControl's anomaly detection while issuing the rebalances through KafkaRebalance CRs seems like it will fit perfectly into our workflows. I see discussion on how to represent the anomalies. Any solution here is fine for us, I envision we will mostly be interacting with the KafkaRebalance CR and not much with anything else. An area that could be more explicit is the right way to stop all rebalances and not issue any more. Rebalance operations often saturate bandwidth, either disk or network, and cause major latency during producing. We often find ourselves needing to cancel them as we scale and learn our limits. It looks like we might be able to delete … Thanks for putting this together, excited to see this.
@nickgarvey Thanks for the feedback! Usually you are able to stop the current rebalancing by applying the …
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
| It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
Suggested change:
- With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the anomalies on your own.
- It would be useful for users of Strimzi to be able to have these anomalies fixed automatically whenever they are detected.
+ In smaller clusters, anomalies can still be fixed manually. But as clusters grow, doing this becomes time-consuming or even impractical. For Strimzi users, it would be highly valuable if such anomalies could be detected and fixed automatically.
| * Metric anomaly - This failure happens if metrics collected by Cruise Control have some anomaly in their value (e.g. a sudden rise in the log flush time metrics).
|
| The detected anomalies are inserted into a priority queue where comparator is based upon the priority value and the detection time.
| The smaller the priority value and detected time is, the higher priority the anomaly type has.
Suggested change:
- The smaller the priority value and detected time is, the higher priority the anomaly type has.
+ An anomaly is considered more important if it has a lower priority value and shorter detection time.
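For clarity, a small self-contained example of a queue ordered the way the quoted sentence describes (the record is an illustrative stand-in, not Cruise Control's internal type):

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class AnomalyQueueExample {

    // Stand-in for a detected anomaly: a lower priority value and an earlier
    // detection time mean it is polled (handled) first.
    record DetectedAnomaly(String type, int priorityValue, long detectedAtMs) {}

    public static void main(String[] args) {
        PriorityQueue<DetectedAnomaly> queue = new PriorityQueue<>(
                Comparator.comparingInt(DetectedAnomaly::priorityValue)
                          .thenComparingLong(DetectedAnomaly::detectedAtMs));
        queue.add(new DetectedAnomaly("GOAL_VIOLATION", 2, 1_000L));
        queue.add(new DetectedAnomaly("BROKER_FAILURE", 0, 2_000L));
        // Prints BROKER_FAILURE: the lower priority value wins despite the
        // later detection time.
        System.out.println(queue.poll().type());
    }
}
```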
| They can configure auto-rebalance to enable only for their specific case i.e. setting only `skew` mode or other scaling related modes.
| Once the auto-rebalance with `skew` mode is enabled, the operator will be ready to trigger auto-rebalance whenever the cluster becomes imbalanced.
| To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
| We will create our own custom notifier named `AnomalyDetectorNotifier` to do the same.
Yeah, that would make it more flexible for future changes, so +1 for naming it in a more generic way...
| The auto-rebalance configuration for the `spec.cruiseControl.autoRebalance.template` property in the `Kafka` custom resource is provided through a `KafkaRebalance` custom resource defined as a "template".
| That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
| When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
| This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
Suggested change:
- This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
+ This is not an actual rebalance request to get an optimization proposal; it is simply where the configuration for auto-rebalancing is defined.
| That is a `KafkaRebalance` custom resource with the `strimzi.io/rebalance-template: true` annotation set.
| When it is created, the `KafkaRebalanceAssemblyOperator` doesn't run any rebalancing.
| This is because it doesn't represent an "actual" rebalance request to get an optimization proposal, but it's just the place where configuration related to auto-rebalancing is defined.
| The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
Suggested change:
- The user can specify rebalancing goals and other configuration for rebalancing, within the resource.
+ The user can specify rebalancing goals and configuration in the resource.
| ```
|
| The operator will then check if any configmap with prefix `goal-violation` is created or not, if it finds one created then operator will trigger the rebalance.
| Separate configmaps would be created for every goal violation such that on completion of the rebalance we can remove the particular configmap.
Guess the operator will remove the CM instead of users, right?
| A[KafkaClusterCreator] --creates--> B[KafkaCluster]
| B -- calls --> D[KafkaAutoRebalancingReconciler.reconcile]
| D -- check for configmap with goal-violation prefix --> E{if config map present?}
| D -- if rebalance in progress --> F[ignore new configmaps and delete them]
which CMs will be deleted in that case? My understanding is that KafkaAutoRebalancingReconciler will not create new ones. What about the CMs that are used for the current rebalancing? Should these be deleted by the rebalancing itself?
| This field is optional and if not specified, the auto-rebalancing runs with the default Cruise Control configuration (i.e. the same used for unmodified manual `KafkaRebalance` invocations).
| To provide users more flexibility, they only have to configure the auto-rebalance modes they wish to customise.
| They don't require to set up all the modes and can enable the modes they require.
| They can configure auto-rebalance to enable only for their specific case i.e. setting only `imbalance` mode or other scaling related modes.
It seems like the first sentence of the above three summarizes this well enough, we could probably remove the bottom two sentences.
| To trigger the auto-rebalance, the operator must know that the cluster is imbalanced due to some goal violation anomaly.
| We will create our own custom notifier named `StrimziCruiseControlNotifier` to do the same.
| This notifier's job will be to update the operator regarding the goal violations so that the operator can trigger a rebalance (see section [AnomalyDetectorNotifier](./106-auto-rebalance-on-imbalanced-clusters.md#anomalydetectornotifier)).
| With this proposal, we are only going to support auto-rebalance on imbalanced cluster.
I am not sure I understand, does this mean that the operator will only trigger a partition rebalance for goal violations that don't require manual intervention?
I assume so, that's the meaning imho.
Yes, that is correct
| Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
| It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
Suggested change:
- It would be useful for users of Strimzi to be able to have these imbalanced cluster balanced automatically.
+ It would be useful for users of Strimzi to be able to have these imbalanced clusters balanced automatically.
|
| The above flow diagram depicts the self-healing process in Cruise Control.
| The anomaly detector manager detects an anomaly (using the detector classes) and forwards it to the notifier.
| The notifier then decides what action to take on the anomaly whether to fix it, ignore it or delay. Cruise Control provides various notifiers to alert the users about the detected anomaly in several ways like Slack, Alerta, MS Teams etc.
Two sentences are on the same line.
| It acts as a coordinator between the detector classes and the classes which will handle resolving the anomalies.
| Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
| The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
| Detector classes have different mechanisms to detect their corresponding anomalies.
Suggested change:
- Detector classes have different mechanisms to detect their corresponding anomalies.
+ Detector classes use different mechanisms to detect their corresponding anomalies.
| Furthermore, `MetricAnomalyDetector` use metrics and `GoalViolationDetector` uses the load distribution to detect their anomalies.
| The detected anomalies can be of various types:
| * Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
| * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
Suggested change:
- * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
+ * Topic Anomaly - When one or more topics in the cluster violate user-defined properties (e.g. some partitions are too large on disk).
| * Goal Violation - This happens if certain [optimization goals](https://strimzi.io/docs/operators/in-development/deploying#optimization_goals) are violated (e.g. DiskUsageDistributionGoal etc.). These goals can be configured through the `self.healing.goals` option in Cruise Control configuration. However, this option is forbidden in the `spec.cruiseControl.config` section of the `Kafka` CR.
| * Topic Anomaly - Where one or more topics in cluster violates user-defined properties (e.g. some partitions are too large in disk).
| * Broker Failure - This happens when a non-empty broker crashes or leaves a cluster for a long time.
| * Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
Suggested change:
- * Disk Failure - This failure happens if one of the non-empty disks fails (related to a Kafka Cluster with JBOD disks).
+ * Disk Failure - This failure happens when one of the non-empty disks fails (in a Kafka cluster with JBOD disks).
| ## Motivation
|
| Currently, if the cluster is imbalanced, the user would need to manually rebalance the cluster by using the `KafkaRebalance` custom resource.
| With smaller clusters, it is feasible to fix things manually. However, for larger ones it can be very time-consuming, or just not feasible, to fix all the imbalances on your own.
It's also worth noting that configuring a Kafka cluster to detect and report partition imbalances in the first place also requires manual effort. Currently, users must set up and tune the anomaly detection settings themselves. One likely benefit of implementing this feature is that it would provide sensible default configurations which would help get users started with detecting partition imbalances.
| Currently, in any such scenario these issues need to be fixed manually i.e. if the cluster is imbalanced then a user might instruct Cruise Control to move the partition replicas across the brokers in order to fix the imbalance using the `KafkaRebalance` custom resource.
|
| Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
| All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
Suggested change:
- All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
+ All the `self.healing` prefixed properties are currently disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
|
| This proposal allows the users to have their cluster balanced automatically whenever the cluster gets imbalanced due to overloaded broker, CPU usage etc.
| If we were to enable the self-healing ability of Cruise Control then, in response to detected anomalies, Cruise Control would issue partition reassignments without involving the Strimzi Cluster Operator.
| This could cause potential conflicts with other administration operations and is the primary reason self-healing has been disabled until now.
Since the self-healing feature of Cruise Control isn't being used as part of this proposal, would the two sentences above be better suited in a "Rejected Alternatives" section at the end of the proposal?
I would leave it here, but at the same time agree with Kyle to put the "self-healing provided by CC" as a rejected alternative for the various reasons.
| To resolve this issue, we will only make use of Cruise Control's anomaly detection ability, the triggering of the partition reassignments (rebalance) will the responsibility of the Strimzi Cluster Operator.
| To enable this, we will use approach based on the existing auto-rebalance for scaling feature (see the [documentation](https://strimzi.io/docs/operators/latest/deploying#proc-automating-rebalances-str) for more details).
| We will be using the anomaly detection classes related to goal violations that can be addressed by a partition rebalances but not other anomaly detection classes related to goal violations that would require manual intervention like disk or broker failures.
| TThe reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
Suggested change:
- TThe reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
+ The reason behind thus is that disk failures and broker failures can cannot be fixed by rebalancing alone, they require manual intervention.
I left some comments. Some are nits, some are questions, etc. I feel like it would be great to have more clarifications on:
- How are the anomalies removed from the CM
- How exactly do we prevent repeated imbalances which cannot be fixed.
| Various detector classes like `GoalViolationDetector`, `DiskFailureDetector`, `KafkaBrokerFailureDetector` etc. are used for the anomaly detection, which runs periodically to check if the cluster has their corresponding anomalies or not.
| The frequency of this check can be changed via the `anomaly.detection.interval.ms` configuration.
| Detector classes have different mechanisms to detect their corresponding anomalies.
| For example, `KafkaBrokerFailureDetector` utilises Kafka Metadata API whereas `DiskFailureDetector` and `TopicAnomalyDetector` utilises Kafka Admin API.
What is Kafka Metadata API?
Yeah, what is that? AFAICS it instantiates a Kafka Admin API based client https://github.com/linkedin/cruise-control/blob/main/cruise-control/src/main/java/com/linkedin/kafka/cruisecontrol/detector/KafkaBrokerFailureDetector.java#L30
|
| Whenever anomalies are detected, Cruise Control provides the ability to notify the user regarding the detected anomalies using optional notifier classes.
| The notification sent by these classes increases the visibility of the operations that are taken by Cruise Control.
| The notifier class used by Cruise Control is configurable and custom notifiers can be used by setting the `anomaly.notifier.class` property.
Is it always only one notifier? Or can there be more of them?
It's just one.
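For reference, the property takes a single class name, along the lines of the sketch below (the Strimzi notifier class name here is hypothetical):

```java
import java.util.Properties;

public class NotifierConfigExample {

    public static Properties cruiseControlProps() {
        Properties props = new Properties();
        // Only one notifier can be plugged in at a time via this property;
        // the class name below is a hypothetical Strimzi-provided notifier.
        props.setProperty("anomaly.notifier.class",
                "io.strimzi.kafka.cruisecontrol.AnomalyDetectorNotifier");
        return props;
    }
}
```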
|
| The default `NoopNotifer` always sets the notifier action as `IGNORE`, which means that the detected anomaly will be silently ignored and no notification is sent to the user.
|
| Cruise Control also provides [custom notifiers](https://github.com/linkedin/cruise-control/wiki/Configure-notifications) like Slack Notifier, Alerta Notifier etc. for notifying users regarding the anomalies. There are multiple other [self-healing notifier](https://github.com/linkedin/cruise-control/wiki/Configurations#selfhealingnotifier-configurations) related configurations you can use to make notifiers more efficient as per the use case.
How does something like a Slack notifier work? Does it send a message to Slack and mark the anomaly as IGNORE?
It's related to the comment I made on line 57.
There is a difference between what the notifier says to the anomaly manager, so to fix or not the anomaly, and what the notifier notifies to the user.
The FIX, CHECK and IGNORE values are for the anomaly manager to understand if the anomaly should be fixed or not.
Then the notifier itself can send notification to the user or interact with the user in general in a different way.
The Slack notifier relies on the base notifier implementation to make the decision if the anomaly should be fixed or not. At the same time then it sends a message on Slack to the user. So to answer your question it doesn't always return IGNORE but always sends a Slack message with details about the anomaly and the action taken.
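A short sketch of the split described above, where the base notifier makes the fix/ignore decision and the subclass only adds the user-facing alert (the Slack helper is hypothetical; packages and signatures should be checked against the CC source):

```java
import com.linkedin.cruisecontrol.detector.notifier.AnomalyNotificationResult;
import com.linkedin.kafka.cruisecontrol.detector.GoalViolations;
import com.linkedin.kafka.cruisecontrol.detector.notifier.SelfHealingNotifier;

public class SlackStyleNotifier extends SelfHealingNotifier {

    @Override
    public AnomalyNotificationResult onGoalViolation(GoalViolations violations) {
        // The decision (FIX / CHECK / IGNORE) comes from the base notifier...
        AnomalyNotificationResult result = super.onGoalViolation(violations);
        // ...while the subclass always alerts, whatever the decision was.
        sendSlackMessage("Goal violation detected, action: " + result);
        return result;
    }

    private void sendSlackMessage(String text) {
        // Hypothetical: post the text to a Slack webhook.
    }
}
```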
| Even under normal operation, it's common for Kafka clusters to encounter problems such as partition key skew leading to an uneven partition distribution, or hardware issues like disk failures, which can degrade overall cluster's health and performance.
| Currently, in any such scenario these issues need to be fixed manually i.e. if the cluster is imbalanced then a user might instruct Cruise Control to move the partition replicas across the brokers in order to fix the imbalance using the `KafkaRebalance` custom resource.
|
| Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
I guess if they can set the option they can set it to anything including a custom notifier? Or how does Strimzi prevent the use of custom notifier today?
There is no limitation today. You are right the users can also set their own notifier, to get notification the way they prefer. At the same time, today, they can't enable the self-healing (see below, self.healing fields are forbidden). So, long story short, today the users can leverage the anomaly detection and notifier but not using the self-healing (auto-fix anomalies).
| Users can currently enable anomaly detection and can also [set](https://strimzi.io/docs/operators/latest/full/deploying.html#setting_up_alerts_for_anomaly_detection) the notifier to one of those included with Cruise Control (`SelfHealingNotifier`, `AlertaSelfHealingNotifier`, `SlackSelfHealingNotifier` etc.).
| All the `self.healing` prefixed properties were disabled in Strimzi's Cruise Control integration because, initially, it was not clear how self-healing would act if pods were rolled in middle of rebalances or how Strimzi triggered manual rebalances should interact with Cruise Control triggered self-healing ones.
So, what is the actual consequence of this? Users can use the anomaly detection and use for example a notifier which sends them a Slack message. But no self-healing is ever done?
Exactly. I guess I answered the same doubts in the previous question.
|
| #### What happens if an unfixable goal violation happens
|
| In case, there is an unfixable goal violation like `DiskDistributionUsage` goal is violated but even after rebalance we cannot fix it since the all the disks are already completely populated, in that case the notifier would simply ignore that anomaly. This is because Cruise Control provides a check to first see if the violated goal can be fixed or not by trying a dry run internally. If the violated goal is unfixable then that goal is ignored and will not be added to the ConfigMap but the user will be prompted about the unfixable violation in the status section of the Kafka CR.
Ehh, you will need to have it in the ConfigMap in order to add it to the status. So this needs more detail.
|
| ### Auto-rebalancing execution for `imbalance` mode
|
| ### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
Suggested change:
- ### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
+ #### Auto-rebalancing Finite State Machine (FSM) for `imbalance` mode
| * **RebalanceOnScaleDown**: a rebalancing related to a scale down operation is running.
| * **RebalanceOnScaleUp**: a rebalancing related to a scale up operation is running.
|
| With the new `imbalance` mode, we will be introducing a new state to the FSM called `RebalanceOnAnomalyDetection`.
Should it be `RebalanceOnImbalance` instead if the type is `imbalance`?
+1 with this suggestion
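For orientation, the FSM states under discussion as a plain enum (illustrative only; the final name of the new state, `RebalanceOnAnomalyDetection` vs. `RebalanceOnImbalance`, is still open):

```java
// Illustrative only: states from the proposal's auto-rebalancing FSM.
enum KafkaAutoRebalanceState {
    Idle,                        // no auto-rebalancing running
    RebalanceOnScaleDown,        // rebalancing tied to a scale-down is running
    RebalanceOnScaleUp,          // rebalancing tied to a scale-up is running
    RebalanceOnAnomalyDetection  // proposed new state for the imbalance mode
}
```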
| If, during an ongoing auto-rebalancing, the `KafkaRebalance` custom resource is not there anymore on the next reconciliation, it could mean the user deleted it while the operator was stopped/crashed/not running.
| In this case, the FSM will assume it as `NotReady` so falling in the last case above.
|
| ## Affected/not affected projects
Should have a backwards compatibility section as well to clarify/summarize all the compatibility issues (the custom notifier I guess being the only one).
|
| ## Affected/not affected projects
|
| This change will affect the Strimzi cluster operator and a new repository named `strimzi-notifier` will be added under the Strimzi organisation.
+1 for a separate repository for the notifier. But that should likely already be detailed earlier in the proposal.
| @@ -0,0 +1,524 @@
| # Auto-rebalance on imbalanced clusters
|
| This proposal is for adding a support for auto-rebalancing a Kafka cluster when it gets imbalanced due to unevenly distributed replicas or overloaded brokers etc.
+1
| # Auto-rebalance on imbalanced clusters
|
| This proposal is for adding a support for auto-rebalancing a Kafka cluster when it gets imbalanced due to unevenly distributed replicas or overloaded brokers etc.
| When enabled, the Strimzi operator should automatically resolve these issues detected by the Anomaly Detector Manager within Cruise Control by using a corresponding KafkaRebalance custom resource (see section [ Anomaly Detector Manager](./106-auto-rebalance-on-imbalanced-clusters.md#anomaly-detector-manager) below for a detailed description).
+1
| The smaller the priority value is, the higher priority the anomaly type has.
|
| The anomaly detector manager calls the notifier to get an action regarding whether the anomaly should be fixed, delayed, or ignored.
| If the action is `FIX`, then the anomaly detector manager calls the classes that are required to resolve the anomaly.
I don't see it mentioned before, but this is true only if `self.healing` is enabled.
It's important to highlight because the goal of this proposal is to leverage the anomaly detection part only.
The Strimzi operator won't enable the self-healing by CC at all.
| ```
|
| If the users really want to have their own way of dealing with the imbalanced clusters then they can disable auto-rebalance in `imbalance` mode and use their own notifier.
| Another way for users to use their own notifier can be to extend our notifier and use our alert method i.e `super.alert()` first in their `alert()` method implementation.
Jakub is right. They can use our notifier as a base but then it won't be used together with our "imbalance" auto-rebalancing mechanism. So maybe it doesn't make much sense to advise they can extend our notifier.
|
| #### Metrics for tracking the rebalance requests
|
| If the users want to track when the auto-rebalance happened or not, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) about when the KafkaRebalance custom resources were visible/created. These metrics also cover the KafkaRebalances which were created automatically so the users can utilize them to understand when an auto-rebalance wa triggered in their cluster
Suggested change:
- If the users want to track when the auto-rebalance happened or not, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) about when the KafkaRebalance custom resources were visible/created. These metrics also cover the KafkaRebalances which were created automatically so the users can utilize them to understand when an auto-rebalance wa triggered in their cluster
+ If the users want to track when the auto-rebalance happened or not, they can access the Strimzi [metrics](https://github.com/strimzi/strimzi-kafka-operator/blob/main/examples/metrics/grafana-dashboards/strimzi-operators.json#L712) about when the `KafkaRebalance` custom resources were visible/created. These metrics also cover the `KafkaRebalance`(s) which were created automatically so the users can utilize them to understand when an auto-rebalance wa triggered in their cluster
| * from **RebalanceOnScaleDown** to:
|   * **RebalanceOnScaleDown**: if a rebalancing on scale down is still running or another one was requested while the first one ended.
|   * **RebalanceOnScaleUp**: if a scale down operation was requested together with a scale up and, because they run sequentially, the rebalance on scale down had the precedence, was executed first and completed successfully. We can now move on with rebalancing for the scale up.
|   * **Idle**: if a scale down operation was requested, it was executed and completed successfully/failed or a full rebalance was asked due to an anomaly but since the scale-down rebalance is done, we can ignore the anomalies assuming they are fixed by the rebalance. In case, they are not fixed, Cruise Control will detect them again and a new rebalance would be requested.
why is the state about "imbalance" missing here?
|
| This state is set since the beginning when a `Kafka` custom resource is created with the `spec.cruiseControl.autoRebalance` field.
| It is also the end state of a previous successfully completed or failed auto-rebalancing.
| In case of successful completion, once the rebalance moves to `Ready` state, we will delete the KafkaRebalance and move the empty the `anomaly_list` then update the `auto-rebalance` state to `Idle`.
Suggested change:
- In case of successful completion, once the rebalance moves to `Ready` state, we will delete the KafkaRebalance and move the empty the `anomaly_list` then update the `auto-rebalance` state to `Idle`.
+ In case of successful completion, once the rebalance moves to `Ready` state, we will delete the `KafkaRebalance` and move the empty the `anomaly_list` then update the `auto-rebalance` state to `Idle`.
This PR aims to introduce the self-healing feature in Strimzi. This proposal contains all the comments and suggestions left on the old proposal #145. This proposal aims to utilize the `auto-rebalancing` feature of Strimzi to introduce the self-healing.