-
Notifications
You must be signed in to change notification settings - Fork 526
MON-4414: Optional Monitoring Capability #1880
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
Goes over the details for the `OptionalMonitoring` capability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.
|
@rexagod: This pull request references MON-4414 which is a valid jira issue. Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
|
Skipping CI for Draft Pull Request. |
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
@rexagod: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
| 4. Should we downscale Prometheus to a single replica when the | ||
| capability is disabled? | ||
| > Yes, since the monitoring footprint is reduced significantly | ||
| when the capability is disabled, moving away from an HA setup | ||
| to a single replica setup makes sense from a resource consumption | ||
| perspective. All components across OpenShift that rely on Thanos | ||
| will need to be "taught" to query the single Prometheus replica | ||
| directly instead of going through Thanos Querier. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(noting here that in this case, we will not support Genie; PTAL at this discussion)
simonpasquier
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start! Let's focus on the overview and external picture first before going to the implementation details :)
| (a) by "optional" we mean components, or parts of components, that | ||
| are not required for [telemetry operations], and, | ||
| (b) the capability will be enabled by default (implicity enabled), | ||
| to preserve the historical UX. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| to preserve the historical UX. | |
| for backward compatibility and because monitoring is a key feature of OpenShift. |
| Based on this information, we can code certain behaviors in the | ||
| monitoring operator that wouldn't otherwise make sense, and would | ||
| help reduce the overall monitoring footprint not just across the | ||
| stack, but the cluster itself (since we'd be sure of the intent), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit) "platform" (in the sense of OpenShift Container Platform) rather than "cluster"?
| life-cycled, monitored and remediated at scale. | ||
| --> | ||
|
|
||
| > As a cluster administrator, I want to be able to disable as much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(nit)
| > As a cluster administrator, I want to be able to disable as much | |
| > As a cluster administrator, I want to disable as much |
|
|
||
| > As a cluster administrator, I want to be able to disable as much | ||
| of the monitoring footprint as possible without breaking the cluster, | ||
| including any managed manifests, so that I can minimize the resource |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure what you mean by "managed manifests"? I'd suggest to remove the "without breaking the cluster" since in all cases we don't want that to happen.
| of the monitoring footprint as possible without breaking the cluster, | ||
| including any managed manifests, so that I can minimize the resource | ||
| consumption of monitoring on my cluster. | ||
| > As a cluster administrator, I want to be able to disable as much |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe merge this intent with the previous one?
can we add an "openshift dev" story about not losing the telemetry signal?
| only, it makes sense to enable the "telemetry" collection profile | ||
| when the capability is disabled. This allows us to actually regulate | ||
| the metrics ingestion from exporters that would otherwise push data | ||
| into Prometheus that is not telemetry-related. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related to my question above on layered operators, I wonder what will be the consequences when monitoring=disabled and telemetry=disabled. Do they need to implement all profiles (including telemetry) to get a signal?
| when the capability is disabled, moving away from an HA setup | ||
| to a single replica setup makes sense from a resource consumption | ||
| perspective. All components across OpenShift that rely on Thanos | ||
| will need to be "taught" to query the single Prometheus replica |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assuming that the data they're interested in is available. I'd prefer to declare that when monitoring is disabled, we pretend to expose no service.
| what should be the behavior? | ||
| > Since telemetry is orthogonal to the capability itself, opting | ||
| out of telemetry should lead to disabling all telemetry-related | ||
| components (e.g., exporters, telemetry-specific `PrometheusRules` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
one could argue that exporters should be on.
| exception, and will remain enabled to ensure that the autoscaling | ||
| pipelines are not affected. | ||
|
|
||
| 6. Should the capability be named `OptionalMonitoring` or just `Monitoring`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm still not convinced by OptionalMonitoring. What about PlatformMonitoring?
|
|
||
| 4. Should we downscale Prometheus to a single replica when the | ||
| capability is disabled? | ||
| > Yes, since the monitoring footprint is reduced significantly |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we ok with the fact that we may have lower availability for subscription usage?
Goes over the details for the
OptionalMonitoringcapability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.