MON-4414: Optional Monitoring Capability #1880

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

rexagod wants to merge 3 commits into openshift:master from rexagod:MON-4414

Member

rexagod commented Oct 28, 2025

Goes over the details for the OptionalMonitoring capability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.


          MON-4414: OEP for optional monitoring

3d1dc32

Goes over the details for the `OptionalMonitoring` capability, which
targets putting the in-cluster monitoring stack in a telemetry-only
state. Note that the metric targets are not modified under this
capability itself, but only when the telemetry collection profile is
enabled.

openshift-ci-robot added the jira/valid-reference label

openshift-ci-robot commented Oct 28, 2025 •

edited by openshift-ci bot

Loading

@rexagod: This pull request references MON-4414 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

Goes over the details for the OptionalMonitoring capability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

Contributor

openshift-ci bot commented Oct 28, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

openshift-ci bot added the do-not-merge/work-in-progress label

Contributor

openshift-ci bot commented Oct 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign simonpasquier for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

enhancements/monitoring/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rexagod changed the title ~~MON-4414: OEP for optional monitoring~~ MON-4414: Optional Monitoring Capability


          fixup! MON-4414: OEP for optional monitoring

c701918

rexagod marked this pull request as ready for review

November 3, 2025 21:03

openshift-ci bot removed the do-not-merge/work-in-progress label

openshift-ci bot requested review from moadz and simonpasquier

November 3, 2025 21:04

Contributor

simonpasquier commented Nov 5, 2025

/unassign @moadz
/cc @jan--f

openshift-ci bot requested a review from jan--f

November 5, 2025 16:21


          fixup! fixup! MON-4414: OEP for optional monitoring

07e8cb9

Contributor

openshift-ci bot commented Nov 24, 2025

@rexagod: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

rexagod commented

View reviewed changes

enhancements/monitoring/optional-monitoring-capability.md

Comment on lines +641 to +648

+. Should we downscale Prometheus to a single replica when the
+              capability is disabled?
+                  > Yes, since the monitoring footprint is reduced significantly
+                  when the capability is disabled, moving away from an HA setup
+                  to a single replica setup makes sense from a resource consumption
+                  perspective. All components across OpenShift that rely on Thanos
+                  will need to be "taught" to query the single Prometheus replica
+                  directly instead of going through Thanos Querier.

Member Author

rexagod Dec 2, 2025

(noting here that in this case, we will not support Genie; PTAL at this discussion)

simonpasquier reviewed

View reviewed changes

Contributor

simonpasquier left a comment

Good start! Let's focus on the overview and external picture first before going to the implementation details :)

enhancements/monitoring/optional-monitoring-capability.md

+              (a) by "optional" we mean components, or parts of components, that
+              are not required for [telemetry operations], and,
+              (b) the capability will be enabled by default (implicity enabled),
+              to preserve the historical UX.

Contributor

simonpasquier Dec 2, 2025

Suggested change

      
            to preserve the historical UX.
          
            for backward compatibility and because monitoring is a key feature of OpenShift.

enhancements/monitoring/optional-monitoring-capability.md

+              Based on this information, we can code certain behaviors in the
+              monitoring operator that wouldn't otherwise make sense, and would
+              help reduce the overall monitoring footprint not just across the
+              stack, but the cluster itself (since we'd be sure of the intent),

Contributor

simonpasquier Dec 2, 2025

(nit) "platform" (in the sense of OpenShift Container Platform) rather than "cluster"?

enhancements/monitoring/optional-monitoring-capability.md

+              life-cycled, monitored and remediated at scale.
+              -->
+              > As a cluster administrator, I want to be able to disable as much

Contributor

simonpasquier Dec 2, 2025

(nit)

Suggested change

      
            > As a cluster administrator, I want to be able to disable as much
          
            > As a cluster administrator, I want to disable as much

enhancements/monitoring/optional-monitoring-capability.md

+              > As a cluster administrator, I want to be able to disable as much
+              of the monitoring footprint as possible without breaking the cluster,
+              including any managed manifests, so that I can minimize the resource

Contributor

simonpasquier Dec 2, 2025

not sure what you mean by "managed manifests"? I'd suggest to remove the "without breaking the cluster" since in all cases we don't want that to happen.

enhancements/monitoring/optional-monitoring-capability.md

+              of the monitoring footprint as possible without breaking the cluster,
+              including any managed manifests, so that I can minimize the resource
+              consumption of monitoring on my cluster.
+              > As a cluster administrator, I want to be able to disable as much

Contributor

simonpasquier Dec 2, 2025

maybe merge this intent with the previous one?
can we add an "openshift dev" story about not losing the telemetry signal?

enhancements/monitoring/optional-monitoring-capability.md

+                  only, it makes sense to enable the "telemetry" collection profile
+                  when the capability is disabled. This allows us to actually regulate
+                  the metrics ingestion from exporters that would otherwise push data
+                  into Prometheus that is not telemetry-related.

Contributor

simonpasquier Dec 2, 2025

related to my question above on layered operators, I wonder what will be the consequences when monitoring=disabled and telemetry=disabled. Do they need to implement all profiles (including telemetry) to get a signal?

enhancements/monitoring/optional-monitoring-capability.md

+                  when the capability is disabled, moving away from an HA setup
+                  to a single replica setup makes sense from a resource consumption
+                  perspective. All components across OpenShift that rely on Thanos
+                  will need to be "taught" to query the single Prometheus replica

Contributor

simonpasquier Dec 2, 2025

Assuming that the data they're interested in is available. I'd prefer to declare that when monitoring is disabled, we pretend to expose no service.

enhancements/monitoring/optional-monitoring-capability.md

+              what should be the behavior?
+                  > Since telemetry is orthogonal to the capability itself, opting
+                  out of telemetry should lead to disabling all telemetry-related
+                  components (e.g., exporters, telemetry-specific `PrometheusRules`

Contributor

simonpasquier Dec 2, 2025

one could argue that exporters should be on.

enhancements/monitoring/optional-monitoring-capability.md

+                  exception, and will remain enabled to ensure that the autoscaling
+                  pipelines are not affected.
+. Should the capability be named `OptionalMonitoring` or just `Monitoring`?

Contributor

simonpasquier Dec 2, 2025

I'm still not convinced by OptionalMonitoring. What about PlatformMonitoring?

enhancements/monitoring/optional-monitoring-capability.md

+. Should we downscale Prometheus to a single replica when the
+              capability is disabled?
+                  > Yes, since the monitoring footprint is reduced significantly

Contributor

simonpasquier Dec 2, 2025

Are we ok with the fact that we may have lower availability for subscription usage?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference