Skip to content

Conversation

@rexagod
Copy link
Member

@rexagod rexagod commented Oct 28, 2025

Goes over the details for the OptionalMonitoring capability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.

Goes over the details for the `OptionalMonitoring` capability, which
targets putting the in-cluster monitoring stack in a telemetry-only
state. Note that the metric targets are not modified under this
capability itself, but only when the telemetry collection profile is
enabled.
@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Oct 28, 2025
@openshift-ci-robot
Copy link

openshift-ci-robot commented Oct 28, 2025

@rexagod: This pull request references MON-4414 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.21.0" version, but no target version was set.

In response to this:

Goes over the details for the OptionalMonitoring capability, which targets putting the in-cluster monitoring stack in a telemetry-only state. Note that the metric targets are not modified under this capability itself, but only when the telemetry collection profile is enabled.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 28, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Oct 28, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 28, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign simonpasquier for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rexagod rexagod changed the title MON-4414: OEP for optional monitoring MON-4414: Optional Monitoring Capability Nov 3, 2025
@rexagod rexagod marked this pull request as ready for review November 3, 2025 21:03
@openshift-ci openshift-ci bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Nov 3, 2025
@openshift-ci openshift-ci bot requested review from moadz and simonpasquier November 3, 2025 21:04
@simonpasquier
Copy link
Contributor

/unassign @moadz
/cc @jan--f

@openshift-ci openshift-ci bot requested a review from jan--f November 5, 2025 16:21
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Nov 24, 2025

@rexagod: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Comment on lines +641 to +648
4. Should we downscale Prometheus to a single replica when the
capability is disabled?
> Yes, since the monitoring footprint is reduced significantly
when the capability is disabled, moving away from an HA setup
to a single replica setup makes sense from a resource consumption
perspective. All components across OpenShift that rely on Thanos
will need to be "taught" to query the single Prometheus replica
directly instead of going through Thanos Querier.
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(noting here that in this case, we will not support Genie; PTAL at this discussion)

Copy link
Contributor

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good start! Let's focus on the overview and external picture first before going to the implementation details :)

(a) by "optional" we mean components, or parts of components, that
are not required for [telemetry operations], and,
(b) the capability will be enabled by default (implicity enabled),
to preserve the historical UX.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
to preserve the historical UX.
for backward compatibility and because monitoring is a key feature of OpenShift.

Based on this information, we can code certain behaviors in the
monitoring operator that wouldn't otherwise make sense, and would
help reduce the overall monitoring footprint not just across the
stack, but the cluster itself (since we'd be sure of the intent),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit) "platform" (in the sense of OpenShift Container Platform) rather than "cluster"?

life-cycled, monitored and remediated at scale.
-->

> As a cluster administrator, I want to be able to disable as much
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(nit)

Suggested change
> As a cluster administrator, I want to be able to disable as much
> As a cluster administrator, I want to disable as much


> As a cluster administrator, I want to be able to disable as much
of the monitoring footprint as possible without breaking the cluster,
including any managed manifests, so that I can minimize the resource
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure what you mean by "managed manifests"? I'd suggest to remove the "without breaking the cluster" since in all cases we don't want that to happen.

of the monitoring footprint as possible without breaking the cluster,
including any managed manifests, so that I can minimize the resource
consumption of monitoring on my cluster.
> As a cluster administrator, I want to be able to disable as much
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe merge this intent with the previous one?
can we add an "openshift dev" story about not losing the telemetry signal?

only, it makes sense to enable the "telemetry" collection profile
when the capability is disabled. This allows us to actually regulate
the metrics ingestion from exporters that would otherwise push data
into Prometheus that is not telemetry-related.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related to my question above on layered operators, I wonder what will be the consequences when monitoring=disabled and telemetry=disabled. Do they need to implement all profiles (including telemetry) to get a signal?

when the capability is disabled, moving away from an HA setup
to a single replica setup makes sense from a resource consumption
perspective. All components across OpenShift that rely on Thanos
will need to be "taught" to query the single Prometheus replica
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that the data they're interested in is available. I'd prefer to declare that when monitoring is disabled, we pretend to expose no service.

what should be the behavior?
> Since telemetry is orthogonal to the capability itself, opting
out of telemetry should lead to disabling all telemetry-related
components (e.g., exporters, telemetry-specific `PrometheusRules`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one could argue that exporters should be on.

exception, and will remain enabled to ensure that the autoscaling
pipelines are not affected.

6. Should the capability be named `OptionalMonitoring` or just `Monitoring`?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still not convinced by OptionalMonitoring. What about PlatformMonitoring?


4. Should we downscale Prometheus to a single replica when the
capability is disabled?
> Yes, since the monitoring footprint is reduced significantly
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we ok with the fact that we may have lower availability for subscription usage?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants