OCPBUGS-32510: change metrics-server probes for SNO #2337

simonpasquier · 2024-05-06T08:55:45Z

This change switches the metrics-server's readiness probe to use the /livez endpoint instead of /readyz for single-node deployments.

By default, the /readyz endpoint is used to assert the component's readiness. This endpoint returns success once the metrics-server has metric samples over 2 intervals (i.e. it has scraped at least one kubelet twice).
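As a rough illustration of that readiness condition, the sketch below models "at least one kubelet scraped twice" in Go. The `sampleCounts` type and `readyzOK` function are hypothetical names invented for this example; this is a simplified model, not metrics-server's actual implementation.

```go
package main

import "fmt"

// sampleCounts maps a (hypothetical) node name to the number of successful
// kubelet scrapes recorded for that node so far.
type sampleCounts map[string]int

// readyzOK models the /readyz condition described above: it reports success
// once at least one kubelet has been scraped twice, so that samples span two
// intervals. Illustrative model only, not metrics-server code.
func readyzOK(s sampleCounts) bool {
	for _, n := range s {
		if n >= 2 {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(readyzOK(sampleCounts{"node-a": 1})) // false: only one scrape so far
	fmt.Println(readyzOK(sampleCounts{"node-a": 2})) // true: two scrapes of one kubelet
}
```

With a single node, there is only one kubelet that can satisfy the condition, so one slow kubelet response delays readiness for the whole component.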

In single-node deployments, the kubelet sometimes fails to respond in a timely fashion (especially in end-to-end tests) due to contention in cAdvisor, which delays readiness and causes test failures. To work around the issue, we use the /livez endpoint in this mode.

The long-term plan is to switch resource metrics from cAdvisor to the CRI stats API (currently an alpha feature). Once that happens, we can remove this change.

I added CHANGELOG entry for this change.
No user facing changes, so no entry in CHANGELOG was needed.

openshift-ci-robot · 2024-05-06T08:55:51Z

@simonpasquier: This pull request references Jira Issue OCPBUGS-32510, which is valid.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.16.0) matches configured target version for branch (4.16.0)
bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

The bug has been updated to refer to the pull request using the external bug tracker.


simonpasquier · 2024-05-06T09:02:53Z

/assign @machine424
/assign @jan--f
/assign @slashpai

slashpai · 2024-05-06T12:59:45Z

/retest

This change switches the metrics-server's readiness probe to use the `/livez` endpoint instead of `/readyz` for single-node deployments. It also adds a startup probe using the same characteristics as the default readiness probe to ensure that the pod reports ready only when it has gathered enough samples from kubelet.

By default, the `/readyz` endpoint is used to assert the component's readiness. This endpoint returns success once the metrics-server has metric samples over 2 intervals (i.e. it has scraped at least one kubelet twice).

In single-node deployments, the kubelet sometimes fails to respond in a timely fashion (especially in end-to-end tests) due to contention in cAdvisor, which delays readiness and causes test failures. To work around the issue, we use the `/livez` endpoint in this mode.

The long-term plan is to switch resource metrics from cAdvisor to the CRI stats API (currently an alpha feature). Once that happens, we can remove this change.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
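The probe selection described in this commit message could be sketched as follows. This is a minimal, self-contained illustration: the `probe` and `probes` types and the `probesFor` function are hypothetical stand-ins (the operator itself would work with Kubernetes API types), the timing values are taken from the pod spec shown in the QE verification later in this thread, and the assumption that multi-node deployments get no startup probe is mine rather than something the commit message states.

```go
package main

import "fmt"

// probe is a hypothetical, trimmed-down stand-in for a Kubernetes HTTP probe.
type probe struct {
	Path                string
	InitialDelaySeconds int
	PeriodSeconds       int
	FailureThreshold    int
}

// probes groups the probes that this change touches.
type probes struct {
	Readiness probe
	Startup   *probe // nil when no startup probe is configured
}

// probesFor returns the probe set for the given topology.
// Assumption: outside of single-node mode the default /readyz readiness
// probe is kept and no startup probe is added.
func probesFor(singleNode bool) probes {
	def := probe{Path: "/readyz", InitialDelaySeconds: 20, PeriodSeconds: 20, FailureThreshold: 6}
	if !singleNode {
		return probes{Readiness: def}
	}
	// Single-node: the startup probe keeps the default readiness
	// characteristics, while readiness itself switches to /livez.
	startup := def
	readiness := def
	readiness.Path = "/livez"
	return probes{Readiness: readiness, Startup: &startup}
}

func main() {
	sno := probesFor(true)
	fmt.Println(sno.Readiness.Path, sno.Startup.Path) // /livez /readyz
}
```

The startup probe gates the first transition to ready on `/readyz`, so the pod still waits for enough kubelet samples once; after that, a slow kubelet no longer flips the pod back to not ready.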

simonpasquier · 2024-05-06T14:24:01Z

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1878

openshift-ci · 2024-05-06T14:24:05Z

@simonpasquier: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

machine424 · 2024-05-06T14:40:51Z

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

openshift-ci · 2024-05-06T14:41:49Z

@machine424: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a33270c0-0bb6-11ef-9260-f04f1bc5bec1-0

slashpai · 2024-05-06T14:51:11Z

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

openshift-ci · 2024-05-06T14:51:15Z

@slashpai: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/14e95a20-0bb8-11ef-8bea-fa1b23220f91-0

openshift-ci · 2024-05-06T17:17:27Z

@simonpasquier: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/versions | `f9670c7` | link | false | `/test versions` |


juzhao · 2024-05-07T03:17:31Z

pkg/manifests/manifests_test.go

+			"tls.key": []byte("foo"),
+		},
+	}
+	apiAuthConfigMapData := map[string]string{


Is `requestheader-client-ca-file` needed?

Technically, it isn't required by the unit test.

simonpasquier · 2024-05-07T08:42:07Z

/hold

slashpai · 2024-05-07T09:55:01Z

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

openshift-ci · 2024-05-07T09:55:05Z

@slashpai: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

machine424 · 2024-05-07T10:03:03Z

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

machine424 · 2024-05-07T10:05:10Z

(Even though we already got a green run in #2337 (comment) and no changes have been pushed since. The test is failing in #2337 (comment) because of unrelated etcd events.)

simonpasquier · 2024-05-07T12:49:00Z

/skip

simonpasquier · 2024-05-07T13:03:39Z

/skip

juzhao · 2024-05-08T03:49:01Z

Tested with the PR:
launch 4.16.0-0.nightly-2024-05-07-025557,openshift/cluster-monitoring-operator#2337 aws,single-node
The readinessProbe path changed from /readyz to /livez, and a startupProbe was added.

$ oc -n openshift-monitoring get pod metrics-server-5cc4cd5f75-5nshz -oyaml
...
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /livez
        port: https
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: metrics-server
    ports:
    - containerPort: 10250
      name: https
      protocol: TCP
    readinessProbe:
      failureThreshold: 6
      httpGet:
        path: /livez
        port: https
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 1m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000450000
    startupProbe:
      failureThreshold: 6
      httpGet:
        path: /readyz
        port: https
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1

/label qe-approved

machine424 · 2024-05-13T10:29:01Z

/lgtm

openshift-ci · 2024-05-13T10:29:54Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier


Needs approval from an approver in each of these files:

~~OWNERS~~ [machine424,simonpasquier]


slashpai · 2024-05-13T10:55:46Z

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

openshift-ci · 2024-05-13T10:55:49Z

@slashpai: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5a5eb190-1117-11ef-93e5-df91624bf14d-0

simonpasquier · 2024-05-13T14:11:30Z

/hold cancel

openshift-ci-robot · 2024-05-13T17:01:47Z

@simonpasquier: Jira Issue OCPBUGS-32510: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

openshift/cluster-monitoring-operator#2329 is open

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-32510 has not been moved to the MODIFIED state.


openshift-bot · 2024-05-13T21:10:31Z

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-monitoring-operator-container-v4.17.0-202405132002.p0.g86b6d4b.assembly.stream.el9 for distgit cluster-monitoring-operator.
All builds following this will include this PR.

openshift-ci bot requested a review from juzhao May 6, 2024 08:56

openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2024

openshift-ci bot requested review from jan--f and rexagod May 6, 2024 08:56

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2024

openshift-ci bot assigned jan--f, machine424 and slashpai May 6, 2024

simonpasquier force-pushed the update-metrics-server-probe-sno branch from a2bb416 to f9670c7 Compare May 6, 2024 14:20

openshift-ci bot mentioned this pull request May 6, 2024

Revert "Revert "Merge pull request #1851 from slashpai/metrics-server"" openshift/api#1865

Merged

juzhao reviewed May 7, 2024


simonpasquier changed the title ~~[WIP] OCPBUGS-32510: change metrics-server ready probe for SNO~~ [WIP] OCPBUGS-32510: change metrics-server probes for SNO May 7, 2024

simonpasquier changed the title ~~[WIP] OCPBUGS-32510: change metrics-server probes for SNO~~ OCPBUGS-32510: change metrics-server probes for SNO May 7, 2024

openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 7, 2024

openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 8, 2024

openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2024

openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2024

openshift-merge-bot bot merged commit 86b6d4b into openshift:master May 13, 2024
17 checks passed

simonpasquier deleted the update-metrics-server-probe-sno branch May 22, 2024 11:42
