OCPBUGS-32510: change metrics-server probes for SNO #2337

Conversation

simonpasquier
Contributor

This change switches the metrics-server's readiness probe to use the /livez endpoint instead of /readyz for single-node deployments.

By default, the /readyz endpoint is used to assert the component readiness. This endpoint returns success when the metrics-server has metric samples over 2 intervals (e.g. it has scraped at least one kubelet twice).

In single-node deployments, the kubelet sometimes fails to respond in a timely fashion (especially in end-to-end tests) due to contention in cAdvisor, which delays readiness and causes test failures. To work around the issue, we use the /livez endpoint in this mode.

The long-term plan is to switch resource metrics from cAdvisor to the CRI stats API (currently an alpha feature). Once it happens, we can remove this change.

  • I added CHANGELOG entry for this change.
  • No user facing changes, so no entry in CHANGELOG was needed.
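
The endpoint selection described above can be sketched in a few lines of Go. This is only an illustration: `readinessPath` and its `singleNode` parameter are hypothetical names, not the operator's actual code, and the sketch covers only the readiness path.

```go
package main

import "fmt"

// Endpoint paths named in the pull request description.
const (
	readyzPath = "/readyz"
	livezPath  = "/livez"
)

// readinessPath is a hypothetical helper (an assumption for illustration,
// not code from the operator). It returns the HTTP path that the
// metrics-server readiness probe would query.
func readinessPath(singleNode bool) string {
	if singleNode {
		// Work around slow kubelet responses caused by cAdvisor contention.
		return livezPath
	}
	// Default: ready only once samples exist over 2 intervals.
	return readyzPath
}

func main() {
	fmt.Println(readinessPath(false)) // multi-node topology
	fmt.Println(readinessPath(true))  // single-node topology
}
```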

@openshift-ci-robot openshift-ci-robot added jira/severity-critical Referenced Jira bug's severity is critical for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels May 6, 2024
@openshift-ci-robot
Contributor

@simonpasquier: This pull request references Jira Issue OCPBUGS-32510, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.16.0) matches configured target version for branch (4.16.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @juzhao

The bug has been updated to refer to the pull request using the external bug tracker.

@openshift-ci openshift-ci bot requested a review from juzhao May 6, 2024 08:56
@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label May 6, 2024
@openshift-ci openshift-ci bot requested review from jan--f and rexagod May 6, 2024 08:56
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 6, 2024
@simonpasquier
Contributor Author

/assign @machine424
/assign @jan--f
/assign @slashpai

@slashpai
Member

slashpai commented May 6, 2024

/retest

This change switches the metrics-server's readiness probe to use the
`/livez` endpoint instead of `/readyz` for single-node deployments.

It also adds a startup probe using the same characteristics as the
default readiness probe to ensure that the pod reports ready only when
it has gathered enough samples from kubelet.

By default, the `/readyz` endpoint is used to assert the component
readiness. This endpoint returns success when the metrics-server has
metric samples over 2 intervals (e.g. it has scraped at least one
kubelet twice).

In single-node deployments, the kubelet sometimes fails to respond in a
timely fashion (especially in end-to-end tests) due to contention in
cAdvisor, which delays readiness and causes test failures. To work
around the issue, we use the `/livez` endpoint in this mode.

The long-term plan is to switch resource metrics from cAdvisor to the
CRI stats API (currently an alpha feature). Once it happens, we can
remove this change.

Signed-off-by: Simon Pasquier <spasquie@redhat.com>
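
The interaction between the added startup probe and the relaxed readiness probe can be sketched as a small simulation. It assumes the standard Kubernetes behavior that readiness checks only begin after the startup probe has succeeded once. The `tick` type and `readyTimeline` function are hypothetical names used for illustration and are not part of the operator.

```go
package main

import "fmt"

// tick holds the assumed outcome of each endpoint at one probe period.
type tick struct {
	readyz bool // result the startup probe would see
	livez  bool // result the readiness probe would see
}

// readyTimeline returns the pod readiness for each tick. Until /readyz has
// succeeded once (startup probe), the pod is never ready. Afterwards only
// /livez matters, so a later slow kubelet scrape no longer flips readiness.
func readyTimeline(ticks []tick) []bool {
	started := false
	ready := make([]bool, 0, len(ticks))
	for _, t := range ticks {
		if !started {
			// Startup phase: readiness is not evaluated yet.
			started = t.readyz
			ready = append(ready, false)
			continue
		}
		ready = append(ready, t.livez)
	}
	return ready
}

func main() {
	timeline := readyTimeline([]tick{
		{readyz: false, livez: true}, // not enough samples yet
		{readyz: true, livez: true},  // startup probe succeeds
		{readyz: false, livez: true}, // kubelet slow, pod stays ready
		{readyz: false, livez: true},
	})
	fmt.Println(timeline) // prints "[false false true true]"
}
```

In this model the pod still waits for enough kubelet samples before its first transition to ready, which matches the intent stated in the commit message.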
@simonpasquier simonpasquier force-pushed the update-metrics-server-probe-sno branch from a2bb416 to f9670c7 Compare May 6, 2024 14:20
@simonpasquier
Contributor Author

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1878

Contributor

openshift-ci bot commented May 6, 2024

@simonpasquier: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@machine424
Contributor

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

Contributor

openshift-ci bot commented May 6, 2024

@machine424: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a33270c0-0bb6-11ef-9260-f04f1bc5bec1-0

@slashpai
Member

slashpai commented May 6, 2024

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

Contributor

openshift-ci bot commented May 6, 2024

@slashpai: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/14e95a20-0bb8-11ef-8bea-fa1b23220f91-0

Contributor

openshift-ci bot commented May 6, 2024

@simonpasquier: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/versions f9670c7 link false /test versions

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

"tls.key": []byte("foo"),
},
}
apiAuthConfigMapData := map[string]string{
Contributor

is requestheader-client-ca-file needed?

Contributor Author

technically it isn't required by the unit test.

@simonpasquier simonpasquier changed the title [WIP] OCPBUGS-32510: change metrics-server ready probe for SNO [WIP] OCPBUGS-32510: change metrics-server probes for SNO May 7, 2024
@simonpasquier simonpasquier changed the title [WIP] OCPBUGS-32510: change metrics-server probes for SNO OCPBUGS-32510: change metrics-server probes for SNO May 7, 2024
@simonpasquier
Contributor Author

/hold

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. and removed do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. labels May 7, 2024
@slashpai
Member

slashpai commented May 7, 2024

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

Contributor

openshift-ci bot commented May 7, 2024

@slashpai: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

@machine424
Contributor

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

@machine424
Contributor

(Rerunning even though we already got a green run in #2337 (comment) and no changes were pushed since. The test is failing in #2337 (comment) because of unrelated etcd events.)

@simonpasquier
Contributor Author

/skip

@juzhao
Contributor

juzhao commented May 8, 2024

Tested with the PR:
launch 4.16.0-0.nightly-2024-05-07-025557,openshift/cluster-monitoring-operator#2337 aws,single-node
The readinessProbe path changed from /readyz to /livez, and a startupProbe was added.

$ oc -n openshift-monitoring get pod metrics-server-5cc4cd5f75-5nshz -oyaml
...
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /livez
        port: https
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: metrics-server
    ports:
    - containerPort: 10250
      name: https
      protocol: TCP
    readinessProbe:
      failureThreshold: 6
      httpGet:
        path: /livez
        port: https
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 1m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000450000
    startupProbe:
      failureThreshold: 6
      httpGet:
        path: /readyz
        port: https
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label May 8, 2024
@machine424
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label May 13, 2024
Contributor

openshift-ci bot commented May 13, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [machine424,simonpasquier]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@slashpai
Member

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

Contributor

openshift-ci bot commented May 13, 2024

@slashpai: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5a5eb190-1117-11ef-93e5-df91624bf14d-0

@simonpasquier
Contributor Author

/hold cancel

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label May 13, 2024
@openshift-merge-bot openshift-merge-bot bot merged commit 86b6d4b into openshift:master May 13, 2024
17 checks passed
@openshift-ci-robot
Contributor

@simonpasquier: Jira Issue OCPBUGS-32510: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-32510 has not been moved to the MODIFIED state.

@openshift-bot
Contributor

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-monitoring-operator-container-v4.17.0-202405132002.p0.g86b6d4b.assembly.stream.el9 for distgit cluster-monitoring-operator.
All builds following this will include this PR.

@simonpasquier simonpasquier deleted the update-metrics-server-probe-sno branch May 22, 2024 11:42