Disable metrics to avoid pod collisions #403

Merged

Conversation

Disable metrics to avoid pod collisions. This was enabled by default in https://github.com/kubernetes-sigs/controller-runtime/pull/510/files
@openshift-ci-robot openshift-ci-robot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Oct 4, 2019
@spangenberg
Contributor

/lgtm

@openshift-ci-robot openshift-ci-robot added the lgtm Indicates that a PR is ready to be merged. label Oct 4, 2019
@spangenberg
Contributor

/approve

@openshift-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: spangenberg

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci-robot openshift-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 4, 2019
@wking
Member

wking commented Oct 4, 2019

Is this a short-term fix while we get leader election or some such to avoid metrics collision? Doesn't seem like something we want to turn off completely in the long term.

@enxebre
Member Author

enxebre commented Oct 4, 2019

> Is this a short-term fix while we get leader election or some such to avoid metrics collision? Doesn't seem like something we want to turn off completely in the long term.

@wking we have never exposed the controller-runtime metrics so far. You can check the ones we do expose here: https://github.com/openshift/machine-api-operator/tree/master/pkg/metrics, so this just puts us back where we were.
After bumping controller-runtime (https://github.com/kubernetes-sigs/controller-runtime/pull/510/files), the controllers started to expose metrics on port 8080. Since multiple containers live in the same pod, they started clashing. Once we enable controller-runtime metrics, we'll figure out how best to accommodate them for all containers.
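
To illustrate the clash described above: containers in one pod share a network namespace, so two processes that bind the same port collide. The following is a minimal Go sketch using only the standard library, not the actual change in this PR. `bindMetrics` is a hypothetical helper; its treatment of `"0"` as "metrics disabled" is an assumption modeled on the documented convention of controller-runtime's `MetricsBindAddress` option.

```go
package main

import (
	"fmt"
	"net"
)

// bindMetrics is a hypothetical stand-in for a controller starting its
// metrics listener. The address "0" means "do not serve metrics at all".
func bindMetrics(addr string) (net.Listener, error) {
	if addr == "0" {
		return nil, nil // metrics disabled: no port is claimed
	}
	return net.Listen("tcp", addr)
}

func main() {
	// First "container": let the OS pick a free loopback port.
	first, err := bindMetrics("127.0.0.1:0")
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	defer first.Close()

	// Second "container" in the same network namespace asks for the
	// same address and is expected to fail with an address-in-use error.
	if second, err := bindMetrics(first.Addr().String()); err != nil {
		fmt.Println("second bind failed:", err)
	} else {
		second.Close()
	}

	// With metrics disabled, nothing is bound and nothing can clash.
	disabled, err := bindMetrics("0")
	fmt.Println("listener created:", disabled != nil, "error:", err)
}
```

Disabling the listener sidesteps the collision entirely. The alternative, if the metrics are wanted later, would be giving each container its own port.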

@wking
Member

wking commented Oct 4, 2019

upgrade failed:

fail [k8s.io/kubernetes/test/e2e/framework/util.go:1674]: Unexpected error:
    <*errors.errorString | 0xc003442220>: {
        s: "failed to get logs from pod-secrets-ee938ab0-ef38-4e67-b228-b308af3b7235 for secret-volume-test: an error on the server (\"unknown\") has prevented the request from succeeding (get pods pod-secrets-ee938ab0-ef38-4e67-b228-b308af3b7235)",
    }
    failed to get logs from pod-secrets-ee938ab0-ef38-4e67-b228-b308af3b7235 for secret-volume-test: an error on the server ("unknown") has prevented the request from succeeding (get pods pod-secrets-ee938ab0-ef38-4e67-b228-b308af3b7235)
occurred

Dunno if it's related or not.

/retest

@wking wking mentioned this pull request Oct 4, 2019
@enxebre
Member Author

enxebre commented Oct 4, 2019

Installing from initial release registry.svc.ci.openshift.org/ci-op-fsnivbhk/release@sha256:187017077f5d19c9c9b92c56f40741b7c88103c6759c961dd19b6050de82b64f
level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to fetch dependency of \"Common Manifests\": failed to generate asset \"DNS Config\": getting public zone for \"origin-ci-int-aws.dev.rhcloud.com\": listing hosted zones: Throttling: Rate exceeded\n\tstatus code: 400, request id: 15468129-4fcc-40b2-b279-83ccee4043a0"

/retest

@runcom
Member

runcom commented Oct 4, 2019

looks like there are no workers in the upgrade job (as in, even mco reports 0 workers)

@enxebre
Member Author

enxebre commented Oct 4, 2019

Seems we are hitting AWS rate limits:

level=fatal msg="failed to fetch Terraform Variables: failed to fetch dependency of \"Terraform Variables\": failed to fetch dependency of \"Bootstrap Ignition Config\": failed to fetch dependency of \"Common Manifests\": failed to generate asset \"DNS Config\": getting public zone for \"origin-ci-int-aws.dev.rhcloud.com\": listing hosted zones: Throttling: Rate exceeded\n\tstatus code: 400, request id: a765f1d5-e7d2-48ba-b3b9-8cc4fe4df090"

@enxebre
Member Author

enxebre commented Oct 4, 2019

/retest

@enxebre
Member Author

enxebre commented Oct 4, 2019

/retest

@smarterclayton
Contributor

/test all

@openshift-ci-robot
Contributor

@enxebre: The following test failed, say /retest to rerun them all:

| Test name | Commit | Details | Rerun command |
| --- | --- | --- | --- |
| ci/prow/e2e-aws-upgrade | 1d5caa4 | link | /test e2e-aws-upgrade |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@smarterclayton
Contributor

Upgrading is failing the same way for both the revert and this, so it’s either a separate problem that this PR doesn’t fix, or a more general problem that is somehow happening due to the RHCoS bump. I’m going to force merge this, and follow up to see if an rhcos issue is at play.

@smarterclayton smarterclayton merged commit 12de4a4 into openshift:master Oct 5, 2019
@wking
Member

wking commented Oct 5, 2019

> ... a more general problem that is somehow happening due to the RHCoS bump.

openshift/installer#2455 is still open, and we need openshift/installer#2459 first anyway, so I don't think RHCOS comes into this.

Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.