@creydr
Member

@creydr creydr commented Jan 10, 2022

Currently, in SNO deployments, cloud-network-config-controller does not tolerate the API server being unavailable for longer than 10s; after that it kills itself and restarts. This is considered a test failure (see Slack: https://coreos.slack.com/archives/C02NZBANL3G/p1641823800130400):

W0106 15:59:58.068015       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
I0106 15:59:58.068922       1 leaderelection.go:248] attempting to acquire leader lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock...
I0106 15:59:58.101962       1 leaderelection.go:258] successfully acquired lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock
I0106 15:59:58.102776       1 controller.go:88] Starting node controller
I0106 15:59:58.102794       1 controller.go:91] Waiting for informer caches to sync for node workqueue
I0106 15:59:58.102813       1 controller.go:88] Starting secret controller
I0106 15:59:58.102823       1 controller.go:91] Waiting for informer caches to sync for secret workqueue
I0106 15:59:58.102837       1 controller.go:88] Starting cloud-private-ip-config controller
I0106 15:59:58.102846       1 controller.go:91] Waiting for informer caches to sync for cloud-private-ip-config workqueue
I0106 15:59:58.106928       1 controller.go:182] Assigning key: ip-10-0-155-2.ec2.internal to node workqueue
I0106 15:59:58.203784       1 controller.go:96] Starting cloud-private-ip-config workers
I0106 15:59:58.203899       1 controller.go:102] Started cloud-private-ip-config workers
I0106 15:59:58.203787       1 controller.go:96] Starting node workers
I0106 15:59:58.203959       1 controller.go:102] Started node workers
I0106 15:59:58.203987       1 controller.go:160] Dropping key 'ip-10-0-155-2.ec2.internal' from the node workqueue
I0106 15:59:58.203789       1 controller.go:96] Starting secret workers
I0106 15:59:58.204081       1 controller.go:102] Started secret workers
E0106 16:27:46.492843       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.ci-op-32xcdvcc-f2b1a.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp 10.0.134.157:6443: connect: connection refused
E0106 16:27:48.496582       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.ci-op-32xcdvcc-f2b1a.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp 10.0.134.157:6443: connect: connection refused
E0106 16:27:50.496183       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.ci-op-32xcdvcc-f2b1a.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp 10.0.134.157:6443: connect: connection refused
E0106 16:27:52.495801       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.ci-op-32xcdvcc-f2b1a.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp 10.0.134.157:6443: connect: connection refused
E0106 16:27:54.496372       1 leaderelection.go:330] error retrieving resource lock openshift-cloud-network-config-controller/cloud-network-config-controller-lock: Get "https://api-int.ci-op-32xcdvcc-f2b1a.aws-2.ci.openshift.org:6443/api/v1/namespaces/openshift-cloud-network-config-controller/configmaps/cloud-network-config-controller-lock": dial tcp 10.0.134.157:6443: connect: connection refused
I0106 16:27:56.490684       1 leaderelection.go:283] failed to renew lease openshift-cloud-network-config-controller/cloud-network-config-controller-lock: timed out waiting for the condition
E0106 16:27:56.490775       1 leaderelection.go:306] Failed to release lock: resource name may not be empty
I0106 16:27:56.490785       1 main.go:159] Stopped leading, sending SIGTERM and shutting down controller
I0106 16:27:56.490866       1 controller.go:104] Shutting down node workers
I0106 16:27:56.490873       1 controller.go:104] Shutting down secret workers
I0106 16:27:56.490880       1 controller.go:104] Shutting down cloud-private-ip-config workers
I0106 16:27:56.490913       1 main.go:166] Finished executing controlled shutdown

This PR addresses this by setting the leader election timeouts to the values recommended in openshift/library-go/pull/1104, so that the controller can handle API server disruption on SNO.

To be able to handle an api server disruption on SNO, the leader
election timeouts need to be adjusted according to
github.com/openshift/library-go/pull/1104.

Signed-off-by: Christoph Stäbler <cstabler@redhat.com>
@creydr
Member Author

creydr commented Jan 10, 2022

/assign @alexanderConstantinescu
/assign @mfojtik

@creydr
Member Author

creydr commented Jan 10, 2022

/retitle Bug 2033751: Update leader election timeouts to handle api server disruptions on SNO

@openshift-ci openshift-ci bot changed the title Update leader election timeouts to handle api server disruptions on SNO Bug 2033751: Update leader election timeouts to handle api server disruptions on SNO Jan 10, 2022
@openshift-ci openshift-ci bot added bugzilla/severity-high Referenced Bugzilla bug's severity is high for the branch this PR is targeting. bugzilla/valid-bug Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting. labels Jan 10, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 10, 2022

@creydr: This pull request references Bugzilla bug 2033751, which is valid. The bug has been updated to refer to the pull request using the external bug tracker.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 2033751: Update leader election timeouts to handle api server disruptions on SNO

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot requested a review from wangke19 January 10, 2022 15:27
@mfojtik

mfojtik commented Jan 10, 2022

/lgtm

@openshift-ci
Contributor

openshift-ci bot commented Jan 10, 2022

@creydr: This pull request references Bugzilla bug 2033751, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target release (4.10.0) matches configured target release for branch (4.10.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, ON_DEV, POST, POST)

Requesting review from QA contact:
/cc @wangke19

In response to this:

Bug 2033751: Update leader election timeouts to handle api server disruptions on SNO

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 10, 2022
Contributor

@alexanderConstantinescu alexanderConstantinescu left a comment

In the library code there seems to be a distinction between HighlyAvailableTopologyMode and SingleReplicaTopologyMode. This is not taken into account in this PR, but I assume it might not be that important at this point and for this component?

/lgtm

@creydr
Member Author

creydr commented Jan 10, 2022

In the library code there seems to be a distinction between HighlyAvailableTopologyMode and SingleReplicaTopologyMode. This is not taken into account in this PR, but I assume it might not be that important at this point and for this component?

@alexanderConstantinescu: library-go also has a function LeaderElectionDefaulting which sets the defaults accordingly. But I didn't want to import the package only because of this function.

@creydr
Member Author

creydr commented Jan 11, 2022

@abhat could you have a look at this PR? It is for https://coreos.slack.com/archives/C02NZBANL3G/p1641823800130400

/assign @abhat

@squeed
Contributor

squeed commented Jan 11, 2022

@alexanderConstantinescu: library-go also has a function LeaderElectionDefaulting which sets the defaults accordingly. But I didn't want to import the package only because of this function.

Indeed, this is ultimately going to be upstreamed (long story), so avoiding as many openshift libraries as possible would be good.

@squeed
Contributor

squeed commented Jan 11, 2022

/approve

@openshift-ci
Contributor

openshift-ci bot commented Jan 11, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alexanderConstantinescu, creydr, mfojtik, squeed

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 11, 2022
@openshift-merge-robot openshift-merge-robot merged commit 6d25996 into openshift:master Jan 11, 2022
@openshift-ci
Contributor

openshift-ci bot commented Jan 11, 2022

@creydr: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

These pull requests must merge or be unlinked from the Bugzilla bug in order for it to move to the next state. Once unlinked, request a bug refresh with /bugzilla refresh.

Bugzilla bug 2033751 has not been moved to the MODIFIED state.

In response to this:

Bug 2033751: Update leader election timeouts to handle api server disruptions on SNO

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci
Contributor

openshift-ci bot commented Jan 11, 2022

@creydr: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

6 participants