Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Make co/authentication rotation robust in SNO env test #2695

Conversation

xingxingxia
Copy link
Contributor

Fixing OCPQE-8730.

The failure has 2 parts to fix:
One is:

01:15:12 INFO> #### Operator authentication Expected conditions: {"Progressing"=>"True"}
#<RuntimeError: The authentication operator still didn't become {"Progressing"=>"True"} after 180 seconds>

It actually had become "Progressing"=>"True". But the poll interval is too big and the transition to False is quick in SNO env. We need use small interval for SNO env.

The other issue is:

01:15:12 INFO> Shell Commands: oc replace -f - --kubeconfig=/home/jenkins/ws/workspace/ocp-common/Runner/workdir/ocp4_admin.kubeconfig
oauth.config.openshift.io/cluster replaced
error: You must be logged in to the server (Unauthorized)
01:16:44 ERROR> #<RuntimeError: error getting projects by user: 'testuser-40@ocp4'>

When oauth.config.openshift.io/cluster is updated, auth pods are rotating and unavailable temporarily, token validation will be broken, i.e. here for request of getting projects by user. We need ensure auth pods rotation is complete first.
Pass logs for fix in SNO env: ocp-c1/job/Runner/322630/console
@y4sht @liangxia help review / merge, thanks!

@openshift-ci openshift-ci bot requested review from dis016 and gpei February 21, 2022 10:26
@y4sht
Copy link
Contributor

y4sht commented Feb 24, 2022

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 24, 2022
@@ -168,6 +168,7 @@
interval_time = 5
timeout = Integer(timeout) rescue 60
interval_time = 20 if timeout > 100
interval_time = 5 if env.nodes.length == 1 # interval needs to be short because rotation is quick in SNO env
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Check line 168, interval_time defaults to 5

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liangxia line 170 overrides it, the case has steps of timeout > 100 (180 seconds), this caused the failure, see the Description part:

#<RuntimeError: The authentication operator still didn't become {"Progressing"=>"True"} after 180 seconds>

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps that means you need a shorter timeout rather than overwrite it again

Copy link
Contributor Author

@xingxingxia xingxingxia Feb 25, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need it due to https://github.com/openshift/verification-tests/pull/2218/files#diff-8efd8ed73179d3b17bdfae8adad9cf49e0a0ad91eaf68e3b2a36766289d56616R41 . Line 171 is indeed needed for rollout cases in SNO env which is special in having only single replia for each component.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you help check again per previous comment?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liangxia @pruan-rht , could you help merge if no more objection given above reasons explained? Thanks!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's for the rollout, but here we are talking about operator in progressing.

It should work if you change
operator "authentication" becomes progressing within 180 seconds to
operator "authentication" becomes progressing within 100 seconds

Copy link
Contributor Author

@xingxingxia xingxingxia Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

operator "authentication" becomes progressing within 100 seconds met failures due to 1958198 as mentioned in previous comment's link.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the workaround for the bug spreads here and there. Any plan to cleanup those workarounds after the bug is fixed ?

Copy link
Contributor Author

@xingxingxia xingxingxia Mar 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here and there

Understood your concern, and frankly speaking, I hate workarounds too, just had to do.
About "here and there", I have to explain "here and there" are both needed. For "there" (i.e. let the timeout big enough), it is due to the bug 1958198 there. For "here" (i.e. let interval_time = 5 if env.nodes.length == 1), it is due to SNO feature (see API-1136) that transitions statuses quicker. Please don't worry about cleanup, when ON_QA we of course will remember to clean up. Thanks!

@liangxia
Copy link
Member

liangxia commented Mar 3, 2022

/approve

@openshift-ci
Copy link

openshift-ci bot commented Mar 3, 2022

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: liangxia

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 3, 2022
@openshift-ci
Copy link

openshift-ci bot commented Mar 3, 2022

@xingxingxia: all tests passed!

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@openshift-merge-robot openshift-merge-robot merged commit 604c675 into openshift:master Mar 3, 2022
sunzhaohua2 pushed a commit to sunzhaohua2/verification-tests that referenced this pull request Jun 26, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants