-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make co/authentication rotation robust in SNO env test #2695
Make co/authentication rotation robust in SNO env test #2695
Conversation
/lgtm |
@@ -168,6 +168,7 @@ | |||
interval_time = 5 | |||
timeout = Integer(timeout) rescue 60 | |||
interval_time = 20 if timeout > 100 | |||
interval_time = 5 if env.nodes.length == 1 # interval needs to be short because rotation is quick in SNO env |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check line 168, interval_time
defaults to 5
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liangxia line 170 overrides it, the case has steps of timeout > 100 (180 seconds), this caused the failure, see the Description part:
#<RuntimeError: The authentication operator still didn't become {"Progressing"=>"True"} after 180 seconds>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps that means you need a shorter timeout rather than overwrite it again
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need it due to https://github.com/openshift/verification-tests/pull/2218/files#diff-8efd8ed73179d3b17bdfae8adad9cf49e0a0ad91eaf68e3b2a36766289d56616R41 . Line 171 is indeed needed for rollout cases in SNO env which is special in having only single replia for each component.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you help check again per previous comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@liangxia @pruan-rht , could you help merge if no more objection given above reasons explained? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's for the rollout, but here we are talking about operator in progressing.
It should work if you change
operator "authentication" becomes progressing within 180 seconds
to
operator "authentication" becomes progressing within 100 seconds
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
operator "authentication" becomes progressing within 100 seconds
met failures due to 1958198 as mentioned in previous comment's link.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the workaround for the bug spreads here and there. Any plan to cleanup those workarounds after the bug is fixed ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here and there
Understood your concern, and frankly speaking, I hate workarounds too, just had to do.
About "here and there", I have to explain "here and there" are both needed. For "there" (i.e. let the timeout big enough), it is due to the bug 1958198 there. For "here" (i.e. let interval_time = 5 if env.nodes.length == 1
), it is due to SNO feature (see API-1136) that transitions statuses quicker. Please don't worry about cleanup, when ON_QA we of course will remember to clean up. Thanks!
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: liangxia The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
@xingxingxia: all tests passed! Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
Fixing OCPQE-8730.
The failure has 2 parts to fix:
One is:
It actually had become "Progressing"=>"True". But the poll interval is too big and the transition to False is quick in SNO env. We need use small interval for SNO env.
The other issue is:
When oauth.config.openshift.io/cluster is updated, auth pods are rotating and unavailable temporarily, token validation will be broken, i.e. here for request of
getting projects by user
. We need ensure auth pods rotation is complete first.Pass logs for fix in SNO env: ocp-c1/job/Runner/322630/console
@y4sht @liangxia help review / merge, thanks!