KEP: (sig-testing) continuously deploy k8s prow #2540
/assign @spiffxp Please feel free to assign other approvers as you see fit.
@chaodaiG I'm not a SIG Release chair (you may have misunderstood what Aaron said. :-))
Thank you Arnaud!
30faaa1 to 6f2a674
Can't repro this test failure locally on mac
6f2a674 to 4cd4462
Repro on linux, the root cause was
As proposed in kubernetes/enhancements#2540, k8s prow will be bumped more frequently than once per work day. Adding this label to the job helps associate failures with specific prow versions.
Related kubernetes/enhancements#2540 Report prow deployment job status on Slack instead of oncall manually posting
/approve
/lgtm
I ask that @kubernetes/sig-release-leads and other interested parties please add comments here on what they want to see to get this to implementable
## Proposal

Prow autobump PRs are automatically merged every hour, only on working hours of working days.
In the SIG Testing meeting we discussed starting with a much longer interval, given that some jobs take O(2h) to complete and a slower update rate may make troubleshooting easier.
I'll start with 3 hours for now.
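To make the cadence discussed in this thread concrete, here is a minimal sketch of the gating rule: merge only during working hours on working days, and no more often than every 3 hours. The function name, the 09:00 to 17:00 window, and the Monday to Friday definition of a working day are assumptions for illustration. The KEP text quoted above does not define them, and the real autobump job is configured in Prow rather than through code like this.

```python
from datetime import datetime, timedelta
from typing import Optional

# Assumed values for illustration only; none of these come from the KEP.
WORK_START_HOUR = 9    # hypothetical start of working hours (local time)
WORK_END_HOUR = 17     # hypothetical end of working hours (local time)
MIN_INTERVAL = timedelta(hours=3)  # "3 hours for now", per the reply above


def may_merge_autobump(now: datetime, last_merge: Optional[datetime]) -> bool:
    """Return True if an autobump PR may be merged at `now`."""
    # Skip weekends: datetime.weekday() returns 5 for Saturday, 6 for Sunday.
    if now.weekday() >= 5:
        return False
    # Skip times outside the assumed working-hours window.
    if not (WORK_START_HOUR <= now.hour < WORK_END_HOUR):
        return False
    # Enforce the minimum spacing between two consecutive merges.
    if last_merge is not None and now - last_merge < MIN_INTERVAL:
        return False
    return True
```

With these assumed values, a bump merged at 10:00 on a Monday blocks further merges until 13:00 the same day.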
#### Automated Merging of Prow Autobump PRs

- Prow autobump job is already configured to run on work days only, change it to at least one hour apart, so that it doesn’t bump more frequently than one hour.
We should decouple prow auto bump from job image auto bump (I have an open issue for this in test-infra)
This is a requirement, IMO, since we're going to alter the frequency and they're currently coupled.
kubernetes/test-infra#21137 there's the issue for it
participating-sigs:
- sig-testing
- sig-release
status: provisional
Address the review comments and I am happy to merge a followup that bumps this to implementable
- Manually apply the changes from rollback PR.

```
<<[UNRESOLVED]>>
```
I would like to see a communication plan as part of this, e.g. how and when you will announce this change:
- downstream
- to the users of prow.k8s.io
We should include plank version in testgrid.k8s.io jobs (cc @MushuEE)
I think this shouldn't be subject to v1.21 release freeze dates, but we should have a plan for how to be respectful toward the test freeze -> release phase.
This should have alpha/beta/ga phases called out. Low frequency as part of alpha. Test grid plumbed as part of beta. High frequency and decoupled jobs as part of GA
I'm not super familiar with the k8s release timeline. The latest announcement says "Test Freeze - Wednesday, March 24" and "The deadline to submit an entry is Thursday, March 25, EOD Pacific Time", so the official release date is probably some time after March 25. Should we wait until the official release of 1.21.0?
Have updated #2553 with:
- announcement
- alpha/beta/ga phases
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: alvaroaleman, chaodaiG, spiffxp
first pass: there's a lot I'd like to see before implementable.
## Implementation History

## Alternatives
this does not seem very detailed currently.
My bad, added much more detail.
When prow stopped functioning after a bump, prow oncall should:
- Stop auto-deploying by commenting `/hold` on latest autobump PR.
- Manually create rollback PR for rolling back to known good version.
how?
Updated with more details on how to find a good version, mostly from what @alvaroaleman mentioned in the sig-testing meeting, since OpenShift has been doing this for quite a while.
When prow stopped functioning after a bump, prow oncall should:
- Stop auto-deploying by commenting `/hold` on latest autobump PR.
- Manually create rollback PR for rolling back to known good version.
- Manually apply the changes from rollback PR.
how?
Added
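As a rough illustration of the "known good version" step in the rollback procedure quoted in this thread, the sketch below picks the newest deployment that was marked healthy. Everything here is hypothetical: the `Deployment` record, the `healthy` flag, and the tag names are assumptions for illustration, and this thread does not describe how the KEP or OpenShift actually determine a good version.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Deployment:
    version: str   # hypothetical image tag of a deployed prow version
    healthy: bool  # assumed to be recorded by oncall or by alerting


def last_known_good(history: List[Deployment]) -> Optional[str]:
    """Return the newest healthy version, or None if there is none.

    `history` is assumed to be ordered from oldest to newest deployment.
    """
    for deployment in reversed(history):
        if deployment.healthy:
            return deployment.version
    return None
```

The returned tag would be the target of the manually created rollback PR. When no healthy deployment is recorded, the function returns `None` and oncall has to pick a version some other way.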
#### Breaking Changes in Prow
Breaking changes in prow will require manual intervention. Currently prow isn’t able to handle these intelligently, as it was not designed with the mindset of API versions and thus kubernetes conversion webhook can not help coping with breaking changes among major APIs.
One possible way of dealing with breaking changes, is:
- Prow oncall inspects prow logs and breaking changes announcements once per week, and take actions based on deprecation warnings from prow logs and breaking changes from ANNOUNCEMENTS.md.
is this sufficient if we're merging hourly?
Good point. As mentioned in https://github.com/kubernetes/enhancements/pull/2540/files#r586723114, we'll start with 3 hours, and we plan to rely on alerts rather than on oncall scanning logs to discover prow errors.
### Notes/Constraints/Caveats (Optional)

#### Breaking Changes in Prow
Breaking changes in prow will require manual intervention. Currently prow isn’t able to handle these intelligently, as it was not designed with the mindset of API versions and thus kubernetes conversion webhook can not help coping with breaking changes among major APIs.
perhaps this should change?
Totally agree with you, but given the scope of this KEP I think this should be a separate discussion.
#### Prow Users

Shouldn’t see any change, prow breakage should be discovered by prow monitoring system and rollback will be performed. The chance of prow being break is almost identical to what we have today(Assume there are not more than a single breaking change every day).
our most recent outage (the webhook handler panicking and dropping events) was discovered by non-oncall humans, even at the current update frequency.
how do we plan to address this?
kubernetes/test-infra#21090
kubernetes/test-infra#21090 (comment) let's noodle over here
kubernetes/test-infra#21090 (comment) pointed out that this could be discovered by crash-looping detection. Does this work?
My general idea is that this is an ongoing process: we need to make sure that each new type of prow error discovered by non-oncall humans becomes discoverable by alerts in the future.
Cross-posting: kubernetes/test-infra#21090 (comment), the issue in #21090 will be caught by prometheus alerting in the future
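For readers unfamiliar with crash-looping detection, the sketch below shows the general idea under stated assumptions: flag a component when its cumulative restart count grows by at least a threshold within a recent time window. The function name, the sample format, the 10-minute window, and the threshold of 3 are all hypothetical. This is not the Prometheus alert referenced in kubernetes/test-infra#21090, only an illustration of the technique.

```python
from typing import List, Tuple

# Each sample is (timestamp_in_seconds, cumulative_restart_count),
# ordered from oldest to newest. This format is an assumption.
Samples = List[Tuple[float, int]]


def is_crash_looping(samples: Samples,
                     window_s: float = 600.0,
                     threshold: int = 3) -> bool:
    """Return True if restarts grew by >= threshold within the window."""
    if not samples:
        return False
    latest_time, latest_count = samples[-1]
    # Restart counts observed inside the window ending at the latest sample.
    counts_in_window = [count for time, count in samples
                        if latest_time - time <= window_s]
    return latest_count - min(counts_in_window) >= threshold
```

An alert built on a check like this fires without anyone reading logs, which is the shift from active inspection to passive alerting discussed in this thread.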
- What’s Not Changed
  - React to prow alerts and take actions.
- What’s Changed
  - No more manual inspecting prow healthiness.
in favor of?
one of the things manually inspected right now is the logs, in which we have failures that are not caught by automated monitoring today.
I can see your point. This is an overstatement. I didn't actually mean "never" inspecting prow logs; I think it makes more sense for prow oncall to inspect them once per week as a sanity check. As pointed out above, in the future we would like to discover prow errors from passive alerts instead of actively inspecting logs.
Responded to most of the comments here; the mentioned improvements are all included in #2553.
One comment, https://github.com/kubernetes/enhancements/pull/2540/files#r586729058, is not covered yet. I will need to think a little about how to write that up.
ref: #2539
KEP-2539: Addressing comments from #2540
As suggested in today's sig-testing meeting, I'm creating this KEP for open discussion. It is based on the design doc presented at sig-testing: https://docs.google.com/document/d/1pBouf_tgJJ2Gga9Xa5-xObLjN7PiVXiBzL3RYTy56ng.
This proposal aims to automate the deployment of k8s prow to ease the prow oncall workload.