Add option to restart prowjob when node is terminated (enabled by default) #117
Conversation
✅ Deploy Preview for k8s-prow ready!
Hello @inteon, please take a look! 😃
The code still seems to be used in prow-controller-manager: prow/cmd/prow-controller-manager/main.go (lines 202 to 204 in 0a35181).
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: inteon. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
/label tide/merge-method-squash
Info: we use this change on our cert-manager Prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot).
This change seems like a nice feature, but I am not sure you are covering all of the generic cases of pod terminations that are not related only to GCP. Also, this issue can be resolved by adding a pod disruption budget to your cluster. In my opinion, we shouldn't enable this feature by default. Instead, we can make the specific reasons for recreating the prowjobs configurable, which will allow users to be more specific about their infrastructure.
I am not confident about enabling this feature by default. The implementation covers only the GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that in many cases a prowjob could run forever in a loop. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid the infinite reruns, perhaps it's better to keep track of the number of retries and allow the user to control the threshold. @petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you guys have any input?
I updated the PR and added a RetryCount, which is now incremented every time the pod is re-created (it also counts other retries that were already present in the code). There will be a hard failure after 3 retries have been reached (we might want to make this limit configurable in the future).
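A minimal sketch of the bookkeeping described above; the names (jobStatus, recordRetry) are made up for illustration rather than the PR's actual types, assuming a plain integer counter and the hard limit of 3:

package main

import "fmt"

// jobStatus is a hypothetical stand-in for the prowjob status fields this
// PR touches; the real RetryCount lives on the ProwJob status in Prow's API.
type jobStatus struct {
	RetryCount int
}

// maxRetries mirrors the hard limit of 3 mentioned in the comment above.
const maxRetries = 3

// recordRetry increments the counter and reports whether one more pod may be
// created; once the limit is reached the job should hard-fail instead.
func recordRetry(s *jobStatus) bool {
	if s.RetryCount >= maxRetries {
		return false
	}
	s.RetryCount++
	return true
}

func main() {
	s := &jobStatus{}
	for recordRetry(s) {
		fmt.Println("re-creating pod, retry", s.RetryCount)
	}
	fmt.Println("hard failure after", s.RetryCount, "retries")
}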
@droslean I don't think a PDB would help here, since you don't get a choice as to when spot instances get taken away from you.
Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs will triple if we allow this by default. I would prefer to make this configurable and let the user decide based on the infrastructure.
pkg/plank/reconciler.go (Outdated)
// On GCP, before a new spot instance is started, the old pods are garbage
// collected (if they have not been already by the Kubernetes PodGC):
// https://github.com/kubernetes/cloud-provider-gcp/blob/25e5dcc715781316bc5e39f8b17c0d5b313453f7/cmd/gcp-controller-manager/node_csr_approver.go#L1035-L1058
if condition.Reason == "DeletionByGCPControllerManager" {
Since this is GCP-related only, why not make the reasons configurable too, to allow the user to add more cases based on their infrastructure?
I made the settings configurable: 648600a
I discovered there is already a lot of logic to restart pods (and no limit): Eviction, NodeUnreachable, a PodUnknown phase. So, I added a global limit of 3 and applied that to all restarts, including the existing logic. I haven't (yet) made the restarts in case of a spot instance restart disabled by default, because I think the retryCount limit is a better solution.
I would prefer to let the user decide the list of reasons to restart the prowjob and the retry count limit. For example, in my infrastructure we run costly jobs, and this feature can potentially increase the cost since it reruns them by default for those specific reasons. Your solution is good, but I would prefer to make it configurable so the user won't be limited to hardcoded termination reasons and retry limits. @stevekuznetsov WDYT?
Getting the config semantics right might be hard, but I'm for it.
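A minimal sketch of what such a configurable restart policy might look like; the field names (Reasons, MaxRetries) are hypothetical and not the PR's actual config schema:

package plank

// restartPolicy is a hypothetical per-cluster setting: which pod termination
// reasons may trigger a prowjob restart, and how many retries are allowed.
type restartPolicy struct {
	// Reasons lists the condition reasons that are allowed to trigger a
	// restart, e.g. "DeletionByGCPControllerManager".
	Reasons []string
	// MaxRetries bounds how often a single prowjob may be restarted.
	MaxRetries int
}

// allowsRestart reports whether a pod terminated for the given reason should
// be re-created, given how many retries have already been used.
func (p restartPolicy) allowsRestart(reason string, retriesSoFar int) bool {
	if retriesSoFar >= p.MaxRetries {
		return false
	}
	for _, r := range p.Reasons {
		if r == reason {
			return true
		}
	}
	return false
}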
Yeah, I share some concerns about the restart implications. For example, with 5k-node scale tests, we may prefer to simply take the failure and leave Boskos to clean up rather than attempt to start another run, and yet with the many, many CI jobs we have, it would be difficult to properly identify and opt out all of the relevant jobs. Also, even as a GCP employee, I think we should prefer to use portable Kubernetes, but I guess this is at least somewhat more portable now ... do any of the other major vendors with spot instances set a similar status that can be matched, or do we need a different mechanism entirely? What's the use case where the next periodic attempt and/or …
I suspect for most jobs this is better, if bounded, but it's still a surprising behavior change and there's limited bandwidth to go comb through jobs and opt them in/out. For Kubernetes we could probably opt out anything referencing the Boskos scalability pools and be in OK shape. I don't think we want another best-practice field that everyone has to opt in to, though, either ... (like …). What if we have a global option to enable this, in addition to per-job opt-out? We could wait to turn this on until we've opted out any jobs with cost concerns. I'm also not sure that having a single retryCount is the best approach, but I haven't put much thought into it yet. E.g. failure to schedule is pretty distinct from the other modes (though I think we already handle that separately?)
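A rough sketch of the global-enable-plus-per-job-opt-out idea suggested above; the label key and function signature are invented for illustration only:

package plank

// optOutLabel is a hypothetical job label that lets individual jobs (for
// example, expensive 5k-node scale tests) opt out of automatic restarts.
const optOutLabel = "prow.k8s.io/restart-on-node-termination"

// restartAllowed combines a cluster-wide switch with a per-job opt-out.
func restartAllowed(globalEnabled bool, jobLabels map[string]string) bool {
	if !globalEnabled {
		return false
	}
	return jobLabels[optOutLabel] != "false"
}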
Recreate the prowjob when the node is terminated.
This prevents us from having to rerun jobs that were terminated due to a spot instance shutdown.
The pod statuses that we see on pods terminated due to a spot instance shutdown look like this:
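The original status dumps are not reproduced in this text. A hypothetical illustration of one such condition, based on the DeletionByGCPControllerManager reason checked in the diff above (the exact condition type and message on a real cluster may differ):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative only: a disruption-style pod condition as it might appear
	// on a pod whose node was reclaimed as a GCP spot instance.
	cond := corev1.PodCondition{
		Type:    "DisruptionTarget",
		Status:  corev1.ConditionTrue,
		Reason:  "DeletionByGCPControllerManager",
		Message: "deletion initiated by the GCP controller manager (node terminated)",
	}
	fmt.Printf("type=%s status=%s reason=%s\n", cond.Type, cond.Status, cond.Reason)
}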
Most code in this PR is a copy of the existing ErrorOnEviction code, updated for ErrorOnTermination.