Add option to restart prowjob when node is terminated (enabled by default) #117
Conversation
✅ Deploy Preview for k8s-prow ready!
Hello @inteon, please take a look! 😃
The code still seems to be used in prow-controller-manager: prow/cmd/prow-controller-manager/main.go (lines 202 to 204 in 0a35181).
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: inteon. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing …
/label tide/merge-method-squash
Info: we use this change on our cert-manager Prow cluster https://prow.infra.cert-manager.io/ (see https://github.com/cert-manager/testing/tree/0b5dfa456f691e849bb0b3c40f3e00bd8d607127/images/prow-controller-manager-spot).
This change seems like a nice feature, but I am not sure you are covering all of the generic cases of pod terminations that are not related only to GCP. Also, this issue can be resolved by adding a pod disruption budget to your cluster. In my opinion, we shouldn't enable this feature by default. Instead, we can make the specific reasons for recreating the prowjobs configurable, which will allow users to be more specific about their infrastructure.
I am not confident about enabling this feature by default. The implementation covers only the GCP case, and a pod's termination is deeply dependent on the infrastructure. This means that in many cases a prowjob could run forever in a loop. Perhaps keep it disabled by default and allow the user to enable it in the global config? Also, to avoid the infinite reruns, perhaps it's better to keep track of the number of retries and allow the user to control the threshold. @petr-muller @stevekuznetsov @cgwalters @BenTheElder Do you guys have any input?
I updated the PR and added a RetryCount, which is now incremented every time the pod is re-created (it also counts other retries that were already present in the code). There will be a hard failure after 3 retries have been reached (we might want to make this limit configurable in the future).
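A minimal sketch of the bookkeeping described above; the names (jobStatus, recordRetry) are made up for illustration rather than the PR's actual types, assuming a plain integer counter and the hard limit of 3:

package main

import "fmt"

// jobStatus is a hypothetical stand-in for the prowjob status fields this
// PR touches; the real RetryCount lives on the ProwJob status in Prow's API.
type jobStatus struct {
	RetryCount int
}

// maxRetries mirrors the hard limit of 3 mentioned in the comment above.
const maxRetries = 3

// recordRetry increments the counter and reports whether one more pod may be
// created; once the limit is reached the job should hard-fail instead.
func recordRetry(s *jobStatus) bool {
	if s.RetryCount >= maxRetries {
		return false
	}
	s.RetryCount++
	return true
}

func main() {
	s := &jobStatus{}
	for recordRetry(s) {
		fmt.Println("re-creating pod, retry", s.RetryCount)
	}
	fmt.Println("hard failure after", s.RetryCount, "retries")
}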
@droslean I don't think a PDB would help here, since you don't get a choice as to when spot instances get taken away from you.
Yep. My only concern is whether we should allow this feature to be enabled by default. Since Prow doesn't directly know what the job will do, there can be cases where the job costs will triple if we allow this by default. I would prefer to make this configurable and let the user decide based on the infrastructure.
pkg/plank/reconciler.go (Outdated)
// On GCP, before a new spot instance is started, the old pods are garbage
// collected (if they have not been already by the Kubernetes PodGC):
// https://github.com/kubernetes/cloud-provider-gcp/blob/25e5dcc715781316bc5e39f8b17c0d5b313453f7/cmd/gcp-controller-manager/node_csr_approver.go#L1035-L1058
if condition.Reason == "DeletionByGCPControllerManager" {
Since this is GCP-related only, why not make the reasons configurable too, to allow the user to add more cases based on their infrastructure?
I made the settings configurable: 648600a
I discovered there is already a lot of logic to restart pods (and no limit): Eviction, NodeUnreachable, a PodUnknown phase. So, I added a global limit of 3 and applied that to all restarts, including the existing logic. I haven't (yet) made the restarts in case of a spot instance restart disabled by default, because I think the retryCount limit is a better solution.
I would prefer to let the user decide the list of reasons to restart the prowjob and the retry count limit. For example, in my infrastructure we run costly jobs, and this feature can potentially increase the cost since it reruns them by default for those specific reasons. Your solution is good, but I would prefer to make it configurable so the user won't be limited to hardcoded termination reasons and retry limits. @stevekuznetsov WDYT?
Getting the config semantics right might be hard, but I'm for it.
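A minimal sketch of what such a configurable restart policy might look like; the field names (Reasons, MaxRetries) are hypothetical and not the PR's actual config schema:

package plank

// restartPolicy is a hypothetical per-cluster setting: which pod termination
// reasons may trigger a prowjob restart, and how many retries are allowed.
type restartPolicy struct {
	// Reasons lists the condition reasons that are allowed to trigger a
	// restart, e.g. "DeletionByGCPControllerManager".
	Reasons []string
	// MaxRetries bounds how often a single prowjob may be restarted.
	MaxRetries int
}

// allowsRestart reports whether a pod terminated for the given reason should
// be re-created, given how many retries have already been used.
func (p restartPolicy) allowsRestart(reason string, retriesSoFar int) bool {
	if retriesSoFar >= p.MaxRetries {
		return false
	}
	for _, r := range p.Reasons {
		if r == reason {
			return true
		}
	}
	return false
}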
Yeah, I share some concerns about the restart implications. For example, with 5k-node scale tests, we may prefer to simply take the failure and leave Boskos to clean up rather than attempt to start another run, and yet with the many, many CI jobs we have, it would be difficult to properly identify and opt out all of the relevant jobs. Also, even as a GCP employee, I think we should prefer to use portable Kubernetes, but I guess this is at least somewhat more portable now ... do any of the other major vendors with spot instances set a similar status that can be matched, or do we need a different mechanism entirely? What's the use case where the next periodic attempt and/or …
I suspect for most jobs this is better, if bounded, but it's still a surprising behavior change and there's limited bandwidth to go comb through jobs and opt them in/out. For Kubernetes we could probably opt out anything referencing the Boskos scalability pools and be in OK shape. I don't think we want another best-practice field that everyone has to opt in to, though, either ... (like …). What if we have a global option to enable this, in addition to per-job opt-out? We could wait to turn this on until we've opted out any jobs with cost concerns. I'm also not sure that having a single retryCount is the best approach, but I haven't put much thought into it yet. E.g. failure to schedule is pretty distinct from the other modes (though I think we already handle that separately?)
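A rough sketch of the global-enable-plus-per-job-opt-out idea suggested above; the label key and function signature are invented for illustration only:

package plank

// optOutLabel is a hypothetical job label that lets individual jobs (for
// example, expensive 5k-node scale tests) opt out of automatic restarts.
const optOutLabel = "prow.k8s.io/restart-on-node-termination"

// restartAllowed combines a cluster-wide switch with a per-job opt-out.
func restartAllowed(globalEnabled bool, jobLabels map[string]string) bool {
	if !globalEnabled {
		return false
	}
	return jobLabels[optOutLabel] != "false"
}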
Recreate the prowjob when the node is terminated.
This prevents us from having to rerun jobs that were terminated due to a spot instance shutdown.
The pod statuses that we see on pods terminated due to a spot instance shutdown look like this:
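The original status dumps are not reproduced in this text. A hypothetical illustration of one such condition, based on the DeletionByGCPControllerManager reason checked in the diff above (the exact condition type and message on a real cluster may differ):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	// Illustrative only: a disruption-style pod condition as it might appear
	// on a pod whose node was reclaimed as a GCP spot instance.
	cond := corev1.PodCondition{
		Type:    "DisruptionTarget",
		Status:  corev1.ConditionTrue,
		Reason:  "DeletionByGCPControllerManager",
		Message: "deletion initiated by the GCP controller manager (node terminated)",
	}
	fmt.Printf("type=%s status=%s reason=%s\n", cond.Type, cond.Status, cond.Reason)
}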
Most code in this PR is a copy of the existing ErrorOnEviction code, updated for ErrorOnTermination.