Backoff Limit Per Index For Indexed Jobs #3850

Open
8 tasks done
jensentanlo opened this issue Feb 7, 2023 · 36 comments
Assignees
Labels
sig/apps Categorizes an issue or PR as relevant to SIG Apps. stage/beta Denotes an issue tracking an enhancement targeted for Beta status wg/batch Categorizes an issue or PR as relevant to WG Batch.

Comments

@jensentanlo
Contributor

jensentanlo commented Feb 7, 2023

Enhancement Description

@k8s-ci-robot k8s-ci-robot added the needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Feb 7, 2023
@jensentanlo
Contributor Author

/sig apps
/wg batch

@k8s-ci-robot k8s-ci-robot added sig/apps Categorizes an issue or PR as relevant to SIG Apps. wg/batch Categorizes an issue or PR as relevant to WG Batch. and removed needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. labels Feb 7, 2023
@jensentanlo jensentanlo changed the title Backoff Limit Per Job Backoff Limit Per Index For Indexed Jobs Feb 7, 2023
@alculquicondor
Member

/assign @mimowo

@alculquicondor
Member

In addition to configuring the backoff per index, we should probably have FailIndex as one of the actions for pod failure policies.
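
For readers following along, here is a minimal sketch of how the two ideas might combine in one Job manifest: a per-index backoff limit plus a pod failure policy rule whose action is FailIndex. It assumes the field name `backoffLimitPerIndex` and the requirement for `completionMode: Indexed` with `restartPolicy: Never`, as I understand the batch/v1 Job API. The function name, Job name, image, and exit code 42 are hypothetical placeholders, not taken from this thread.

```python
import json


def indexed_job_manifest(name, completions, backoff_limit_per_index):
    """Build a batch/v1 Job manifest as a plain dict (illustrative only)."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {"name": name},
        "spec": {
            "completions": completions,
            # Per-index backoff is only meaningful for Indexed Jobs.
            "completionMode": "Indexed",
            # Assumed field name: retries allowed for each index separately.
            "backoffLimitPerIndex": backoff_limit_per_index,
            "podFailurePolicy": {
                "rules": [
                    {
                        # Fail the index right away, skipping its remaining retries.
                        "action": "FailIndex",
                        # Exit code 42 is a hypothetical "non-retriable" code.
                        "onExitCodes": {"operator": "In", "values": [42]},
                    }
                ]
            },
            "template": {
                "spec": {
                    "restartPolicy": "Never",
                    "containers": [
                        {
                            "name": "worker",      # hypothetical container name
                            "image": "busybox",    # hypothetical image
                            "command": ["sh", "-c", "exit 0"],
                        }
                    ],
                }
            },
        },
    }


if __name__ == "__main__":
    # Print JSON that could be piped to `kubectl apply -f -`.
    print(json.dumps(indexed_job_manifest("demo-indexed-job", 5, 1), indent=2))
```

With a manifest like this, a pod exiting with the listed code would mark only its own index as failed, while other failures would be retried up to the per-index limit.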

@soltysh
Contributor

soltysh commented May 30, 2023

/milestone v1.28
/stage alpha
/label lead-opted-in

@k8s-ci-robot k8s-ci-robot added the stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status label May 30, 2023
@k8s-ci-robot k8s-ci-robot added this to the v1.28 milestone May 30, 2023
@k8s-ci-robot k8s-ci-robot added the lead-opted-in Denotes that an issue has been opted in to a release label May 30, 2023
@aramase
Member

aramase commented Jun 14, 2023

Hello @mimowo 👋, Enhancements team here.

Just checking in as we approach enhancements freeze on 01:00 UTC Friday, 16th June 2023.

This enhancement is targeting stage alpha for 1.28 (correct me if otherwise).

Here's where this enhancement currently stands:

  • KEP readme using the latest template has been merged into the k/enhancements repo.
  • KEP status is marked as implementable for latest-milestone: 1.28
  • KEP readme has an updated, detailed test plan section filled out
  • KEP readme has up to date graduation criteria
  • KEP has a production readiness review that has been completed and merged into k/enhancements.

The status of this enhancement is marked as at risk. Please keep the issue description up-to-date with appropriate stages as well. Thank you!

@mimowo
Contributor

mimowo commented Jun 14, 2023

@aramase I think the first point is addressed as the KEP has been merged: #3967.

@mimowo
Contributor

mimowo commented Jun 15, 2023

@aramase is there anything missing to make it tracked?

@Atharva-Shinde
Contributor

Hey @mimowo
With all the KEP requirements in place and merged into k/enhancements, this enhancement is all good for the upcoming enhancements freeze. 🚀

The status of this enhancement is marked as tracked. Please keep the issue description up-to-date with appropriate stages as well. Thank you :)

@Rishit-dagli
Member

Hello @mimowo 👋, 1.28 Docs Lead here.

Does the enhancement work planned for 1.28 require any new docs or modifications to existing docs?

If so, please follow the steps here to open a PR against the dev-1.28 branch in the k/website repo. This PR can be just a placeholder at this time and must be created before Thursday 20th July 2023.

Also, take a look at Documenting for a release to familiarize yourself with the docs requirements for the release.

Thank you!

@aramase
Member

aramase commented Jul 17, 2023

Hey again @mimowo 👋

Just checking in as we approach Code freeze at 01:00 UTC Friday, 19th July 2023.

Here’s the enhancement’s state for the upcoming code freeze:

  • All the PRs that are related to your enhancement are linked in the above issue description (for tracking purposes). This includes code, tests, and documentation related PR/s.
  • All code related PR/s are merged or are in merge-ready state (i.e., they have the approved and lgtm labels applied) by the code freeze deadline. This includes any tests related PR/s too.

I see kubernetes/kubernetes#118009 PR in the issue description. If there are any other k/k related PR(s) that we should be tracking for this KEP please link them in the issue description above.

As always, we are here to help if any questions come up. Thanks!

@Atharva-Shinde
Contributor

Hey @mimowo 👋 Enhancements Lead here,
With kubernetes/kubernetes#118009 and
kubernetes/kubernetes#119294 merged as per the issue description, this enhancement is now tracked for v1.28 Code Freeze!

@jensentanlo
Contributor Author

I performed some manual testing on this feature and saw everything working as expected; a short summary of the details is below if you're interested.


I ran indexed jobs with completions = 1000 on a local kind cluster (1.28) with the alpha feature gate enabled,
mainly checking whether:

  1. All pods ran to completion or failure
  2. All failed indices are correctly recorded on the job object
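
A check like the two points above could be scripted roughly as follows. This is a sketch, not the procedure actually used in this test: it assumes the Job status reports `completedIndexes` and `failedIndexes` as compressed interval strings such as "1,3-5,7", which is my understanding of the batch/v1 API, and both helper names are hypothetical.

```python
def parse_index_intervals(text):
    """Expand a compressed interval string like "1,3-5,7" into a list of ints."""
    indexes = []
    for part in text.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate an empty string or stray commas
        if "-" in part:
            first, last = part.split("-", 1)
            indexes.extend(range(int(first), int(last) + 1))
        else:
            indexes.append(int(part))
    return indexes


def unaccounted_indexes(status, completions):
    """Return indexes that are neither completed nor failed in a Job status dict."""
    completed = set(parse_index_intervals(status.get("completedIndexes", "")))
    failed = set(parse_index_intervals(status.get("failedIndexes", "")))
    return sorted(set(range(completions)) - completed - failed)


if __name__ == "__main__":
    # Hypothetical status, e.g. taken from `kubectl get job <name> -o json`.
    status = {"completedIndexes": "0-2,4", "failedIndexes": "3"}
    print(unaccounted_indexes(status, 6))
```

An empty result would mean every index ended up either completed or recorded as failed.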

Related to indexed jobs in general rather than this specific feature, I was also interested in the delete behavior, because I've had trouble with bulk deletions of non-indexed jobs in the past. Everything was cleaned up correctly and relatively quickly, even though I was churning through a couple of indexed jobs (so thousands of pods) on my local machine.

@salehsedghpour

/remove-label lead-opted-in

@k8s-ci-robot k8s-ci-robot removed the lead-opted-in Denotes that an issue has been opted in to a release label Jan 6, 2024
@salehsedghpour

Hello 👋 1.30 Enhancements Lead here,

I'm closing milestone 1.29 now.
If you wish to progress this enhancement in v1.30, please follow the instructions here to opt in the enhancement, make sure the lead-opted-in label is set so it can be added to the tracking board, and finally add /milestone v1.30. Thanks!

/milestone clear

@kannon92
Contributor

kannon92 commented Feb 1, 2024

@mimowo if I am not mistaken, this feature should have a stage of beta.

It turns out I can update the label!

@kannon92
Contributor

kannon92 commented Feb 1, 2024

/stage beta

@k8s-ci-robot k8s-ci-robot added stage/beta Denotes an issue tracking an enhancement targeted for Beta status and removed stage/alpha Denotes an issue tracking an enhancement targeted for Alpha status labels Feb 1, 2024
@salehsedghpour

Hi @soltysh, @mimowo, and @kannon92, Enhancements Team here! Just wondering if you are aiming to have this enhancement in 1.30. If so, please follow the instructions here to opt in the enhancement, make sure the lead-opted-in label is set so it can be added to the tracking board, and finally add /milestone v1.30. Thanks!

@alculquicondor
Member

No plans to graduate in this release.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 1, 2024
@alculquicondor
Member

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 2, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2024
@mimowo
Contributor

mimowo commented Jul 31, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 31, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2024
@mimowo
Contributor

mimowo commented Oct 29, 2024

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 29, 2024
@mbobrovskyi

Hi folks! We are using this feature in kjobctl here. The use case requires moving on to the next index even if there's an issue with the previous one, so that we don't stop running the job when one index fails. It would be great to move this forward to stable.

Projects
Status: Tracked
Status: Tracked for Code Freeze
Status: Backlog