Release Checklist
- All OWNERS must LGTM the release proposal
- Verify that the changelog in this issue is up-to-date
- For major or minor releases (v$MAJ.$MIN.0), create a new release branch.
  - An OWNER creates a vanilla release branch with `git branch release-$MAJ.$MIN main`
  - An OWNER pushes the new release branch with `git push upstream release-$MAJ.$MIN` (assuming `upstream` is the remote for kubernetes-sigs/jobset; `git push` expects the remote name before the branch)
- Update things like README, deployment templates, docs, configuration, and test/e2e flags.
  Submit a PR against the release branch:
- An OWNER prepares a draft release.
  - Write the changelog into the draft release.
  - Run `make artifacts IMAGE_REGISTRY=registry.k8s.io/jobset GIT_TAG=$VERSION` to generate the artifacts, and upload the files in the `artifacts` folder to the draft release.
- An OWNER creates a signed tag by running `git tag -s $VERSION` and inserts the changelog into the tag description.
  To perform this step, you need a PGP key registered on GitHub.
- An OWNER pushes the tag with `git push upstream $VERSION`
  - This triggers prow to build and publish the staging container image `gcr.io/k8s-staging-jobset/jobset:$VERSION`
- Submit a PR against k8s.io, updating `k8s.gcr.io/images/k8s-staging-jobset/images.yaml` to promote the container images to production:
- Wait for the PR to be merged and verify that the image `registry.k8s.io/jobset/jobset:$VERSION` is available.
- Publish the draft release prepared at the GitHub releases page.
- Add a link to the tagged release in this issue:
- Send an announcement email to `sig-apps@kubernetes.io`, `sig-scheduling@kubernetes.io` and `wg-batch@kubernetes.io` with the subject `[ANNOUNCE] JobSet $VERSION is released`
- Add a link to the release announcement in this issue:
- For a major or minor release, update `README.md` and `docs/setup/install.md` in the `main` branch:
- For a major or minor release, create an unannotated devel tag in the `main` branch on the first commit that gets merged after the release branch has been created (presumably the README update commit above), and push the tag:
  `DEVEL=v$MAJ.$(($MIN+1)).0-devel; git tag $DEVEL main && git push upstream $DEVEL`
  This ensures that the devel builds on the `main` branch will have a meaningful version number (for example, `v0.7.0-devel` after `v0.6.0`).
- Close this issue
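The checklist derives several names from `$VERSION`: the release branch, the staging and production image references, the announcement subject, and, for `vMAJ.MIN.0` releases, the devel tag. A minimal bash sketch of that derivation follows, meant to be sourced before starting. It assumes `$VERSION` has the form `vMAJ.MIN.PATCH` and that the devel tag names the next minor version; the helpers `parse_version`, `release_names` and `check_image` are hypothetical and not part of the JobSet repository.

```bash
#!/usr/bin/env bash
# Sketch: derive the release checklist's names from a single VERSION string.
# Assumes VERSION looks like vMAJ.MIN.PATCH (e.g. v0.6.0).
# parse_version, release_names and check_image are hypothetical helpers.

# parse_version splits vMAJ.MIN.PATCH into the globals MAJ, MIN and PATCH,
# and returns 1 if the string does not have that shape.
parse_version() {
  local re='^v([0-9]+)\.([0-9]+)\.([0-9]+)$'
  if [[ ! "$1" =~ $re ]]; then
    echo "invalid version: $1 (expected vMAJ.MIN.PATCH)" >&2
    return 1
  fi
  MAJ="${BASH_REMATCH[1]}"
  MIN="${BASH_REMATCH[2]}"
  PATCH="${BASH_REMATCH[3]}"
}

# release_names prints key=value pairs for the names used in the checklist.
release_names() {
  local version="$1"
  parse_version "$version" || return 1
  # Patch releases are cut from the existing release-MAJ.MIN branch.
  echo "branch=release-${MAJ}.${MIN}"
  echo "staging_image=gcr.io/k8s-staging-jobset/jobset:${version}"
  echo "prod_image=registry.k8s.io/jobset/jobset:${version}"
  echo "subject=[ANNOUNCE] JobSet ${version} is released"
  # Only major or minor releases (PATCH == 0) need a devel tag on main.
  if [[ "$PATCH" == "0" ]]; then
    echo "devel_tag=v${MAJ}.$((10#${MIN} + 1)).0-devel"
  fi
}

# check_image succeeds if the promoted image's manifest can be fetched.
# Needs network access and a docker CLI that supports `docker manifest inspect`.
check_image() {
  docker manifest inspect "registry.k8s.io/jobset/jobset:$1" >/dev/null
}
```

For example, `release_names v0.6.0` prints the branch `release-0.6` and the devel tag `v0.7.0-devel`, while `release_names v0.6.1` omits the devel tag. Rejecting malformed versions up front avoids tagging or building artifacts under a wrong name.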
Changelog
Highlights
- New JobSet Failure Policy API: allows users to configure different behavior for different types of errors, enabling them to use compute resources more efficiently and improve ML training goodput.
- Add Coordinator field to JobSet spec, enabling users to define a global coordinator pod for distributed ML/HPC workloads. The stable network endpoint for this pod will be added as a label and annotation to every Job and Pod in the JobSet for easy use in application code. A common use case for this is TPU Multislice training with multiple different Job templates. See the linked issue for details.
- Add global Job index label/annotation to every Job and Pod, which is needed to support TPU Multislice training with multiple different Job templates. See the linked issue for details.
- New metrics for JobSet (#614)
- Improved unit, integration, and e2e test coverage
- Bug fixes
- New examples and documentation, including client-go, coordinator, Argo Workflows, and prometheus-operator examples and a JobSet API reference
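The coordinator endpoint and the global Job index are both derived from the JobSet spec. The bash sketch below illustrates the arithmetic under two assumptions not stated in the highlights: that the coordinator endpoint follows the JobSet pod hostname pattern `<jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>` (with the subdomain defaulting to the JobSet name), and that global Job indices count Jobs across replicated jobs in spec order. The function names are hypothetical, and the exact label and annotation keys are not shown here.

```bash
#!/usr/bin/env bash
# Sketch only; coordinator_endpoint and global_job_index are hypothetical helpers.

# coordinator_endpoint JOBSET REPLICATED_JOB JOB_INDEX POD_INDEX [SUBDOMAIN]
# Builds <jobset>-<replicatedJob>-<jobIndex>-<podIndex>.<subdomain>;
# the subdomain defaults to the JobSet name (assumed default).
coordinator_endpoint() {
  local jobset="$1" rjob="$2" job_idx="$3" pod_idx="$4"
  local subdomain="${5:-$jobset}"
  echo "${jobset}-${rjob}-${job_idx}-${pod_idx}.${subdomain}"
}

# global_job_index POSITION JOB_INDEX REPLICAS...
# POSITION is the 0-based position of the Job's replicated job in the spec,
# JOB_INDEX is its index within that replicated job, and REPLICAS lists the
# replica count of every replicated job in spec order. The result is the sum
# of the replicas of all earlier replicated jobs plus JOB_INDEX.
global_job_index() {
  local pos="$1" job_idx="$2"
  shift 2
  local replicas=("$@")
  local offset=0 i
  for ((i = 0; i < pos; i++)); do
    offset=$((offset + replicas[i]))
  done
  echo $((offset + job_idx))
}
```

For a JobSet named `my-jobset` with replicated jobs `driver` (1 replica) and `workers` (4 replicas), `coordinator_endpoint my-jobset driver 0 0` prints `my-jobset-driver-0-0.my-jobset`, and `global_job_index 1 2 1 4` (the third `workers` Job) prints `3`.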
What's Changed
- feat: add e2e test for ttl seconds after finished in jobset by @dejanzele in #511
- add publish not ready headless service to jobset by @kannon92 in #505
- use kube-openapi rather than code generator openapi-gen by @kannon92 in #522
- Allow passing args to ginkgo for integration tests by @danielvegamyhre in #525
- Refactor create jobs by @danielvegamyhre in #516
- Do not default the managedBy field by @mimowo in #528
- feat: add event recorder event by @googs1025 in #507
- use t.Errorf instead of t.Fatalf by @googs1025 in #532
- Fix path for the error when attempting to mutate managedBy by @mimowo in #527
- Fix bug when checking if a JobSet is active during tests. by @jedwins1998 in #531
- Correct typo in configurable failure policy KEP. by @jedwins1998 in #539
- fix: fix ci error caused by typo by @googs1025 in #544
- Bump the kubernetes group with 4 updates by @dependabot in #542
- Bump github.com/onsi/gomega from 1.32.0 to 1.33.0 by @dependabot in #543
- docs: fix site url not found by @googs1025 in #541
- use hugo param to define variables in md language by @googs1025 in #540
- add unit tests for createHeadlessSvcIfNecessary by @dejanzele in #526
- test: add pod controller unit test by @googs1025 in #490
- Add comment explaining why we don't unconditionally compute firstFailedJob by @danielvegamyhre in #549
- Bump github.com/onsi/ginkgo/v2 from 2.17.1 to 2.17.2 by @dependabot in #552
- Track which features in roadmap have been released by @danielvegamyhre in #554
- docs: using kustomize for adjusting resources by @omerap12 in #558
- Bump github.com/onsi/gomega from 1.33.0 to 1.33.1 by @dependabot in #560
- Don't reconcile JobSets with deletion timestamp set by @danielvegamyhre in #562
- Improve the API generated docs for managedBy by @mimowo in #565
- chore: Upgrade e2e local image by @googs1025 in #567
- Bump github.com/onsi/ginkgo/v2 from 2.17.2 to 2.17.3 by @dependabot in #569
- Add support for feature gates by @googs1025 in #557
- Implement configurable failure policy. by @jedwins1998 in #537
- Update the JobSet version to 0.5.1 for installation by @mimowo in #577
- Bump github.com/onsi/ginkgo/v2 from 2.17.3 to 2.19.0 by @dependabot in #581
- Relax validation on ReplicatedJob PodTemplates of suspended JobSets by @danielvegamyhre in #580
- update makefile kind version to v1.30.0 by @googs1025 in #589
- Propagate Job pod template updates to suspended jobs when resuming by @danielvegamyhre in #590
- docs: update to v0.5.2 by @googs1025 in #593
- fix: fix log to avoid panic by @googs1025 in #595
- avoid log panic by @googs1025 in #598
- Add omitempty to annotation of OnJobFailureReasons. by @jedwins1998 in #596
- update readme docs e2e test version to v1.30 by @googs1025 in #602
- Update _index.md `MASTER_ADDR` by @song-william in #604
- Add client-go example by @danielvegamyhre in #606
- Wait for the webhook service to be listening before advertising the Jobset replica as ready. by @mbobrovskyi in #608
- docs: add simple example for network field by @googs1025 in #550
- feat: add terminalState to jobset status by @googs1025 in #594
- Integration test improvement: rename "update" to "step" by @danielvegamyhre in #610
- docs: add argo workflow example for jobset by @googs1025 in #612
- docs: add JobSet API reference by @googs1025 in #611
- docs: fix typo, Github -> GitHub by @highpon in #615
- Allow mutating schedulingGates when the Jobset is suspended by @mimowo in #623
- Add Coordinator field to JobSet spec by @danielvegamyhre in #618
- Validation for Coordinator field by @danielvegamyhre in #627
- Add example for coordinator by @danielvegamyhre in #628
- docs: add prometheus-operator example for jobset by @googs1025 in #629
- Bump github.com/onsi/gomega from 1.33.1 to 1.34.0 by @dependabot in #631
- Bump github.com/onsi/ginkgo/v2 from 2.19.0 to 2.19.1 by @dependabot in #632
- feat: add metrics for jobset by @googs1025 in #614
- docs: update metrics info for site by @googs1025 in #633
- chore: add github issue, pr template by @googs1025 in #634
- Bump github.com/onsi/gomega from 1.34.0 to 1.34.1 by @dependabot in #638
- fix error output by @googs1025 in #636
- Bump k8s dependencies to 1.30 dependencies and modify update-codegen.sh to be compatible with new code-generator by @danielvegamyhre in #641
- Fix bug in replicatedJobByName by @danielvegamyhre in #645
- Allow to update JobSets on suspend by @mimowo in #644
- Refactor jobset webhook by @danielvegamyhre in #646
- add the unparam linter to golangci and fix those issues flagged by @kannon92 in #643
- drop job-name from labels as it is not used by @kannon92 in #642
- Bump github.com/onsi/ginkgo/v2 from 2.19.1 to 2.20.0 by @dependabot in #647
- Add new job-id annotation to assign globally unique job index to each job by @danielvegamyhre in #650
- Bump github.com/prometheus/client_golang from 1.19.1 to 1.20.0 by @dependabot in #653
- update to k8s 0.30.4 by @kannon92 in #654