@calvin0327 calvin0327 commented Apr 26, 2024

I found some error messages when using the JobFlow feature. I created a JobFlow resource based on these examples:
https://github.com/volcano-sh/volcano/blob/master/example/jobflow/JobFlow.yaml
https://github.com/volcano-sh/volcano/blob/master/example/jobflow/JobTemplate.yaml

Here are the controller manager logs:

[root@master01 ~]# kubectl logs -n volcano-system volcano-controllers-744bc4796d-jbncj | grep ^E
E0425 10:34:49.690189       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:34:49.707411       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:34:50.321009       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:34:51.395417       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:04.721574       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:04.736015       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:05.568771       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:05.581852       1 jobflow_controller_action.go:69] Failed to update status of JobFlow default/test: Operation cannot be fulfilled on jobflows.flow.volcano.sh "test": the object has been modified; please apply your changes to the latest version and try again
E0425 10:35:20.711708       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:21.731150       1 queue_controller_action.go:85] Failed to update status of Queue default: Operation cannot be fulfilled on queues.scheduling.volcano.sh "default": the object has been modified; please apply your changes to the latest version and try again.
E0425 10:35:34.692296       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-b, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-b> is not ready
E0425 10:35:34.695945       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-b, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-b> is not ready
E0425 10:35:34.698687       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-c, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-c> is not ready
E0425 10:35:34.701790       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-c, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-c> is not ready
E0425 10:35:34.707817       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-d, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-d> is not ready
E0425 10:35:34.712693       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-d, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-d> is not ready
E0425 10:35:34.714371       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-e, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-e> is not ready
E0425 10:35:34.715187       1 jobflow_controller_action.go:300] Failed to delete job of JobFlow default/test: jobs.batch.volcano.sh "test-a" not found
E0425 10:35:34.715210       1 jobflow_controller_action.go:46] Failed to delete jobs of JobFlow default/test: jobs.batch.volcano.sh "test-a" not found
E0425 10:35:34.717377       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-e, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-e> is not ready
E0425 10:35:34.723456       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-a, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: job <default/test-a> is not ready
E0425 10:35:34.728548       1 job_controller.go:334] Failed to get job by <Queue: , Job: default/test-a, Task:default-nginx, Event:PodEvicted, ExitCode:0, Action:, JobVersion: 0> from cache: failed to find job <default/test-a>

This PR addresses only the JobFlow controller errors (the `jobflow_controller_action.go` lines in the logs above).

@volcano-sh-bot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign shinytang6
You can assign the PR to them by writing /assign @shinytang6 in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 26, 2024
Signed-off-by: calvin <wen.chen@daocloud.io>
@calvin0327 calvin0327 force-pushed the optimizate-workflow-controller branch from fb66ac0 to a6fad98 Compare April 26, 2024 09:01
@@ -63,7 +65,26 @@ func (jf *jobflowcontroller) syncJobFlow(jobFlow *v1alpha1flow.JobFlow, updateSt
	}
	jobFlow.Status = *jobFlowStatus
	updateStateFn(&jobFlow.Status, len(jobFlow.Spec.Flows))
	_, err = jf.vcClient.FlowV1alpha1().JobFlows(jobFlow.Namespace).UpdateStatus(context.Background(), jobFlow, metav1.UpdateOptions{})

	err = retry.RetryOnConflict(retry.DefaultRetry, func() error {
Contributor Author

Added a retry on resource version conflicts, so the status update does not have to wait for the next reconcile.

Member

It's truly a problem, but this approach seems like it will be more time-consuming.
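
The retry-on-conflict idea discussed in this thread can be sketched as follows. This is a minimal, stdlib-only simulation and not the PR's code: `store`, `object`, `errConflict`, `retryOnConflict`, and `updateStatusWithRetry` are hypothetical names standing in for the API server and for client-go's `retry.RetryOnConflict`, which also backs off between attempts. The sketch assumes the first attempt uses a cached copy and that the object is re-read only after a conflict.

```go
package main

import (
	"errors"
	"fmt"
)

// errConflict stands in for the API server's "object has been modified" error.
var errConflict = errors.New("the object has been modified")

// object is a hypothetical, stripped-down resource with a version counter.
type object struct {
	ResourceVersion int
	Status          string
}

// store is a hypothetical in-memory stand-in for the API server.
type store struct{ current object }

// Get returns a copy of the latest stored object.
func (s *store) Get() object { return s.current }

// UpdateStatus accepts the write only if the caller saw the latest version.
func (s *store) UpdateStatus(o object) error {
	if o.ResourceVersion != s.current.ResourceVersion {
		return errConflict
	}
	o.ResourceVersion++
	s.current = o
	return nil
}

// retryOnConflict calls fn until it succeeds, returns a non-conflict error,
// or the attempt budget is used up. No backoff is applied in this sketch.
func retryOnConflict(attempts int, fn func() error) error {
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil || !errors.Is(err, errConflict) {
			return err
		}
	}
	return err
}

// updateStatusWithRetry tries the cached copy first and re-reads the latest
// object after each conflict. It returns the number of update calls made.
func updateStatusWithRetry(s *store, cached object, status string) (int, error) {
	obj := cached
	calls := 0
	err := retryOnConflict(5, func() error {
		calls++
		obj.Status = status
		if err := s.UpdateStatus(obj); err != nil {
			obj = s.Get() // refresh so the next attempt carries the latest version
			return err
		}
		return nil
	})
	return calls, err
}

func main() {
	s := &store{current: object{ResourceVersion: 2, Status: "Pending"}}
	cached := object{ResourceVersion: 1, Status: "Pending"} // one version behind
	calls, err := updateStatusWithRetry(s, cached, "Running")
	fmt.Println(calls, err, s.Get().Status) // prints: 2 <nil> Running
}
```

In this sketch every conflict costs one extra read and one extra write, which is presumably the time cost raised in the review comment above.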

@@ -281,6 +302,12 @@ func (jf *jobflowcontroller) deleteAllJobsCreatedByJobFlow(jobFlow *v1alpha1flow
	for _, job := range jobList {
		err := jf.vcClient.BatchV1alpha1().Jobs(jobFlow.Namespace).Delete(context.Background(), job.Name, metav1.DeleteOptions{})
		if err != nil {
			if apierrors.IsNotFound(err) {
Contributor Author

@calvin0327 calvin0327 Apr 26, 2024

Ignore this error if the job no longer exists.
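
A minimal sketch of this "not found means already deleted" handling, using only the standard library. The names `errNotFound`, `jobDeleter`, `fakeJobs`, and `deleteAll` are hypothetical; in the real controller the check is `apierrors.IsNotFound(err)`, as in the hunk above. The hunk is truncated, so whether the PR continues the loop or handles the case some other way is an assumption here.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotFound stands in for the API server's "not found" error.
var errNotFound = errors.New("not found")

// jobDeleter is a hypothetical interface covering only the delete call.
type jobDeleter interface {
	Delete(name string) error
}

// fakeJobs is an in-memory stand-in for the job client: name -> exists.
type fakeJobs map[string]bool

// Delete removes the job, or reports errNotFound if it is already gone.
func (f fakeJobs) Delete(name string) error {
	if !f[name] {
		return fmt.Errorf("jobs %q: %w", name, errNotFound)
	}
	delete(f, name)
	return nil
}

// deleteAll deletes every named job. A job that is already gone counts as
// deleted, so the loop moves on instead of failing the whole operation.
func deleteAll(c jobDeleter, names []string) error {
	for _, name := range names {
		if err := c.Delete(name); err != nil {
			if errors.Is(err, errNotFound) {
				continue // already deleted: the desired state is reached
			}
			return fmt.Errorf("failed to delete job %s: %w", name, err)
		}
	}
	return nil
}

func main() {
	jobs := fakeJobs{"test-b": true, "test-c": true}
	// "test-a" is already missing, as in the log lines in the PR description.
	err := deleteAll(jobs, []string{"test-a", "test-b", "test-c"})
	fmt.Println(err, len(jobs)) // prints: <nil> 0
}
```

Treating the missing job as success makes the delete idempotent, so a repeated reconcile does not log an error for work that is already done.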

@calvin0327
Contributor Author

/auto-cc

@calvin0327
Contributor Author

@lowang-bh @hwdef PTAL

@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 24, 2024
@volcano-sh-bot
Contributor

@calvin0327: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

stale bot commented Feb 1, 2025

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 1, 2025
@hwdef
Member

hwdef commented Feb 10, 2025

I'll check this later.
Thanks for your contribution.

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Feb 10, 2025
@hwdef
Member

hwdef commented Feb 10, 2025

@calvin0327
Are you still working on this? If so, please rebase onto the master branch.

stale bot commented Apr 25, 2025

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 25, 2025
@stale stale bot closed this May 6, 2025
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/S Denotes a PR that changes 10-29 lines, ignoring generated files.

4 participants