Skip to content

Conversation

zbbkeepgoing
Copy link
Contributor

For solve this situation

  • When we use spark client model to submit spark job in k8s with volcano scheduler. hope all executors can be managed and scheduled through podgroup.

  • hope some single pods can be managed and scheduled through podgroup with volcano.

@volcano-sh-bot volcano-sh-bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 18, 2022
@zbbkeepgoing
Copy link
Contributor Author

Link to the api pull request: volcano-sh/apis#83. I will update go.mod after pr of api is merged

Copy link
Contributor

@Thor-wl Thor-wl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey. Kindly to ask some questions about the background:

  • Why do you want to separate the whole spark job to different and independent pods instead of making use of spark job or volcano job?
  • I've not had a try about that. But as what I understand, Volcano has supported to generate PodGroup for separate pods. Is there something wrong with your test?

@zbbkeepgoing
Copy link
Contributor Author

Hey. Kindly to ask some questions about the background:

  • Why do you want to separate the whole spark job to different and independent pods instead of making use of spark job or volcano job?
  • I've not had a try about that. But as what I understand, Volcano has supported to generate PodGroup for separate pods. Is there something wrong with your test?

1、In Spark's Client submit mode. all the executor pod is independent, if we want to manage them by volcano, it need to group them in a ng. And a large part of the current spark users are also submitted using the client mode.
2、Yes, volcano support generate pg for separate pods, but not support pg's minresource not support, and this pr will support it. In addition, the generated podgroup only refer the first pod, if the first pod terminated, pg will be delete auto.

So the pr is support minresource for independent pods, and make pg refer to all the independent pods.

@@ -145,6 +146,15 @@ func validatePod(pod *v1.Pod, reviewResponse *admissionv1.AdmissionResponse) str
return msg
}

func isVcJob(pod *v1.Pod) bool {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest to rename to BelongToVcJob

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

@zbbkeepgoing
Copy link
Contributor Author

/retest

@volcano-sh-bot
Copy link
Contributor

@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@@ -145,6 +146,15 @@ func validatePod(pod *v1.Pod, reviewResponse *admissionv1.AdmissionResponse) str
return msg
}

func BelongToVcJob(pod *v1.Pod) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why make this public?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved to utils.go, may it be used elsewhere in the future?

@hwdef
Copy link
Member

hwdef commented Aug 19, 2022

please compress multiple commits into a single commit.

@zbbkeepgoing
Copy link
Contributor Author

please compress multiple commits into a single commit.

done

@zbbkeepgoing
Copy link
Contributor Author

/retest

@volcano-sh-bot
Copy link
Contributor

@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to this:

/retest

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@zbbkeepgoing zbbkeepgoing requested review from Thor-wl and hwdef and removed request for shinytang6, william-wang, wpeng102, Thor-wl and hwdef August 22, 2022 05:49
@zbbkeepgoing
Copy link
Contributor Author

/assign @wangyang0616

But it can not manage and schedule multiple single pods in one pg now. And this missing ability is very useful in spark client mode and so on.
It will also play a role in some scenarios that require multiple single pods to work together (vj cannot be used).

## Key Points
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current PR implementation, users are allowed to add MinResource resource information through annotation, but podgroup has many attributes, such as: MinMember, Queue, etc. The PR does not support the setting of these information. Can you add in-range and out-of-range in the document? Instructions are easier for users to understand.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would add some limitation to the current

@wangyang0616
Copy link
Member

In the scenario you describe, I agree to enhance the podgroup’s ability to manage a batch of independent pods, but I still have some concerns. MinResources can be configured through annotations. Will other attributes of the podgroup support this method in the future? Same as @Yikun considered, if they are all introduced, then does the crd of podgroup still need to exist?

cc @william-wang @jinzhejz

@zbbkeepgoing
Copy link
Contributor Author

In the scenario you describe, I agree to enhance the podgroup’s ability to manage a batch of independent pods, but I still have some concerns. MinResources can be configured through annotations. Will other attributes of the podgroup support this method in the future? Same as @Yikun considered, if they are all introduced, then does the crd of podgroup still need to exist?

cc @william-wang @jinzhejz

For the former, I think there are suitable scenarios that we can add as needed, but for the latter, I hold different views. Because pg needs to exist as a group that manages pods, it is not just a simple CRD, it is the link between the scheduling units and the scheduling logic in the Volcano scheduler. I think this enhancement is just an increase in the PodGroup creation entry.

For example, VJ is currently the most native management entrance for PodGroups, and it may cover some batch scenarios. "spark.kubernetes.scheduler.volcano.podGroupTemplateFile" is the entry point for Spark to manage PodGroup natively, and it covers some big data scenarios. The current enhancement is to manage the PodGroup's entrance through the PodGroup's Controller, which covers some special scenarios.

Their existence has nothing to do with the PodGroup's CRD in nature, just different ways to creating and managing PodGroups. PodGroup still play a vital role in the scheduler. If we don't use PodGroup to play this link between Controller and Scheduler, just use Annotation to complete it. That is a challenge for the code complexity, scalability, and rationality of the Scheduler layer.

Signed-off-by: Binbin Zou <binbin.zou@kyligence.io>

fix e2e

Signed-off-by: Binbin Zou <binbin.zou@kyligence.io>

Update UT & Update podgroup's min resource

Signed-off-by: Binbin Zou <binbin.zou@kyligence.io>

Add user-guide doc & slove concurrent security

Signed-off-by: Binbin Zou <binbin.zou@kyligence.io>

fix doc

Signed-off-by: Binbin Zou <binbin.zou@kyligence.io>

Add Limitations

Signed-off-by: Binbin Zou <binbin.zou@kyligence.io>
@stale
Copy link

stale bot commented Jun 10, 2023

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 10, 2023
@hwdef
Copy link
Member

hwdef commented Jun 11, 2023

still needs to be reviewed

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jun 11, 2023
@lowang-bh
Copy link
Member

In my opition, we can create a vocano podgroup manully to handle those 3 pods. Then those pods will be contolled via the podgroup.

image: busybox:1.28
command: ['sh', '-c', 'echo "Hello, busybox3!" && sleep 3600']
```
2. Kubectl describe pg.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opition, you can create a vocano podgroup manully to handle those 3 pods. Then those pods will be contolled via the podgroup.

@stale
Copy link

stale bot commented Oct 15, 2023

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Oct 15, 2023
@stale stale bot closed this Dec 15, 2023
@hwdef
Copy link
Member

hwdef commented Dec 15, 2023

/reopen

@volcano-sh-bot
Copy link
Contributor

@hwdef: Reopened this PR.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Dec 15, 2023
@volcano-sh-bot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign william-wang
You can assign the PR to them by writing /assign @william-wang in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@lowang-bh
Copy link
Member

lowang-bh commented Dec 16, 2023

/hold

Copy link

stale bot commented Mar 17, 2024

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Mar 17, 2024
@volcano-sh-bot volcano-sh-bot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Apr 9, 2024
@stale stale bot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 9, 2024
@volcano-sh-bot
Copy link
Contributor

@zbbkeepgoing: PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Copy link

stale bot commented Apr 26, 2025

Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward?

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Apr 26, 2025
@stale stale bot closed this May 6, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants