# Managed and scheduled multiple single pods with podgroup #2358
## Conversation
Link to the API pull request: volcano-sh/apis#83. I will update go.mod after the API PR is merged.
Hey, I'd like to ask some questions about the background:
- Why do you want to split the whole Spark job into separate, independent pods instead of using a Spark job or a Volcano Job?
- I haven't tried it myself, but as I understand it, Volcano already supports generating a PodGroup for separate pods. Did something go wrong in your test?
1. In Spark's client submit mode, all executor pods are independent. If we want Volcano to manage them, they need to be grouped into one PodGroup, and a large share of current Spark users submit in client mode. So this PR supports minResources for independent pods and makes the PodGroup refer to all of the independent pods.
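A minimal sketch of what such an independent executor pod could look like. The `scheduling.k8s.io/group-name` annotation is Volcano's existing way to attach a pod to a PodGroup; the min-resources annotation name and all values below are assumptions for illustration, not necessarily the API this PR introduces.

```yaml
# Hypothetical sketch: an independent Spark executor pod joined to a shared PodGroup.
apiVersion: v1
kind: Pod
metadata:
  name: spark-exec-1
  annotations:
    scheduling.k8s.io/group-name: spark-client-pg
    # Assumed annotation carrying the group's minResources (illustrative only).
    volcano.sh/group-min-resources: '{"cpu": "4", "memory": "8Gi"}'
spec:
  schedulerName: volcano
  containers:
    - name: executor
      image: spark:3.3.0
      resources:
        requests:
          cpu: "2"
          memory: "4Gi"
```

Submitting several such pods with the same group-name annotation would let the PodGroup controller aggregate them into one scheduling unit.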
```diff
@@ -145,6 +146,15 @@ func validatePod(pod *v1.Pod, reviewResponse *admissionv1.AdmissionResponse) str
 	return msg
 }
 
+func isVcJob(pod *v1.Pod) bool {
```
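The hunk above only shows the new function's signature. A self-contained sketch of one plausible implementation follows; the real function takes a `*v1.Pod` from `k8s.io/api/core/v1`, and the exact conditions it checks may differ from this guess (here: the `volcano.sh/job-name` annotation or an owner reference to a Volcano Job).

```go
package main

import "fmt"

// PodMeta is a minimal stand-in for the pod metadata the real check inspects.
type PodMeta struct {
	Annotations map[string]string
	OwnerKinds  []string
}

// isVcJob sketches one plausible check: a pod belongs to a Volcano Job if it
// carries the volcano.sh/job-name annotation or is owned by a Job object.
func isVcJob(pod PodMeta) bool {
	if _, ok := pod.Annotations["volcano.sh/job-name"]; ok {
		return true
	}
	for _, kind := range pod.OwnerKinds {
		if kind == "Job" { // Volcano's batch.volcano.sh Job kind
			return true
		}
	}
	return false
}

func main() {
	managed := PodMeta{Annotations: map[string]string{"volcano.sh/job-name": "spark"}}
	standalone := PodMeta{Annotations: map[string]string{}}
	fmt.Println(isVcJob(managed), isVcJob(standalone)) // prints: true false
}
```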
Suggest renaming it to `BelongToVcJob`.
done
/retest
@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
```diff
@@ -145,6 +146,15 @@ func validatePod(pod *v1.Pod, reviewResponse *admissionv1.AdmissionResponse) str
 	return msg
 }
 
+func BelongToVcJob(pod *v1.Pod) bool {
```
Why make this public?
I moved it to utils.go; it may be used elsewhere in the future.
Please squash the multiple commits into a single commit.
done
/retest
@zbbkeepgoing: Cannot trigger testing until a trusted user reviews the PR and leaves an `/ok-to-test` message.
/assign @wangyang0616
But it cannot manage and schedule multiple single pods in one PodGroup now, and this missing ability is very useful in Spark client mode and similar cases. It will also play a role in some scenarios that require multiple single pods to work together (where a Volcano Job cannot be used).
## Key Points
In the current PR implementation, users are allowed to add MinResources information through an annotation, but a PodGroup has many attributes, such as MinMember, Queue, etc., and the PR does not support setting those. Can you state what is in scope and out of scope in the document? Explicit instructions would be easier for users to understand.
I will add some limitations to the current document.
In the scenario you describe, I agree with enhancing the PodGroup's ability to manage a batch of independent pods, but I still have some concerns. MinResources can be configured through annotations; will other attributes of the PodGroup support this method in the future? As @Yikun also considered, if they are all introduced, does the PodGroup CRD still need to exist?
For the former, I think there are suitable scenarios that we can add as needed, but for the latter I hold a different view. PodGroup needs to exist as the group that manages pods; it is not just a simple CRD, it is the link between the scheduling units and the scheduling logic in the Volcano scheduler. This enhancement only adds another PodGroup creation entry point. For example, the Volcano Job is currently the most native management entrance for PodGroups, and it covers some batch scenarios; `spark.kubernetes.scheduler.volcano.podGroupTemplateFile` is the entry point for Spark to manage PodGroups natively, and it covers some big-data scenarios; the current enhancement manages PodGroups through the PodGroup controller, covering some special scenarios. Their existence has nothing to do with the PodGroup CRD in essence; they are just different ways of creating and managing PodGroups. PodGroup still plays a vital role in the scheduler. If we did not use PodGroup as this link between controller and scheduler and instead completed it purely with annotations, that would be a challenge for the code complexity, scalability, and rationality of the scheduler layer.
Squashed commits (all Signed-off-by: Binbin Zou <binbin.zou@kyligence.io>):
- fix e2e
- Update UT & update podgroup's min resource
- Add user-guide doc & solve concurrent security
- fix doc
- Add Limitations
Is this still relevant? If so, what is blocking it? Is there anything you can do to help move it forward? This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
still needs to be reviewed
In my opinion, we can create a Volcano PodGroup manually to handle those 3 pods. Then those pods will be controlled via the PodGroup.
      image: busybox:1.28
      command: ['sh', '-c', 'echo "Hello, busybox3!" && sleep 3600']
```
2. `kubectl describe pg`.
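The manually created PodGroup mentioned above could be sketched as a standalone manifest like the following. The `minMember`, `queue`, and `minResources` fields are real PodGroup spec fields, but all values here are illustrative assumptions for three busybox pods that would reference the group via the `scheduling.k8s.io/group-name` annotation.

```yaml
# Hypothetical manual PodGroup covering the three busybox pods.
apiVersion: scheduling.volcano.sh/v1beta1
kind: PodGroup
metadata:
  name: busybox-pg
spec:
  minMember: 3            # gang-schedule only when all three pods can run
  queue: default
  minResources:           # illustrative aggregate of the three pods' requests
    cpu: "3"
    memory: 192Mi
```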
/reopen
@hwdef: Reopened this PR.
[APPROVALNOTIFIER] This PR is NOT APPROVED.
This pull-request has been approved by:
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing `/approve` in a comment. The full list of commands accepted by this bot can be found here.
/hold
@zbbkeepgoing: PR needs rebase.
To summarize the motivation: when we use Spark client mode to submit a Spark job on Kubernetes with the Volcano scheduler, we hope all executors can be managed and scheduled through a PodGroup; more generally, we hope that any set of single pods can be managed and scheduled through a PodGroup with Volcano.