KEP-2170: Add manifest overlays for standalone installation #2527
base: master
Conversation
Signed-off-by: Xinmin Du <2812493086@qq.com>
[APPROVALNOTIFIER] This PR is NOT APPROVED. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Thanks for this great contribution @Doris-xm! Just one question for you and other folks.
Should we remove the `manifests/overlays/manager` and `manifests/overlays/runtime` overlays? Since we can install them with an all-in-one command:

```shell
kubectl apply --server-side -k "github.com/kubeflow/trainer/manifests/overlays/standalone"
```

WDYT? @Doris-xm @kubeflow/wg-training-leads @astefanutti
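For context, a standalone overlay that aggregates everything into one install might look roughly like the following sketch. The exact base paths and resource list here are assumptions for illustration, not taken from this PR:

```yaml
# manifests/overlays/standalone/kustomization.yaml (hypothetical sketch)
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Namespace where all resources are deployed.
namespace: kubeflow-system

resources:
- ../../base/crds
- ../../base/runtimes
- ../../base/manager
```

A single `kubectl apply --server-side -k` against such a directory would then install CRDs, runtimes, and the manager together.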
Both are still worth it for GitOps.
Yes, we discussed before that some users might want to install only the manager or only the runtimes.
@tenzen-y Actually, I can't imagine a use case where the runtimes would be installed without the manager, since they are deeply coupled with the manager release.
Do we want to keep only the manager in a separate overlay for now, and add the runtimes when we get user feedback for it?
In production (not only for GitOps), users want to install only the required components. However, I think the runtimes are optional, since not all runtimes are required by every user.
```yaml
namespace: kubeflow-system

resources:
- namespace.yaml
```
Can we re-use the namespace from the manager overlay?
I wonder if we could just rely on the manager overlay, since they are almost the same when it comes to installing the manager.
It sounds weird to re-use only the namespace from the manager overlay.
@andreyvelich Do you mean we re-use the file from the `manager` folder, like:

```yaml
resources:
- ../manager/namespace.yaml
```

I just wonder whether the `standalone` overlay should be self-contained or depend on the `manager` overlay?
> I wonder if we could just rely on the manager overlay, since they are almost the same in installing manager.

We can't do that, since the manager overlay includes the CRDs and the manager, and the runtimes should be installed after the CRDs and before the manager.

> Do you mean we re-use the file from the folder manager, like:

Yes, that's right @Doris-xm
> and runtimes should be installed after CRDs and before manager.

In fact, we only need to deploy the runtimes before the webhook. But I'm also okay with updating the installation order :)
It seems Kustomize does not allow references to resources beyond the scope of the current directory:

```shell
❯ kustomize build manifests/overlays/default >> ../default.yaml
Error: accumulating resources: accumulation err='accumulating resources from '../manager/namespace.yaml': security; file '/home/master1/uin/training-operator/manifests/overlays/manager/namespace.yaml' is not in or below '/home/master1/uin/training-operator/manifests/overlays/default'': must build at directory: '/home/master1/uin/training-operator/manifests/overlays/manager/namespace.yaml': file is not directory
```

We could work around it with the flag `--load-restrictor=LoadRestrictionsNone`. But should we still reuse the manager's `namespace.yaml`?
I prefer to create a namespace file under the default overlay. WDYT? @andreyvelich
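A self-contained variant of that idea would keep its own `namespace.yaml` next to the overlay's `kustomization.yaml`. A minimal sketch, with the file path and namespace name assumed from the discussion above:

```yaml
# manifests/overlays/default/namespace.yaml (hypothetical sketch)
apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-system
```

The overlay's `kustomization.yaml` would then list `- namespace.yaml` under `resources:`, avoiding any cross-directory reference and the load-restrictor workaround.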
```yaml
kind: Kustomization

# Namespace where all resources are deployed.
namespace: kubeflow-system
```
You might want to update our e2e to use this overlay to deploy Kubeflow Trainer:

`trainer/hack/e2e-setup-cluster.sh`, line 50 in 0a498fe:

```yaml
- ../../../manifests/overlays/manager
```
@andreyvelich Should we keep this "wait"? Just to ensure the manager is properly deployed before running subsequent tests, I think.

`trainer/hack/e2e-setup-cluster.sh`, lines 59 to 67 in 0a498fe:

```shell
echo "Wait for Kubeflow Trainer to be ready"
(kubectl wait deploy/kubeflow-trainer-controller-manager --for=condition=available -n ${NAMESPACE} --timeout ${TIMEOUT} &&
  kubectl wait pods --for=condition=ready -n ${NAMESPACE} --timeout ${TIMEOUT} --all) ||
  (
    echo "Failed to wait until Kubeflow Trainer is ready" &&
    kubectl get pods -n ${NAMESPACE} &&
    kubectl describe pods -n ${NAMESPACE} &&
    exit 1
  )
```
And one more question: do we need a validation step for the runtimes deployment, since this may not work:

`trainer/hack/e2e-setup-cluster.sh`, lines 78 to 83 in 0a498fe:

```shell
echo "Deploy Kubeflow Trainer runtimes"
kubectl apply --server-side -k manifests/overlays/runtimes || (
  kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=trainer &&
  print_cluster_info &&
  exit 1
)
```
WDYT @Electronic-Waste @tenzen-y Thanks for your time.
For the first question, I think we should keep the "wait" command.
For the second one, I think it's fine to remove the runtime deployment code, since we've tested it many times.
I just noticed that the base https://github.com/kubeflow/trainer/blob/0a498fef86bc1417bdff0d2f6c17bffa2e90777e/manifests/base/manager/manager.yaml is missing the securityContext. See kubeflow/katib#2528 as a reference. Please add a securityContext that satisfies the PSS restricted profile (https://kubernetes.io/docs/concepts/security/pod-security-standards/) to all of the pods you create.
Signed-off-by: Xinmin Du <2812493086@qq.com>
/rerun-all
```shell
E2E_MANIFESTS_DIR="artifacts/e2e/manifests"
mkdir -p "${E2E_MANIFESTS_DIR}"
cat <<EOF > "${E2E_MANIFESTS_DIR}/kustomization.yaml"
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../manifests/overlays/manager
- ../../../manifests/overlays/default
images:
- name: "${CONTROLLER_MANAGER_CI_IMAGE_NAME}"
  newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"
```
```
resource mapping not found for name: "mpi-distributed" namespace: "kubeflow-system" from "artifacts/e2e/manifests": no matches for kind "ClusterTrainingRuntime" in version "trainer.kubeflow.org/v1alpha1"
ensure CRDs are installed first
resource mapping not found for name: "torch-distributed" namespace: "kubeflow-system" from "artifacts/e2e/manifests": no matches for kind "ClusterTrainingRuntime" in version "trainer.kubeflow.org/v1alpha1"
ensure CRDs are installed first
make: *** [Makefile:142: test-e2e-setup-cluster] Error 1
```
The CRD error could be solved by replacing `kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}"` with:

```shell
echo "Applying CRDs separately"
kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}../../../../manifests/base/crds"
echo "Deploying the rest of the Kubeflow Trainer resources"
kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}"
```

But is this the desired way to resolve this error?
Thanks for the help @izuku-sds. It can be solved by deploying the CRDs first. But we want an all-in-one deployment solution, I think.
> But we want an all-in-one deployment solution, I think.

That's true @Doris-xm, I couldn't find one, so I'm following this PR to see a suitable solution.
WDYT @Electronic-Waste @andreyvelich @tenzen-y
Ah, and one other thing I forgot: please make sure to write Kustomize 5 manifests, such that there are no warnings when running `kustomize build` / `kubectl apply -k`. In general we target Kustomize 5.2.1+.
Signed-off-by: Xinmin Du <2812493086@qq.com>
@juliusvonkohout AFAIK, we've supported Kustomize 5 since #2326. Does that look good to you?
What this PR does / why we need it:
As discussed in #2382, we can use the `fifo` sortOption to provide a streamlined setup process, eliminating the need for separate overlays.

Which issue(s) this PR fixes:
Fixes #2526

Checklist:
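The `fifo` sortOption mentioned above is configured in the overlay's `kustomization.yaml`. A minimal sketch of how it could look — the resource paths here are assumptions for illustration, not taken from this PR:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Emit resources in the order they are listed, instead of Kustomize's
# legacy sorting, so CRDs are applied before the runtimes that use them.
sortOptions:
  order: fifo

resources:
- ../../base/crds
- ../../base/runtimes
- ../../base/manager
```

With `order: fifo` (available in Kustomize 5+), `kubectl apply --server-side -k` applies the CRDs first, which avoids the "no matches for kind ClusterTrainingRuntime" error discussed earlier in this thread.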