KEP-2170: Add manifest overlays for standalone installation #2527


Open
wants to merge 6 commits into master

Conversation

Doris-xm

What this PR does / why we need it:
As discussed in #2382, we can use fifo sortOption to provide a streamlined setup process, eliminating the need for separate overlays.
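As an illustration (not the final layout in this PR), a minimal sketch of how the fifo sortOption could look in the standalone overlay's kustomization.yaml; the base paths below are assumptions:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# Emit resources in the order they are listed instead of re-sorting them,
# so the CRDs come first, then the runtimes, and the manager last.
sortOptions:
  order: fifo

resources:
- ../../base/crds      # assumed path to the CRDs base
- ../../base/runtimes  # assumed path to the runtimes base
- ../../base/manager   # assumed path to the manager base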

Which issue(s) this PR fixes:
Fixes #2526

Checklist:

  • Docs included if any changes are user facing

Signed-off-by: Xinmin Du <2812493086@qq.com>

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Member

@Electronic-Waste Electronic-Waste left a comment

Thanks for this great contribution @Doris-xm! Just one question for you and other folks.

Member

Should we remove manifests/overlays/manager and manifests/overlays/runtimes, since we can install them with an all-in-one command?

kubectl apply --server-side -k "github.com/kubeflow/trainer/manifests/overlays/standalone"

WDYT? @Doris-xm @kubeflow/wg-training-leads @astefanutti

Member

Both are still worth it for GitOps.

Member

@andreyvelich andreyvelich Mar 16, 2025

Yes, we discussed before that some users might want to install only the manager or the runtimes.
@tenzen-y Actually, I can't imagine a use case where the runtimes should be installed without the manager, since they are deeply coupled with the manager release.
Do we want to keep only the manager in a separate overlay for now, and add the runtimes when we get user feedback for it?

Member

In production (not only for GitOps), users want to install only the required components. However, I think the runtimes are optional, since not all runtimes are required by them.

namespace: kubeflow-system

resources:
- namespace.yaml
Member

Can we re-use the namespace from the manager overlay?

Member

@Electronic-Waste Electronic-Waste Mar 17, 2025

I wonder if we could just rely on the manager overlay, since the two are almost the same when it comes to installing the manager.

It sounds weird to re-use just the namespace from the manager overlay.

Author

@andreyvelich Do you mean we re-use the file from the manager folder, like:

resources:
  - ../manager/namespace.yaml

I just wonder whether the standalone overlay should be self-contained or depend on the manager overlay?

Member

I wonder if we could just rely on the manager overlay, since the two are almost the same when it comes to installing the manager.

We can't do that, since the manager overlay includes both the CRDs and the manager, and the runtimes should be installed after the CRDs and before the manager.

Do you mean we re-use the file from the manager folder, like:

Yes, that's right @Doris-xm

Member

and the runtimes should be installed after the CRDs and before the manager.

In fact, we just need to deploy the runtimes before the webhook. But I'm also okay with updating the installation order :)

Author

It seems Kustomize does not allow referencing resources beyond the scope of the current directory.

❯ kustomize build manifests/overlays/default >> ../default.yaml
Error: accumulating resources: accumulation err='accumulating resources from '../manager/namespace.yaml': security; file '/home/master1/uin/training-operator/manifests/overlays/manager/namespace.yaml' is not in or below '/home/master1/uin/training-operator/manifests/overlays/default'': must build at directory: '/home/master1/uin/training-operator/manifests/overlays/manager/namespace.yaml': file is not directory

We could work around it with the --load-restrictor=LoadRestrictionsNone flag. But should we still reuse the manager's namespace.yaml?

Member

I would prefer to create a namespace file under the default overlay. WDYT? @andreyvelich
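For illustration, that dedicated namespace file under the default overlay could be as small as the following sketch (assuming the namespace stays kubeflow-system):

apiVersion: v1
kind: Namespace
metadata:
  name: kubeflow-system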

kind: Kustomization

# Namespace where all resources are deployed.
namespace: kubeflow-system
Member

You might want to update our e2e to use this overlay to deploy Kubeflow Trainer

- ../../../manifests/overlays/manager

Author

@andreyvelich Should we keep this "wait"? I think it is there to ensure the manager is properly deployed before running the subsequent tests.

echo "Wait for Kubeflow Trainer to be ready"
(kubectl wait deploy/kubeflow-trainer-controller-manager --for=condition=available -n ${NAMESPACE} --timeout ${TIMEOUT} &&
kubectl wait pods --for=condition=ready -n ${NAMESPACE} --timeout ${TIMEOUT} --all) ||
(
echo "Failed to wait until Kubeflow Trainer is ready" &&
kubectl get pods -n ${NAMESPACE} &&
kubectl describe pods -n ${NAMESPACE} &&
exit 1
)

And one more question: Do we need a validation step for the runtimes deployment since this may not work:

echo "Deploy Kubeflow Trainer runtimes"
kubectl apply --server-side -k manifests/overlays/runtimes || (
kubectl logs -n ${NAMESPACE} -l app.kubernetes.io/name=trainer &&
print_cluster_info &&
exit 1
)

WDYT @Electronic-Waste @tenzen-y Thanks for your time.

Member

For the first question, I think we should keep the "wait" command.

For the second one, I think it's fine to remove the runtime deployment code, since we've tested it many times.

@juliusvonkohout
Member

juliusvonkohout commented Mar 17, 2025

I just noticed that the base https://github.com/kubeflow/trainer/blob/0a498fef86bc1417bdff0d2f6c17bffa2e90777e/manifests/base/manager/manager.yaml is missing the securityContext. See kubeflow/katib#2528 as a reference. Please add a securityContext that satisfies PSS restricted (https://kubernetes.io/docs/concepts/security/pod-security-standards/) to all of the pods you create.
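For reference, a minimal sketch of a container securityContext that satisfies the PSS restricted profile (generic values, not taken from this PR):

securityContext:
  # Required by the restricted profile.
  allowPrivilegeEscalation: false
  runAsNonRoot: true
  capabilities:
    drop:
    - ALL
  seccompProfile:
    type: RuntimeDefault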

Signed-off-by: Xinmin Du <2812493086@qq.com>
Signed-off-by: Xinmin Du <2812493086@qq.com>
@google-oss-prow google-oss-prow bot added size/S and removed size/M labels Mar 18, 2025
Signed-off-by: Xinmin Du <2812493086@qq.com>
@google-oss-prow google-oss-prow bot added size/M and removed size/S labels Mar 19, 2025
Signed-off-by: Xinmin Du <2812493086@qq.com>
@Doris-xm
Author

/rerun-all

E2E_MANIFESTS_DIR="artifacts/e2e/manifests"
mkdir -p "${E2E_MANIFESTS_DIR}"
cat <<EOF > "${E2E_MANIFESTS_DIR}/kustomization.yaml"
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
- ../../../manifests/overlays/manager
- ../../../manifests/overlays/default
images:
- name: "${CONTROLLER_MANAGER_CI_IMAGE_NAME}"
newTag: "${CONTROLLER_MANAGER_CI_IMAGE_TAG}"

E2E test Error:

resource mapping not found for name: "mpi-distributed" namespace: "kubeflow-system" from "artifacts/e2e/manifests": no matches for kind "ClusterTrainingRuntime" in version "trainer.kubeflow.org/v1alpha1"
ensure CRDs are installed first
resource mapping not found for name: "torch-distributed" namespace: "kubeflow-system" from "artifacts/e2e/manifests": no matches for kind "ClusterTrainingRuntime" in version "trainer.kubeflow.org/v1alpha1"
ensure CRDs are installed first
make: *** [Makefile:142: test-e2e-setup-cluster] Error 1

The CRD error could be solved by replacing kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}" with:

echo "Applying CRDs separately"
kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}../../../../manifests/base/crds"

echo "Deploying the rest of the Kubeflow Trainer resources"
kubectl apply --server-side -k "${E2E_MANIFESTS_DIR}"

But is this the desired way to resolve this error?

Author

@Doris-xm Doris-xm Mar 27, 2025

Thanks for the help @izuku-sds. It can be solved by deploying the CRDs first. But we want an all-in-one deployment solution, I think.

But we want an all-in-one deployment solution, I think.

That's true @Doris-xm, I couldn't find it, so I'm following this PR to see a suitable solution.
WDYT @Electronic-Waste @andreyvelich @tenzen-y

@juliusvonkohout
Member

juliusvonkohout commented Mar 27, 2025

I just noticed that the base https://github.com/kubeflow/trainer/blob/0a498fef86bc1417bdff0d2f6c17bffa2e90777e/manifests/base/manager/manager.yaml is missing the securityContext. See kubeflow/katib#2528 as a reference. Please add a securityContext that satisfies PSS restricted (https://kubernetes.io/docs/concepts/security/pod-security-standards/) to all of the pods you create.

Ah, and one other thing I forgot: please make sure to write Kustomize 5 manifests, so that there are no warnings when running kustomize build / kubectl apply -k. In general we target Kustomize 5.2.1+.
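For context, a sketch of the Kustomize 5 field renames that typically cause those warnings (the patch file name below is hypothetical):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

# commonLabels is deprecated in favor of labels.
labels:
- includeSelectors: true
  pairs:
    app.kubernetes.io/name: trainer

# patchesStrategicMerge is deprecated in favor of patches.
patches:
- path: manager_patch.yaml  # hypothetical patch file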

Signed-off-by: Xinmin Du <2812493086@qq.com>
@Electronic-Waste
Member

Ah, and one other thing I forgot: please make sure to write Kustomize 5 manifests, so that there are no warnings when running kustomize build / kubectl apply -k. In general we target Kustomize 5.2.1+.

@juliusvonkohout AFAIK, we've supported Kustomize 5 in: #2326. Does that look good to you?
