Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Node Feature Discovery #10861

Merged
merged 1 commit into from
Mar 5, 2024

Conversation

yankay
Copy link
Member

@yankay yankay commented Jan 31, 2024

What type of PR is this?
/kind feature

What this PR does / why we need it:

Node Feature Discovery is a Kubernetes add-on for detecting hardware features and system configuration.

It's widely used and is required by the GPU Operator.
And it's help to AI-optimized k8s cluster, ref #10622.

Which issue(s) this PR fixes:

Fixes #10854

Special notes for your reviewer:

Does this PR introduce a user-facing change?:

Support Node Feature Discovery

.

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. approved Indicates a PR has been approved by an approver from all required OWNERS files. labels Jan 31, 2024
@k8s-ci-robot k8s-ci-robot added the size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. label Jan 31, 2024
@yankay yankay changed the title Support Node Feature Discovery [WIP]Support Node Feature Discovery Jan 31, 2024
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2024
@yankay yankay force-pushed the nfd-support branch 3 times, most recently from c8da6d2 to c06fddf Compare January 31, 2024 11:39
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
@yankay yankay changed the title [WIP]Support Node Feature Discovery Support Node Feature Discovery Jan 31, 2024
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jan 31, 2024
@yankay
Copy link
Member Author

yankay commented Feb 4, 2024

HI @mzaian, @VannTen
Would you please help to review the PR :-)

@VannTen
Copy link
Contributor

VannTen commented Feb 5, 2024

Should we really add more components to Kubespray ? One the thing mentionned in #6400 was about reducing Kubespray scope and maintenance load.

NFD has an helm chart right ?

@yankay
Copy link
Member Author

yankay commented Feb 6, 2024

Should we really add more components to Kubespray ? One the thing mentionned in #6400 was about reducing Kubespray scope and maintenance load.

NFD has an helm chart right ?

Thanks @VannTen

I think there are 3 topics.


1 Does the Kubespray should upgrade the GPU support?"

GPU Support has been a feature of kubespray since #3304.
The PR is part of "GPU Operator" because it is a more popular and easier-to-use method for supporting GPUs. After the GPU Operator is supported, the old method of using GPUs should be deprecated.

So, The kubespray should upgrade the GPU Support with GPU Operator.


2 "Do the addons of kubespray need to be migrated to helm chats?"

There are great discussions at #6400. Maintenance of the addon's jinjia2 template is not reasonable work.

The Helm way of installing add-ons is great, like https://docs.k3s.io/helm. Using the helm to install add-ons would simplify this project and the maintenance burden.

In the long term, I want to choose the "Helm style" instead of the "Jinjia2 style. " If it works, I think some add-ons should be migrated to it.
PS. Customs CNI, Kubelet csr approver is "Helm style"


3 "Do the NFD add-ons need to be helm chats?"

Because there is no shared view of "Helm style." The PR uses the old way like the other add-ons, or the "helm style" is OK.

What do you think about that @VannTen ?

@VannTen
Copy link
Contributor

VannTen commented Feb 6, 2024 via email

@yankay
Copy link
Member Author

yankay commented Feb 6, 2024

I think there is some addons using helm currently (kubelet_csr_approver ?) optionally. However the helm_apps role has a bad interface (I wrote it T_T) and I plan to rewrite it in the mid-term (if I can pull that off this month that would be great). We should not block other things ofc. OTOH, for kubespray users, packaging NFD or GPU operator is mostly convenience, isn't it ? I mean, is the setup more involved than installing the helm chart ?

Thanks @VannTen

Looking forward to the new helm role very much. 💕💕💕

Q: “Is the setup more involved than installing the helm chart ?”
A: In my opinion, the add-on (helm style or jinjia2 style) in Kubespray has some benefits over "helm charts without Kubespray. "

  1. It's integrated and tested.
  2. It's all based on Ansible; the users can use Ansible var xx_enabled to enable an addon.
  3. The kubespray can do some prerequisite checks for the addon, but the helm chart cannot. So it makes it easy. like https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/getting-started.html#prerequisites.
  4. Some var is important for many components, such as kube_proxy_strict_arp. It's useful for kube-proxy, kube-vip, and metallb. If the user installs kube-vip without kubespray, He will take care of it himself.
  5. In an air-gap environment, Kubespray declares the image. The helm does not do it.

So, if the add-on has many system relationships, like CNI/GPU, Kubespray can make it easy to install it. But if the addon does not have system requirements, like a Java web application, there is no need to use Kubespray to install it.

Copy link
Contributor

@mzaian mzaian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @yankay

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 5, 2024
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mzaian, yankay

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot merged commit 90b0151 into kubernetes-sigs:master Mar 5, 2024
65 checks passed
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request Mar 23, 2024
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
@mzaian mzaian mentioned this pull request Apr 26, 2024
dibi-codes pushed a commit to fino-digital/kubespray that referenced this pull request May 7, 2024
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
pedro-peter pushed a commit to pedro-peter/kubespray that referenced this pull request May 8, 2024
Signed-off-by: Kay Yan <kay.yan@daocloud.io>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Node feature discovery Support
4 participants