DRA: Integrates with DRA and CDI #3329
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@ Coverage Diff @@
## main #3329 +/- ##
=======================================
Coverage 81.31% 81.31%
=======================================
Files 50 50
Lines 4352 4352
=======================================
Hits 3539 3539
Misses 661 661
Partials 152 152
Flags with carried forward coverage won't be shown.
94c4e52 to d75b9c1
// +kubebuilder:validation:Optional
MultusNames []string `json:"multusNames,omitempty"`

// +kubebuilder:validation:Optional
Could you add comments describing what these parameters are for?
These fields are not fully settled yet; I don't think this PR needs to finalize them.
pkg/k8s/apis/spiderpool.spidernet.io/v2beta1/spiderclaimparameters_types.go
Outdated
need : |
	return &driver{spiderClientset: spiderClientset}
}

func (d driver) GetClassParameters(ctx context.Context, class *resourcev1alpha2.ResourceClass) (interface{}, error) {
Newbie question: when does this get called? For example, when a resourceclass is created?
Actually, this function is used by the dra-controller when it calls allocate. I have not yet seen the current DRA implementation make use of it. My guess is that it could be used when a resourceclass is created: after the dra-plugin reads it, the node could perform some initialization of the hardware resources that the resourceclass represents.
}

func (d driver) allocate(ctx context.Context, claim *resourcev1alpha2.ResourceClaim, claimParameters interface{}, class *resourcev1alpha2.ResourceClass, classParameters interface{}, selectedNode string) (*resourcev1alpha2.AllocationResult, error) {
	if selectedNode == "" {
Looking at this logic, does each selectedNode represent one node? During matching, selectedNode is not matched or filtered against claimParameters and the like, so there is no scheduling effect yet?
The current code does not involve scheduling yet; selectedNode is set by kube-scheduler. DRA supports two allocation modes: immediate allocation and delayed allocation. Immediate allocation (allocating as soon as the resourceclaim is created) is not supported for now.
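To make the reply above concrete, here is a minimal, self-contained sketch of the guard it describes. It assumes that an empty selectedNode is how the controller signals immediate allocation. `AllocationResult` and `errImmediateUnsupported` are simplified, hypothetical stand-ins rather than the real `resourcev1alpha2` types or the PR's actual error handling.

```go
package main

import (
	"errors"
	"fmt"
)

// AllocationResult is a simplified, hypothetical stand-in for
// resourcev1alpha2.AllocationResult; it only records the chosen node.
type AllocationResult struct {
	Node string
}

// errImmediateUnsupported is a hypothetical error used only in this sketch.
var errImmediateUnsupported = errors.New("immediate allocation is not supported")

// allocate sketches the guard discussed in the review: an empty selectedNode
// is assumed to mean immediate allocation, which is rejected. With delayed
// allocation, kube-scheduler has already picked the node, so the driver
// simply pins the result to it without doing any scheduling of its own.
func allocate(claimName, selectedNode string) (*AllocationResult, error) {
	if selectedNode == "" {
		return nil, fmt.Errorf("claim %q: %w", claimName, errImmediateUnsupported)
	}
	return &AllocationResult{Node: selectedNode}, nil
}

func main() {
	// Immediate allocation: no node has been selected yet.
	if _, err := allocate("demo", ""); err != nil {
		fmt.Println("immediate:", err)
	}
	// Delayed allocation: the scheduler supplied a node name.
	if res, err := allocate("demo", "node-1"); err == nil {
		fmt.Println("delayed: allocated on", res.Node)
	}
}
```

In this sketch the driver never filters nodes against the claim parameters, which matches the reviewer's observation that no scheduling effect exists yet.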
pkg/dra/dra-plugin/plugin.go
Outdated
	"k8s.io/dynamic-resource-allocation/kubeletplugin"
)

func StartDRAPlugin(logger *zap.Logger, cdiRoot, so string) (kubeletplugin.DRAPlugin, error) {
`so string` could be changed to a map, so that it is easy to extend to mounting multiple .so files in the future
map[featureNameString]SoPathString
Wouldn't a slice be enough?
Mounting multiple .so files doesn't seem meaningful to me. With several mounted, how should the LD_PRELOAD variable be specified?
I am thinking of an easily extensible framework, so that new .so files can be added conveniently later.
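What you probably want is to be able to add a new .so easily without changing any code. But a new feature needs more than a .so; it also needs ENV and other conditions, so this cannot be fixed in advance. With the current framework, developing a new feature requires the following code changes:
- Add a new field to spiderclaimparameter
- Modify the dra-plugin code to change the CDI file
Controlling feature switches through spiderclaimparameter is the more standard approach. The approach you describe takes effect globally: even if some Pods do not need a feature, once it is specified at install time it is mounted whenever a Pod is created, and there is no way to exercise finer-grained control.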
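The per-claim switch argued for above can be sketched as follows. This is only an illustration under stated assumptions: `ClaimParameters`, its `PreloadSo` field, `ContainerEdits`, and the example .so path are all hypothetical names invented for this sketch, not the real SpiderClaimParameter schema or the CDI library types.

```go
package main

import "fmt"

// ClaimParameters is a hypothetical, simplified stand-in for the
// SpiderClaimParameter spec. PreloadSo is an invented per-claim switch.
type ClaimParameters struct {
	PreloadSo bool
}

// ContainerEdits is a simplified, hypothetical stand-in for the container
// edits that a CDI file would describe.
type ContainerEdits struct {
	Env    []string // "KEY=value" entries
	Mounts []string // "hostPath:containerPath" entries
}

// buildEdits returns the edits for one claim. The .so mount and the
// LD_PRELOAD variable are added only when the claim opts in, so Pods that
// do not request the feature are left untouched.
func buildEdits(p ClaimParameters, soPath string) ContainerEdits {
	var edits ContainerEdits
	if p.PreloadSo {
		edits.Mounts = append(edits.Mounts, soPath+":"+soPath)
		edits.Env = append(edits.Env, "LD_PRELOAD="+soPath)
	}
	return edits
}

func main() {
	so := "/usr/lib/example.so" // hypothetical path, for illustration only

	fmt.Printf("opted in:  %+v\n", buildEdits(ClaimParameters{PreloadSo: true}, so))
	fmt.Printf("opted out: %+v\n", buildEdits(ClaimParameters{}, so))
}
```

Adding another feature in this style means adding a field and another branch in `buildEdits`, which mirrors the two code changes listed in the reply; a global map of .so paths would instead apply to every Pod.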
39867d1 to 45e1169
932793c to 03d6780
/cc @weizhoublue The PR is ready to merge.
6. Create resource files such as workloads and resourceClaim.

```
~# export NAME=demo
```
`export NAME=demo` followed by a pile of YAML: this cannot possibly run successfully.
docs/usage/dra_zh_CN.md
Outdated
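
Spiderpool now integrates the DRA framework, which enables capabilities including, but not limited to, the following:

* Pods can be scheduled automatically to a suitable node based on the subnet and NIC information they use, so that a Pod does not fail to start after being scheduled to a node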
Please explain the principle or the prerequisites: is scheduling decided by combining three pieces of information, namely the master NIC status reported by each node, the master interface in the multus configuration, and the ippool?
docs/usage/dra_zh_CN.md
Outdated
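Spiderpool now integrates the DRA framework, which enables capabilities including, but not limited to, the following:

* Pods can be scheduled automatically to a suitable node based on the subnet and NIC information they use, so that a Pod does not fail to start after being scheduled to a node
* Unify how resources are declared across multiple device-plugins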
Please say which device-plugins these are and give reference links. Also explain which field corresponds to which device-plugin.
"Explain which field corresponds to which device-plugin"
Which field?
The parameter CRD; add a note on how it is used.
Signed-off-by: cyclinder <qifeng.guo@daocloud.io>
Now spiderpool integrates dra and cdi, which allows for some complex scheduling and better manipulation of hardware resources based on dra.
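
Spiderpool now integrates the DRA framework, which enables capabilities including, but not limited to, the following:

* Based on the NIC and subnet information reported by each node, combined with the SpiderMultusConfig configuration used by the Pod, Pods are scheduled automatically to a suitable node, so that a Pod does not fail to start after being scheduled to a node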
(1) This explanation is too brief; I need more detail: how it works, how to troubleshoot it (is the NIC and IP information reported to some CRD?), and what exactly the SpiderMultusConfig configuration is, or an example of what setup allows a Pod to be scheduled.
(2) If host 1's eth0 is on subnet 10, and the Pod declares macvlan with master eth1 and requires subnet 10, can it still be scheduled to host 1?
Alternatively, dedicate a section of this document to explaining this in detail.
This content will be added all together once the feature has been implemented.
1. Prepare a recent Kubernetes cluster; a version later than v1.29.0 is recommended, with the cluster's dra feature-gate enabled
2. Kubectl and [Helm](https://helm.sh/docs/intro/install/) are installed

## Quick Start
This should state what the quick start is meant to demonstrate (subnet-based scheduling? obtaining SR-IOV VFs?), so that readers do not have to go through the whole walkthrough to find out.
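Spiderpool now integrates the DRA framework, which enables capabilities including, but not limited to, the following:

* Based on the NIC and subnet information reported by each node, combined with the SpiderMultusConfig configuration used by the Pod, Pods are scheduled automatically to a suitable node, so that a Pod does not fail to start after being scheduled to a node
* In SpiderClaimParameter, unify how resources are used across multiple device-plugins such as [sriov-network-device-plugin](https://github.com/k8snetworkplumbingwg/sriov-network-device-plugin) and [k8s-rdma-shared-dev-plugin](https://github.com/Mellanox/k8s-rdma-shared-dev-plugin)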
This is too brief. A newcomer needs to know how to use it and how to troubleshoot it.
For example, which field in SpiderClaimParameter enables a given device plugin's functionality? If DRA is enabled and the Pod also declares resources, which one takes effect first?
Thanks for contributing!
What type of PR is this?
What this PR does / why we need it:
Now spiderpool integrates dra and cdi, which allows for some complex scheduling and better manipulation of hardware resources based on dra.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer: