kdoctor agent schedule #137

Merged
merged 4 commits into from
Aug 24, 2023

Icarus9913
Contributor

No description provided.

codecov bot commented Aug 11, 2023

Codecov Report

Merging #137 (c5368a7) into main (77b1679) will decrease coverage by 1.69%.
Report is 6 commits behind head on main.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##             main     #137      +/-   ##
==========================================
- Coverage   40.63%   38.94%   -1.69%     
==========================================
  Files           8        8              
  Lines         507      529      +22     
==========================================
  Hits          206      206              
- Misses        296      318      +22     
  Partials        5        5              
| Flag | Coverage Δ |
|------|------------|
| unittests | 38.94% <0.00%> (-1.69%) ⬇️ |

| Files Changed | Coverage Δ |
|---------------|------------|
| pkg/reportManager/manager.go | 0.00% <0.00%> (ø) |
| pkg/reportManager/worker.go | 17.69% <0.00%> (-3.41%) ⬇️ |

Icarus9913 force-pushed the feat/wk/schedule-v2 branch 2 times, most recently from 1240af1 to 2a64ca0, on August 11, 2023 06:38
Icarus9913 changed the title from "Feat/wk/schedule v2" to "kdoctor agent schedule" on Aug 11, 2023
Signed-off-by: Icarus9913 <icaruswu66@qq.com>

| Case ID | Title | Priority | Smoke | Status | Other |
|---------|-------------------------------------------------------------------|----------|-------|--------|-------------|
| E00001 | Successfully testing Task Runtime creation | p1 | | | |
Collaborator

This description could be more detailed. The whole point is that a reader can tell from the description which cases exist.

For example, does it create one particular CRD or all of them? Does it create a Deployment or a DaemonSet?

Collaborator

Additional cases for this are still being written and have not been pushed yet.

|---------|-------------------------------------------------------------------|----------|-------|--------|-------------|
| E00001 | Successfully testing Task Runtime creation | p1 | | | |
| E00002 | Successfully testing Task Runtime Service creation | p1 | | | |
| E00003 | Successfully testing cascading deletion with Task Runtime Service | p1 | | | |
Collaborator

BTW, are there cases that verify that resource creation, deletion time, intermediate status transitions, and so on match what the spec expects?

Collaborator

Additional cases for this are still being written and have not been pushed yet.


| Field | Description | Type | Validation | Values | Default |
|-------|-------------|------|------------|--------|---------|
| annotation | Annotations of the agent workload, used together with multus multi-NIC | map[string]string | optional | | |
Collaborator

"used together with multus multi-NIC"
Do not add this restriction; it is misleading.

Collaborator

done

@@ -4,7 +4,7 @@

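## Introduction

For this kind of task, each kdoctor agent sends HTTP requests to the specified target and obtains the success rate and mean latency. The default concurrency is 50, which covers multi-replica scenarios, and it can be set in the kdoctor configmap. The result is judged against the success condition. In addition, a detailed report is available through the aggregated API.
For this kind of task, kdoctor-controller generates the corresponding agent from agentSpec, and each agent pod sends HTTP requests to the specified target and obtains the success rate and mean latency. The default concurrency is 50, which covers multi-replica scenarios, and it can be set in the kdoctor configmap. The result is judged against the success condition. In addition, a detailed report is available through the aggregated API.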
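
The measurement described in the introduction (bounded concurrency, success rate, mean latency) can be illustrated with a minimal Go sketch. This is not kdoctor's implementation: `RunRound`, `Result`, and `fakeRequest` are hypothetical names, the request itself is a stub instead of a real HTTP call, and the mean latency is assumed to be averaged over all requests, including failed ones.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// Result aggregates the outcome of one round of requests (hypothetical type).
type Result struct {
	Total       int
	Succeeded   int
	MeanLatency time.Duration
}

// SuccessRate returns the fraction of successful requests in [0, 1].
func (r Result) SuccessRate() float64 {
	if r.Total == 0 {
		return 0
	}
	return float64(r.Succeeded) / float64(r.Total)
}

// RunRound issues total requests with at most concurrency in flight and
// records how many succeeded and the mean latency over all requests.
func RunRound(total, concurrency int, do func(i int) error) Result {
	if concurrency < 1 {
		concurrency = 1 // avoid blocking forever on a zero-capacity semaphore
	}
	var (
		mu  sync.Mutex
		wg  sync.WaitGroup
		sum time.Duration
		ok  int
	)
	sem := make(chan struct{}, concurrency)
	for i := 0; i < total; i++ {
		wg.Add(1)
		sem <- struct{}{} // blocks while concurrency requests are in flight
		go func(i int) {
			defer wg.Done()
			defer func() { <-sem }()
			start := time.Now()
			err := do(i)
			elapsed := time.Since(start)
			mu.Lock()
			defer mu.Unlock()
			sum += elapsed
			if err == nil {
				ok++
			}
		}(i)
	}
	wg.Wait()
	res := Result{Total: total, Succeeded: ok}
	if total > 0 {
		res.MeanLatency = sum / time.Duration(total)
	}
	return res
}

// fakeRequest stands in for an HTTP request; every fourth call fails.
func fakeRequest(i int) error {
	if i%4 == 0 {
		return errors.New("simulated failure")
	}
	return nil
}

func main() {
	res := RunRound(200, 50, fakeRequest)
	fmt.Printf("success rate: %.2f, mean latency: %s\n", res.SuccessRate(), res.MeanLatency)
}
```

A success condition would then be a comparison of `SuccessRate()` and `MeanLatency` against thresholds taken from the task spec.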

Collaborator

@weizhoublue, Aug 15, 2023

Which resources are created? Deployment and Service?
What is the resource deletion logic?
What is the report collection logic? Does deletion affect the report? How do Deployment deletion and CR deletion relate to report retention?

I suggest describing all of this in a separate md file and linking to it from each CRD document.

Collaborator

done

@@ -67,6 +67,10 @@ kind: AppHttpHealthy
metadata:
name: http1
spec:
agentSpec:
Collaborator

get-started should stay concise and simply work with the defaults; do not include these fields.

Collaborator

done

| env | Environment variables of the agent workload | [env](https://github.com/kubernetes/kubernetes/blob/v1.27.0/staging/src/k8s.io/api/core/v1/types.go#L2012) | optional | | |
| hostNetwork | Whether the agent workload uses the host network | bool | optional | true, false | false |
| resources | Resource configuration of the agent workload | [resources](https://github.com/kubernetes/kubernetes/blob/v1.27.0/staging/src/k8s.io/api/core/v1/types.go#L2333) | optional | | |
| terminationGracePeriodMinutes | Number of minutes after the agent workload finishes its task before it is terminated | int | optional | greater than or equal to 0 | 60 |
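
To make the `terminationGracePeriodMinutes` row concrete, here is a small Go sketch of how a termination time could be derived from the task finish time. It is an illustration only, not the controller's code: `DeletionTime`, `defaultGraceMinutes`, and `exampleFinish` are hypothetical names, and the default of 60 is hard-coded here purely because the table documents it.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// defaultGraceMinutes mirrors the default documented in the table (60).
// Assumption: a real deployment might read this from chart values instead.
const defaultGraceMinutes = 60

// DeletionTime returns the time at which a finished agent workload becomes
// eligible for termination. A nil graceMinutes means "use the default";
// negative values are rejected, matching the "greater than or equal to 0" rule.
func DeletionTime(finishedAt time.Time, graceMinutes *int) (time.Time, error) {
	minutes := defaultGraceMinutes
	if graceMinutes != nil {
		minutes = *graceMinutes
	}
	if minutes < 0 {
		return time.Time{}, errors.New("terminationGracePeriodMinutes must be >= 0")
	}
	return finishedAt.Add(time.Duration(minutes) * time.Minute), nil
}

// exampleFinish returns a fixed finish time used for illustration.
func exampleFinish() time.Time {
	return time.Date(2023, time.August, 24, 12, 0, 0, 0, time.UTC)
}

func main() {
	at, err := DeletionTime(exampleFinish(), nil)
	if err != nil {
		fmt.Println("invalid grace period:", err)
		return
	}
	fmt.Println("workload may be terminated at:", at.Format(time.RFC3339))
}
```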
Collaborator

It would be better to make this default of 60 customizable in the chart values.

Collaborator

done

| affinity | Affinity of the agent workload | [labelSelector](https://github.com/kubernetes/kubernetes/blob/v1.27.0/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1195) | optional | | |
| env | Environment variables of the agent workload | [env](https://github.com/kubernetes/kubernetes/blob/v1.27.0/staging/src/k8s.io/api/core/v1/types.go#L2012) | optional | | |
| hostNetwork | Whether the agent workload uses the host network | bool | optional | true, false | false |
| resources | Resource configuration of the agent workload | [resources](https://github.com/kubernetes/kubernetes/blob/v1.27.0/staging/src/k8s.io/api/core/v1/types.go#L2333) | optional | | |
Collaborator

What is the default? No resource limits?

Collaborator

The default is cpu: 100m, mem: 128Mi.
It can be set in the chart values.

Collaborator

I mean that the documentation needs to state this, not that you should tell me in a comment.

Collaborator

done

| E00001 | Successfully testing Task Runtime creation | p1 | | | |
| E00002 | Successfully testing Task Runtime Service creation | p1 | | | |
| E00003 | Successfully testing cascading deletion with Task Runtime Service | p1 | | | |
| E00004 | Successfully testing cascading deletion with Task Runtime | p1 | | | |
Collaborator

@weizhoublue, Aug 15, 2023

(1) What happens if these resources are deleted by someone midway? Will the business code crash, what will the CRD status show, and should a finalizer or webhook be added to guard against it?
(2) If the CRD is deleted midway, what is the expected behavior, and is there a case for it?

Collaborator

There are cases for this.

| Case ID | Title | Priority | Smoke | Status | Other |
|---------|-------------------------------------------------------------------|----------|-------|--------|-------------|
| E00001 | Successfully testing Task Runtime creation | p1 | | | |
| E00002 | Successfully testing Task Runtime Service creation | p1 | | | |
Collaborator

When the project is uninstalled, is the uninstall process affected? Are any CRs or Deployments left behind?

Collaborator

This feature is not implemented yet; let's do it later.

Collaborator

@weizhoublue, Aug 22, 2023

This is simple; see spidernet-io/spiderpool@b5b8919.
Use it as a reference: either implement it all, or at least reflect the cases in the documentation.

Collaborator

This PR already contains too much; let's handle it in a new PR.

ii2day force-pushed the feat/wk/schedule-v2 branch 2 times, most recently from 7f0f50e to 925f202, on August 15, 2023 06:05
Signed-off-by: ii2day <ji.li@daocloud.io>
ii2day force-pushed the feat/wk/schedule-v2 branch 6 times, most recently from 7935ea4 to 9002712, on August 16, 2023 03:45
ii2day force-pushed the feat/wk/schedule-v2 branch 2 times, most recently from cb927e6 to 0738421, on August 16, 2023 10:32
@@ -73,6 +73,11 @@ spec:
- {{ .Values.kdoctorController.cmdBinName }}
args:
- --config-path=/tmp/config-map/conf.yml
- --configmap-deployment-template=/tmp/configmap-app-template/deployment.yml
Collaborator

@weizhoublue, Aug 22, 2023

Isn't the cost of this a bit high? Every future configuration item would require a new command-line argument.

Could we pass a configmap name here and let the code fetch and read it itself?
Or simply pass a path.

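Contributor Author

The amount of code is the same. If everything is put into the configmap, the backend code reads it, checks whether a value is present, and then calls json.Unmarshal into some struct instance; adding a configuration item later still requires changing the backend code.

In addition, the backend code here is already templated, so the whole "validate, read" flow is standardized.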
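
The "read, check for a value, then unmarshal into a struct" flow mentioned in this reply might look roughly like the Go sketch below. It is a hedged illustration, not the PR's code: `AgentTemplate`, `decodeConfig`, and `loadConfig` are hypothetical names, the fields are invented for the example, and the input is assumed to be JSON (the template mounted in the chart is a `.yml` file, which would need a YAML decoder instead).

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
	"os"
)

// AgentTemplate is a hypothetical configuration struct used for illustration.
type AgentTemplate struct {
	Kind     string `json:"kind"`
	Replicas int    `json:"replicas"`
}

// decodeConfig validates that raw is non-empty and unmarshals it into out.
func decodeConfig(raw []byte, out interface{}) error {
	if len(bytes.TrimSpace(raw)) == 0 {
		return errors.New("config is empty")
	}
	return json.Unmarshal(raw, out)
}

// loadConfig reads the file at path (for example a mounted configmap key)
// and decodes it with decodeConfig.
func loadConfig(path string, out interface{}) error {
	raw, err := os.ReadFile(path)
	if err != nil {
		return fmt.Errorf("read %s: %w", path, err)
	}
	return decodeConfig(raw, out)
}

func main() {
	var tpl AgentTemplate
	if err := decodeConfig([]byte(`{"kind":"DaemonSet","replicas":0}`), &tpl); err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Printf("%+v\n", tpl)
}
```

With this shape, each new configuration item needs a new struct (or field) and one `loadConfig` call, whether the path arrives as a command-line flag or is derived from a single mounted directory.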

| kind | Type of the agent workload | string | optional | Deployment, DaemonSet | DaemonSet |
| deploymentReplicas | Desired number of replicas when the agent workload type is Deployment | int | optional | greater than or equal to 0 | 0 |
| affinity | Affinity of the agent workload | [labelSelector](https://github.com/kubernetes/kubernetes/blob/v1.27.0/staging/src/k8s.io/apimachinery/pkg/apis/meta/v1/types.go#L1195) | optional | | |
| env | Environment variables of the agent workload | [env](https://github.com/kubernetes/kubernetes/blob/v1.27.0/staging/src/k8s.io/api/core/v1/types.go#L2012) | optional | | |
Collaborator

@weizhoublue, Aug 22, 2023

Do not link to code; the code changes and the line numbers shift.
If users need to be aware of this, there should be a referent/agent.md that describes the startup command, environment variables, and so on.

Collaborator

As far as I can see, spiderpool writes it this way.

Contributor Author

Collaborator

That has the same versioning problem.

Collaborator

@weizhoublue, Aug 22, 2023

We are a small project. As the code iterates and line numbers change, we do not have the capacity to keep these links up to date over the long term, and no CI job checks whether the linked line numbers are still correct.
Besides, the code has no textual description of the environment variables, so nobody can tell what they mean.

Collaborator

Signed-off-by: ii2day <ji.li@daocloud.io>
ii2day force-pushed the feat/wk/schedule-v2 branch 3 times, most recently from 7a2ba07 to daab7b3, on August 22, 2023 09:08
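@weizhoublue
Collaborator

As discussed earlier, be sure to set resource limits for the agent to avoid affecting production environments.
They should be definable in the chart values, with default values.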


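### Workload

The workload is a DaemonSet or a Deployment, DaemonSet by default. Each Pod in the workload issues requests according to the task configuration and persists the execution results to disk inside the Pod. This can be set through AgentSpec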
Collaborator

The task starts, following the schedule defined in the spec, only after all pods are ready.

Collaborator

done

the destruction logic is the same.

### Ingress

Collaborator

@weizhoublue, Aug 22, 2023

(1) How are deletion of the task CR, graceful deletion of the resources, and deletion of the report related to one another? Is there a sequence diagram or similar to express the relationship, so that operators know the impact of their actions?
When are resources deleted, and when is it safe to delete the CR?

(2) How is graceful deletion of the task resources designed, and why is it needed?

Collaborator

done

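Collaborator

ii2day commented Aug 23, 2023

> As discussed earlier, be sure to set resource limits for the agent to avoid affecting production environments. They should be definable in the chart values, with default values.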

done

ii2day force-pushed the feat/wk/schedule-v2 branch 4 times, most recently from 4967054 to 5a453d0, on August 23, 2023 06:43
@@ -0,0 +1,59 @@
## runtime

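After a task CR is applied, kdoctor-controller generates the corresponding task carrier (DaemonSet or Deployment) from the AgentSpec in the CR. Once all Pods are ready, the task starts executing according to the task definition in the Spec. Each task uses its own dedicated carrier.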
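
The "start only when all Pods are ready" gate in this paragraph can be sketched in Go as a simple polling loop. This is an assumption-laden illustration rather than the controller's logic: `WorkloadStatus`, `WaitAllReady`, `ErrNotReady`, and `fakeStatus` are hypothetical names, the status type is not a Kubernetes API type, and treating a workload with zero desired Pods as "not ready" is a choice made only for this example.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// WorkloadStatus is a simplified, hypothetical view of a DaemonSet or
// Deployment status.
type WorkloadStatus struct {
	DesiredPods int
	ReadyPods   int
}

// AllReady reports whether every desired Pod is ready.
// Assumption: a workload with zero desired Pods never counts as ready.
func (s WorkloadStatus) AllReady() bool {
	return s.DesiredPods > 0 && s.ReadyPods >= s.DesiredPods
}

// ErrNotReady is returned when the workload did not become ready in time.
var ErrNotReady = errors.New("workload not ready before attempts ran out")

// WaitAllReady polls getStatus up to maxAttempts times, sleeping interval
// between attempts, and returns nil as soon as all Pods are ready.
func WaitAllReady(getStatus func() WorkloadStatus, interval time.Duration, maxAttempts int) error {
	for attempt := 0; attempt < maxAttempts; attempt++ {
		if getStatus().AllReady() {
			return nil
		}
		time.Sleep(interval)
	}
	return ErrNotReady
}

// fakeStatus returns a poller whose workload gains one ready Pod per call.
func fakeStatus(desired int) func() WorkloadStatus {
	ready := 0
	return func() WorkloadStatus {
		s := WorkloadStatus{DesiredPods: desired, ReadyPods: ready}
		if ready < desired {
			ready++
		}
		return s
	}
}

func main() {
	if err := WaitAllReady(fakeStatus(3), 10*time.Millisecond, 10); err != nil {
		fmt.Println("task not started:", err)
		return
	}
	fmt.Println("all pods ready, task can start according to its spec")
}
```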
Collaborator

This file is not linked in doc/mkdoc.

Collaborator

done

workload ->>ingress: runtime destruction time reached, destroy ingress
task CR ->>kdoctor_controller: task CR deleted
kdoctor_controller ->> workload: task CR deleted, delete workload
workload ->> pod: workload deleted, delete pod
Collaborator

@weizhoublue, Aug 23, 2023

(1) This sequence diagram does not yet seem to make the report lifecycle clear.
(2) Could the diagram be followed by a few simple conclusions?
What is the lifecycle of a report (does deleting the CR mean its report is deleted as well)?

Collaborator

done

docs/mkdocs.yml Outdated
@@ -50,6 +50,7 @@ nav:
- AppHttpHealthy: reference/apphttphealthy.md
- NetReach: reference/netreach.md
- NetDns: reference/netdns.md
- Runtime: reference/runtime.md
Collaborator

Wouldn't this fit better in the concept section?

Collaborator

done, changed

@weizhoublue
Collaborator

Please fix the CI.

Signed-off-by: ii2day <ji.li@daocloud.io>
@weizhoublue weizhoublue merged commit 27d2481 into main Aug 24, 2023
24 of 26 checks passed