The descheduler assists kube-scheduler by periodically re-spreading Pods that are already running, keeping scheduling across the cluster as balanced as possible.
It can run in three modes: Job, CronJob, and Deployment.
Job/CronJob are easy to reason about, and exactly when they execute is under your control.
Which eviction logic runs is controlled by configuring strategies, and each strategy can additionally be filtered by namespace, node, and type. There are currently 10 strategies; the most useful ones are described below.
RemoveDuplicates ensures that at most one Pod associated with the same RS, Deployment, or Job runs on a given node. Any extra duplicates are evicted so that Pods spread out better across the cluster.
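As an illustration, here is a minimal policy sketch that enables RemoveDuplicates and restricts it to a single namespace, which also shows the namespace filtering mentioned above (the namespace name is a placeholder):

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "RemoveDuplicates":
    enabled: true
    params:
      namespaces:
        include:
        - "my-namespace"   # placeholder; use exclude: to invert the filter
```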
LowNodeUtilization: if the cpu/memory/pods requests on a node exceed the targetThresholds values, the node is considered over-loaded and the descheduler tries to migrate its Pods away; destination nodes must be below the thresholds values, i.e. considered under-loaded. A configuration example follows the notes below.
Notes:
- thresholds: a node only counts as a "low-load node" when all three conditions, cpu, memory, and pods (all percentages), are satisfied at once; cpu/memory are measured by request values.
- targetThresholds: a node counts as a "high-load node" when any one of cpu, memory, or pods exceeds its threshold.
- Pods are migrated from "high-load nodes" to "low-load nodes"; nodes matching neither condition are not rebalanced.
- If either the set of "high-load nodes" or the set of "low-load nodes" is empty, the strategy terminates immediately.
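A minimal LowNodeUtilization sketch; the percentage values are illustrative, not recommendations:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "LowNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # "low-load": a node must be below ALL of these (by request)
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
        # "high-load": a node becomes an eviction source once it exceeds ANY of these
        targetThresholds:
          "cpu": 50
          "memory": 50
          "pods": 50
```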
HighNodeUtilization finds under-utilized nodes and evicts their Pods, in the hope that those Pods get packed onto fewer nodes. Combined with node autoscaling, this strategy is intended to help trigger scale-down of under-utilized nodes. It must be used together with the scheduler scoring strategy MostAllocated.
Everything else matches LowNodeUtilization above; only the eviction target differs: this strategy evicts from the "low-load" nodes.
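A corresponding sketch, again with illustrative values; HighNodeUtilization only takes thresholds, since the under-loaded nodes are now the eviction sources:

```yaml
apiVersion: "descheduler/v1alpha1"
kind: "DeschedulerPolicy"
strategies:
  "HighNodeUtilization":
    enabled: true
    params:
      nodeResourceUtilizationThresholds:
        # nodes below ALL of these are drained so they can be scaled down
        thresholds:
          "cpu": 20
          "memory": 20
          "pods": 20
```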
Regardless of strategy, eviction follows these general constraints:
- Critical Pods are not evicted, e.g. Pods whose priorityClassName is set to system-cluster-critical or system-node-critical.
- Pods not managed by an RS, Deployment, or Job are not evicted.
- Pods created by a DaemonSet are not evicted.
- Pods using LocalStorage are not evicted unless evictLocalStoragePods: true is set.
- Pods with PVCs are not evicted unless ignorePvcPods: true is set.
- Under the LowNodeUtilization and RemovePodsViolatingInterPodAntiAffinity strategies, Pods are evicted from low to high priority; at equal priority, BestEffort Pods are evicted before Burstable and Guaranteed ones.
- Any Pod whose annotations carry descheduler.alpha.kubernetes.io/evict can be evicted; the annotation overrides the checks that would otherwise block eviction, letting users choose which Pods may be evicted (see the snippet after this list).
- If evicting a Pod fails, set --v=4 and look for the cause in the descheduler logs; Pods whose eviction would violate a PDB are not evicted.
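For example, opting a single Pod in via the annotation; the Pod name and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-pod                 # placeholder
  annotations:
    # overrides the checks that would otherwise block eviction
    descheduler.alpha.kubernetes.io/evict: "true"
spec:
  containers:
  - name: app                  # placeholder
    image: nginx
```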
Execution timing when deployed as a Deployment
- Specify the execution interval in the startup arguments:
command:
- "/bin/descheduler"
args:
- "--policy-config-file"
- "/policy-dir/policy.yaml"
- "--descheduling-interval"
- "5m"
- "--v"
- "3"
Does LowNodeUtilization guarantee Pods move from high-load to low-load nodes?
This is presumably why request values are used as the yardstick: the descheduler only evicts, and it is kube-scheduler that places the Pods again, so sharing the scheduler's request-based view keeps the two consistent and makes evicted Pods likely to land on the "low-load" nodes. If we switched to actual CPU utilization instead, we would have to account for the gap between requests and real usage (e.g. a node whose Pods request 30% of its CPU may actually be using 80%).
type DeschedulerServer struct {
	componentconfig.DeschedulerConfiguration

	Client         clientset.Interface
	Logs           *logs.Options
	SecureServing  *apiserveroptions.SecureServingOptionsWithLoopback
	DisableMetrics bool
}
type DeschedulerConfiguration struct {
	metav1.TypeMeta

	// DeschedulingInterval is the execution interval when running as a Deployment.
	DeschedulingInterval time.Duration

	KubeconfigFile string

	// PolicyConfigFile is the filepath to the descheduler policy configuration.
	PolicyConfigFile string

	// Dry run
	DryRun bool

	// Node selectors
	NodeSelector string

	// MaxNoOfPodsToEvictPerNode restricts maximum of pods to be evicted per node.
	MaxNoOfPodsToEvictPerNode int

	// EvictLocalStoragePods allows pods using local storage to be evicted.
	EvictLocalStoragePods bool

	// IgnorePVCPods sets whether PVC pods should be allowed to be evicted
	IgnorePVCPods bool

	// Logging specifies the options of logging.
	// Refer [Logs Options](https://github.com/kubernetes/component-base/blob/master/logs/options.go) for more information.
	Logging componentbaseconfig.LoggingConfiguration
}
type DeschedulerStrategy struct {
	// Enabled or disabled
	Enabled bool

	// Weight
	Weight int

	// Params holds the strategy-specific parameters
	Params *StrategyParameters
}
// Strategy parameters; these fields map onto the knobs each strategy can be configured with.
type StrategyParameters struct {
	NodeResourceUtilizationThresholds *NodeResourceUtilizationThresholds
	NodeAffinityType                  []string
	PodsHavingTooManyRestarts         *PodsHavingTooManyRestarts
	PodLifeTime                       *PodLifeTime
	RemoveDuplicates                  *RemoveDuplicates
	FailedPods                        *FailedPods
	IncludeSoftConstraints            bool
	Namespaces                        *Namespaces
	ThresholdPriority                 *int32
	ThresholdPriorityClassName        string
	LabelSelector                     *metav1.LabelSelector
	NodeFit                           bool
}
The entry point is the usual main function in cmd/descheduler/descheduler.go. NewDeschedulerCommand() does the cobra initialization and argument parsing, then control goes run() -> RunDeschedulerStrategies(). Run() initializes the clientset and the policy:
func Run(rs *options.DeschedulerServer) error {
	······
	rsclient, err := client.CreateClient(rs.KubeconfigFile)
	······
	deschedulerPolicy, err := LoadPolicyConfig(rs.PolicyConfigFile)
	······
	stopChannel := make(chan struct{})
	return RunDeschedulerStrategies(ctx, rs, deschedulerPolicy, evictionPolicyGroupVersion, stopChannel)
}
RunDeschedulerStrategies() is the main driver of the flow. At its core is a loop whose period is the interval configured on the startup command line:
wait.Until(func() {
	······
}, rs.DeschedulingInterval, stopChannel)
The main loop iterates over the strategies in the user-supplied policy and looks each one up in strategyFuncs, the map where the callback functions are registered; if a strategy is enabled, its callback is invoked directly:
for name, strategy := range deschedulerPolicy.Strategies {
	if f, ok := strategyFuncs[name]; ok {
		if strategy.Enabled {
			f(ctx, rs.Client, strategy, nodes, podEvictor)
		}
	} else {
		klog.ErrorS(fmt.Errorf("unknown strategy name"), "skipping strategy", "strategy", name)
	}
}
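For reference, strategyFuncs is roughly a name-to-implementation map like the following (paraphrased; the exact entries and the key/function types vary by version):

```go
strategyFuncs := map[string]strategyFunction{
	"RemoveDuplicates":                        strategies.RemoveDuplicatePods,
	"LowNodeUtilization":                      nodeutilization.LowNodeUtilization,
	"HighNodeUtilization":                     nodeutilization.HighNodeUtilization,
	"RemovePodsViolatingInterPodAntiAffinity": strategies.RemovePodsViolatingInterPodAntiAffinity,
	"PodLifeTime":                             strategies.PodLifeTime,
	// ...the remaining strategies register here the same way
}
```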
Let's take one strategy as an example to see how a rule is processed internally:
func LowNodeUtilization(ctx context.Context, client clientset.Interface, strategy api.DeschedulerStrategy, nodes []*v1.Node, podEvictor *evictions.PodEvictor) {
	// validate the strategy parameters
	if err := validateNodeUtilizationParams(strategy.Params); err != nil {
	······
	thresholdPriority, err := utils.GetPriorityFromStrategyParams(ctx, client, strategy.Params)
	······
	// fetch the threshold values; some more initialization follows
	thresholds := strategy.Params.NodeResourceUtilizationThresholds.Thresholds
	targetThresholds := strategy.Params.NodeResourceUtilizationThresholds.TargetThresholds
	······
	resourceNames := getResourceNames(thresholds)
	// this call is the heart of the strategy
	lowNodes, sourceNodes := classifyNodes(
		getNodeUsage(ctx, client, nodes, thresholds, targetThresholds, resourceNames),
		// this closure acts as the lowThresholdFilter
		func(node *v1.Node, usage NodeUsage) bool {
			if nodeutil.IsNodeUnschedulable(node) {
				klog.V(2).InfoS("Node is unschedulable, thus not considered as underutilized", "node", klog.KObj(node))
				return false
			}
			return isNodeWithLowUtilization(usage)
		},
		// this closure acts as the highThresholdFilter
		func(node *v1.Node, usage NodeUsage) bool {
			return isNodeAboveTargetUtilization(usage)
		},
	)
	// build the "node under utilization" criteria into one log message
	keysAndValues := []interface{}{
		"CPU", thresholds[v1.ResourceCPU],
		"Mem", thresholds[v1.ResourceMemory],
		"Pods", thresholds[v1.ResourcePods],
	}
	for name := range thresholds {
		if !isBasicResource(name) {
			keysAndValues = append(keysAndValues, string(name), int64(thresholds[name]))
		}
	}
	klog.V(1).InfoS("Criteria for a node under utilization", keysAndValues...) // restored from the full source; elided in the excerpt
	// build the "node above target utilization" criteria the same way
	keysAndValues = []interface{}{
		"CPU", targetThresholds[v1.ResourceCPU],
		"Mem", targetThresholds[v1.ResourceMemory],
		"Pods", targetThresholds[v1.ResourcePods],
	}
	for name := range targetThresholds {
		if !isBasicResource(name) {
			keysAndValues = append(keysAndValues, string(name), int64(targetThresholds[name]))
		}
	}
	klog.V(1).InfoS("Criteria for a node above target utilization", keysAndValues...) // restored from the full source; elided in the excerpt
	if len(lowNodes) == 0 {
		klog.V(1).InfoS("No node is underutilized, nothing to do here, you might tune your thresholds further")
		return
	}
	if len(lowNodes) <= strategy.Params.NodeResourceUtilizationThresholds.NumberOfNodes {
		klog.V(1).InfoS("Number of nodes underutilized is less or equal than NumberOfNodes, nothing to do here", "underutilizedNodes", len(lowNodes), "numberOfNodes", strategy.Params.NodeResourceUtilizationThresholds.NumberOfNodes)
		return
	}
	if len(lowNodes) == len(nodes) {
		klog.V(1).InfoS("All nodes are underutilized, nothing to do here")
		return
	}
	if len(sourceNodes) == 0 {
		klog.V(1).InfoS("All nodes are under target utilization, nothing to do here")
		return
	}
	evictable := podEvictor.Evictable(evictions.WithPriorityThreshold(thresholdPriority), evictions.WithNodeFit(nodeFit))
	// stop if node utilization drops below target threshold or any of required capacity (cpu, memory, pods) is moved
	continueEvictionCond := func(nodeUsage NodeUsage, totalAvailableUsage map[v1.ResourceName]*resource.Quantity) bool {
		if !isNodeAboveTargetUtilization(nodeUsage) {
			return false
		}
		for name := range totalAvailableUsage {
			if totalAvailableUsage[name].CmpInt64(0) < 1 {
				return false
			}
		}
		return true
	}
	evictPodsFromSourceNodes(
		ctx,
		sourceNodes,
		lowNodes,
		podEvictor,
		evictable.IsEvictable,
		resourceNames,
		"LowNodeUtilization",
		continueEvictionCond)
}
classifyNodes() is the key function here: it takes the per-node usage plus two filter functions, uses those filters to split nodes into low- and high-utilization groups, and returns two slices of NodeUsage structs.
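A paraphrased sketch of classifyNodes() (the real function lives under pkg/descheduler/strategies/nodeutilization; details vary by version):

```go
func classifyNodes(
	nodeUsages []NodeUsage,
	lowThresholdFilter, highThresholdFilter func(node *v1.Node, usage NodeUsage) bool,
) ([]NodeUsage, []NodeUsage) {
	lowNodes, highNodes := []NodeUsage{}, []NodeUsage{}
	for _, usage := range nodeUsages {
		if lowThresholdFilter(usage.node, usage) {
			// below ALL thresholds: a candidate destination
			lowNodes = append(lowNodes, usage)
		} else if highThresholdFilter(usage.node, usage) {
			// above ANY targetThreshold: an eviction source
			highNodes = append(highNodes, usage)
		}
		// nodes matching neither filter are left untouched
	}
	return lowNodes, highNodes
}
```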