
Support interpodaffinity and podtopologyspread #61

Merged · 1 commit merged into kubewharf:dev/pod-affinity on Oct 31, 2024

Conversation

@slipegg commented Aug 6, 2024

I am a participant in the godel-scheduler project of OSPP (Open Source Promotion Plan). I have added the InterPodAffinity and PodTopologySpread plugins and their related test files to godel-scheduler as required.

Considering the overall structure of the godel-scheduler project:

I have completed testing the basic functions locally. If there are any areas that need improvement, I would appreciate your suggestions.

@CLAassistant commented Aug 6, 2024

CLA assistant check
All committers have signed the CLA.

// `maxSkew-1` is added to the score so that differences between topology
// domains get watered down, controlling the tolerance of the score to skews.
func scoreForCount(cnt int64, maxSkew int32, tpWeight float64) float64 {
	return float64(cnt)*tpWeight + float64(maxSkew-1)
}
Contributor:

What are the parameters maxSkew and tpWeight for? What will be the difference if these two parameters are removed (i.e. score = cnt)?

@slipegg (Author) replied Sep 24, 2024:

This is the original k8s calculation. I am not entirely sure why it is computed this way, but my understanding is as follows:

  • maxSkew: the maximum imbalance allowed by the constraint; maxSkew = (maximum number of pods in any topology domain) - (minimum number of pods in any topology domain)
  • tpWeight: a weight derived from the number of filtered nodes; tpWeight = log(len(filterNodes) + 2)
  • cnt: the number of matched pods

In this formula, the float64(cnt)*tpWeight term means that, for the same matched pod count, the score leans toward the topology domain backed by more nodes. The +float64(maxSkew-1) term is a small offset: it only partially reflects the difference between constraints with different maxSkew values, and it also makes scheduling more stable when very few pods have been placed so far.
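
For illustration, here is a small, self-contained Go sketch of how these pieces can fit together. The topologyNormalizingWeight helper follows the upstream Kubernetes idea of weighting a domain by log(number of filtered nodes + 2); the helper name and the example values are assumptions made for this sketch, not godel-scheduler's actual code.

```go
package main

import (
	"fmt"
	"math"
)

// topologyNormalizingWeight is an illustrative stand-in for tpWeight:
// log(filteredNodeCount + 2). The +2 keeps the weight positive and softens
// the effect for domains backed by very few nodes.
func topologyNormalizingWeight(filteredNodeCount int) float64 {
	return math.Log(float64(filteredNodeCount + 2))
}

// scoreForCount mirrors the snippet under review: the matched pod count scaled
// by the topology weight, plus maxSkew-1 as a small constant offset.
func scoreForCount(cnt int64, maxSkew int32, tpWeight float64) float64 {
	return float64(cnt)*tpWeight + float64(maxSkew-1)
}

func main() {
	// Two domains with the same matched pod count: the domain backed by more
	// filtered nodes gets the larger weight and therefore the larger raw score.
	small := scoreForCount(3, 1, topologyNormalizingWeight(5))
	large := scoreForCount(3, 1, topologyNormalizingWeight(50))
	fmt.Printf("small domain: %.2f, large domain: %.2f\n", small, large)
}
```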

@slipegg (Author) commented Oct 14, 2024

I manually tested the InterPodAffinity and PodTopologySpread scheduling plugins, and both passed verification. The documentation (in Chinese) is available here: https://xeh61jru4u.feishu.cn/docx/NwrMdx3Hjoi47fxXJf4cXNNOnof.

	return framework.NewStatus(framework.Error, err.Error())
}

state := utils.GetPreFilterState(pod, nodeInfos, constraints)
Contributor:

There is an uncommon case: the pod can be scheduled only after preempting victims. In that case, the removal of the victims should be reflected in the state, just as the AddPod and RemovePod functions of the InterPodAffinity/PodTopologySpread plugins do.

Author:

Does this mean I need to add a RemoveVictim function? How does this differ from the existing RemovePod function?
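
For illustration only, the kind of state adjustment described in the review comment above could look like the following simplified sketch. The preFilterState layout, its fields, and the removeVictim helper are stand-ins invented for this example; they are not godel-scheduler's actual types or the plugin's real RemovePod signature.

```go
package example

import v1 "k8s.io/api/core/v1"

// topologyPair identifies one topology domain, e.g. {"topology.kubernetes.io/zone", "zone-a"}.
type topologyPair struct {
	key   string
	value string
}

// preFilterState is a simplified stand-in: matching pod counts per topology domain.
type preFilterState struct {
	counts map[topologyPair]int
}

// removeVictim decrements the count of the domain the victim currently occupies,
// mirroring what a RemovePod-style hook does during normal scheduling, so that a
// pod which only fits after preemption is checked against the post-preemption state.
func (s *preFilterState) removeVictim(victim *v1.Pod, victimNodeLabels map[string]string, topologyKey string) {
	_ = victim // a real implementation would also re-check the constraint's label selector against the victim
	pair := topologyPair{key: topologyKey, value: victimNodeLabels[topologyKey]}
	if c, ok := s.counts[pair]; ok && c > 0 {
		s.counts[pair] = c - 1
	}
}
```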

@rookie0080 (Contributor) commented:

> I manually tested the InterPodAffinity and PodTopologySpread scheduling plugins, and both passed verification. The documentation (in Chinese) is available here: https://xeh61jru4u.feishu.cn/docx/NwrMdx3Hjoi47fxXJf4cXNNOnof.

Please also attach the benchmark testing doc for other reviewers.

@slipegg (Author) commented Oct 28, 2024

> I manually tested the InterPodAffinity and PodTopologySpread scheduling plugins, and both passed verification. The documentation (in Chinese) is available here: https://xeh61jru4u.feishu.cn/docx/NwrMdx3Hjoi47fxXJf4cXNNOnof.
>
> Please also attach the benchmark testing doc for other reviewers.

I used kind + kwok to simulate 5,000 nodes and tested the performance of the two scheduling plugins, InterPodAffinity and PodTopologySpread. The detailed test document (in Chinese) is here: https://xeh61jru4u.feishu.cn/docx/JhNIdmyiZo19Ibxns2ucin0OnZd

The test results are as follows:

| Constraint Type | Case | Maximum e2e Scheduling Speed (pods/s) |
| --- | --- | --- |
| Filter (Hard Constraints) | MaxSkew of 1 for topology uniform distribution at region level | 151 |
| Filter (Hard Constraints) | Inter-pod affinity constraints at region level | 107 |
| Filter (Hard Constraints) | Inter-pod anti-affinity constraints at region level | 99.5 |
| Filter (Hard Constraints) | Inter-pod anti-affinity constraints with existing Pods at region level | 123 |
| Score (Soft Constraints) | PodTopologySpread with MaxSkew of 1 for topology uniform distribution at region level | 383 (Flattened Dispatcher) |
| Score (Soft Constraints) | Inter-pod affinity constraints at region level | 336 (Flattened Dispatcher) |
| Score (Soft Constraints) | Inter-pod anti-affinity constraints at region level | 342 (Flattened Dispatcher) |
| Score (Soft Constraints) | Inter-pod anti-affinity constraints with existing Pods at region level | 314 (Flattened Dispatcher) |


func (pl *PodTopologySpreadCheck) getTopologyCondition(pod *v1.Pod) (*TopologySpreadCondition, error) {
	constraints := []podtopologyspreadScheduler.TopologySpreadConstraint{}
	allNodes, err := pl.frameworkHandle.SharedInformerFactory().Core().V1().Nodes().Lister().List(labels.Everything())
Author:

I now understand that a node can be an NMNode, meaning it is managed by the node manager. So when listing all nodes here, we should fetch NodeInfo rather than only the nodes managed by the kubelet. However, I am not sure how to get all NodeInfo through frameworkHandle; I have not found any relevant code so far. If you can tell me how to get it, that would be very helpful.

Author:

Ohhh~ I found that I can get NMNodes via BinderFrameworkHandle.CRDClientSet. I will fix the related code in the binder.

test/e2e/scheduling/hard_constraints.go (review comments resolved)
@slipegg force-pushed the develop branch 2 times, most recently from a6196c9 to d1a9442 on October 28, 2024 at 12:45
@binacs (Member) left a comment:

Great work, but there are some comments

@@ -34,6 +34,7 @@ import (
 	commoncache "github.com/kubewharf/godel-scheduler/pkg/common/cache"
 	commonstore "github.com/kubewharf/godel-scheduler/pkg/common/store"
 	framework "github.com/kubewharf/godel-scheduler/pkg/framework/api"
+	schedulerCache "github.com/kubewharf/godel-scheduler/pkg/scheduler/cache"
Member:

We shouldn't import scheduler cache in binder.

Move NodeSlice to util pkg if needed.

Author:

OK, I have moved NodeSlice to the public pkg/common/store.

@@ -217,3 +227,7 @@ func (cache *binderCache) FindStore(storeName commonstore.StoreName) commonstore
 	defer cache.mu.RUnlock()
 	return cache.CommonStoresSwitch.Find(storeName)
 }
 
+func (cache *binderCache) List() []framework.NodeInfo {
+	return append(cache.nodeSlices.InPartitionNodeSlice.Nodes(), cache.nodeSlices.OutOfPartitionNodeSlice.Nodes()...)
Member:

Need lock.

BTW it's dangerous to expose all of the nodes.

Author:

Yes, I have added a lock.

Because the binder needs all NodeInfo when performing topology checks for InterPodAffinity and PodTopologySpread, I added this function. Is there a safer approach?
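
For reference, a read-locked List could look like the following self-contained sketch. The types below are simplified stand-ins for binderCache and its node slices, invented for this example rather than taken from the PR.

```go
package main

import (
	"fmt"
	"sync"
)

// nodeSlice is a simplified stand-in for the in-/out-of-partition node slices.
type nodeSlice struct{ names []string }

func (s *nodeSlice) Nodes() []string { return s.names }

// binderCache is a simplified stand-in that carries the RWMutex the reviewer asked for.
type binderCache struct {
	mu             sync.RWMutex
	inPartition    nodeSlice
	outOfPartition nodeSlice
}

// List takes a read lock and returns a freshly allocated slice, so callers never
// hold a reference to the cache's internal backing slices while other goroutines mutate them.
func (c *binderCache) List() []string {
	c.mu.RLock()
	defer c.mu.RUnlock()
	in, out := c.inPartition.Nodes(), c.outOfPartition.Nodes()
	all := make([]string, 0, len(in)+len(out))
	all = append(all, in...)
	return append(all, out...)
}

func main() {
	c := &binderCache{
		inPartition:    nodeSlice{names: []string{"node-a"}},
		outOfPartition: nodeSlice{names: []string{"node-b"}},
	}
	fmt.Println(c.List()) // [node-a node-b]
}
```

Returning a fresh slice on every call also softens the "exposing all nodes" concern somewhat, since callers cannot mutate the cache's own backing slices through the returned value.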

@@ -67,7 +69,10 @@ func DefaultUnitQueueSortFunc() framework.UnitLessFunc {
 func NewBasePlugins(victimsCheckingPlugins []*framework.VictimCheckingPluginCollectionSpec) *apis.BinderPluginCollection {
 	// TODO add some default plugins later
 	basicPlugins := apis.BinderPluginCollection{
-		CheckTopology: []string{},
+		CheckTopology: []string{
Member:

nit: add these plugins on demand. We don't need to prepare data when the incoming Pod doesn't have cross-node constraints.

Author:

For InterPodAffinity, anti-affinity declared by existing pods can affect the pod being checked, so the check may still be needed even when the incoming pod declares no affinity terms, and it is hard to determine in advance whether InterPodAffinity is required. The approach here is to enable the InterPodAffinity and PodTopologySpread checks by default; inside each check we first verify whether the pod actually carries the relevant constraint and return early if it does not.
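
To illustrate the early-return idea (not the PR's actual code; the helper name and placement are invented for this sketch), a guard on the incoming pod could look like this:

```go
package binderplugins

import (
	v1 "k8s.io/api/core/v1"
)

// podDeclaresCrossNodeConstraints reports whether the incoming pod itself carries
// topology spread or pod (anti-)affinity terms. For InterPodAffinity this alone is
// not enough to skip the plugin, because anti-affinity declared by already-running
// pods can still reject the incoming pod; that is why the check stays enabled by
// default and only short-circuits inside the plugin when nothing applies.
func podDeclaresCrossNodeConstraints(pod *v1.Pod) bool {
	if len(pod.Spec.TopologySpreadConstraints) > 0 {
		return true
	}
	aff := pod.Spec.Affinity
	return aff != nil && (aff.PodAffinity != nil || aff.PodAntiAffinity != nil)
}
```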

if status != nil {
	return status
}
// topologyLabels := nodeInfo.GetNodeLabels(podLauncher)
Member:

Cleanup

Author:

Done

	}, nil
}

// func (pl *InterPodAffinity) getNodesWithSameTopologyLabels(topologyLabels map[string]string) ([]framework.NodeInfo, error) {
Member:

Cleanup

Author:

Done

@@ -311,7 +311,7 @@ func (m *PodInfoMaintainer) ReservedPodsNum() int {
 func (m *PodInfoMaintainer) GetPods() []*PodInfo {
 	pods := make([]*PodInfo, 0, m.bePodsMayBePreempted.Len()+m.gtPodsMayBePreempted.Len()+len(m.neverBePreempted))
 	rangeSplay := func(s splay.Splay) {
-		s.Range(func(so splay.StoredObj) {
+		s.RangeNoOrder(func(so splay.StoredObj) {
Member:

Do not modify the underlying basic implementation for specific plugins.

Author:

OK, I have reverted this change.

The reason I modified this part before was that the InterPodAffinity and PodTopologySpread plugins need to fetch a NodeInfo's pods frequently, and the original design collected all pods via a DFS traversal, which performed poorly because of the heavy recursion.

In the future I may try a more elegant way to optimize the performance of GetPods in a separate PR.

@@ -38,126 +36,6 @@ func makePriority(priority int32) *int32 {
 	return &priority
 }
 
-func TestPodInfoMaintainer_GetPods(t *testing.T) {
Member:

Why delete this?

Author:

I have reverted this change.

I deleted the test because I had changed GetPods to return pods unordered for better performance, and this test requires an ordered result.
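
For reference, if GetPods were ever made unordered again, the test could be kept by making its assertion order-insensitive. A minimal sketch, using a simplified stand-in for PodInfo since the real type lives in godel-scheduler:

```go
package example

import (
	"reflect"
	"sort"
	"testing"

	v1 "k8s.io/api/core/v1"
)

// PodInfo is a simplified stand-in; the real type is assumed to expose its *v1.Pod.
type PodInfo struct {
	Pod *v1.Pod
}

// podNames extracts a sorted list of pod names so that comparisons ignore order.
func podNames(infos []*PodInfo) []string {
	names := make([]string, 0, len(infos))
	for _, pi := range infos {
		names = append(names, pi.Pod.Name)
	}
	sort.Strings(names)
	return names
}

// assertSamePods fails the test when the two pod sets differ, regardless of order.
func assertSamePods(t *testing.T, got, want []*PodInfo) {
	t.Helper()
	if g, w := podNames(got), podNames(want); !reflect.DeepEqual(g, w) {
		t.Errorf("GetPods() returned %v, want %v", g, w)
	}
}
```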

}

for _, node := range allV1Nodes {
	nodeInfo := frameworkHandle.GetNodeInfo(node.Name)
Member:

NodeInfo may be nil

Author:

This function is no longer used. I have deleted it.

nodeInfoMap := map[string]framework.NodeInfo{}
for _, podLauncher := range podutil.PodLanucherTypes {
	if podLauncher == podutil.Kubelet && frameworkHandle.SharedInformerFactory() != nil {
		allV1Nodes, err := frameworkHandle.SharedInformerFactory().Core().V1().Nodes().Lister().List(labels.Everything())
Member:

I don't understand, why not schedule pods based on Cache?

cc @rookie0080

Author:

Ohhh~ this was my oversight. This is the previous implementation and the function is no longer used; I have switched to using the cache. I will delete this function.

Author:

This function is no longer used. I have deleted it.

@slipegg changed the base branch from main to dev/pod-affinity on October 29, 2024 at 12:23
@binacs (Member) left a comment:

Just a comment. Otherwise lgtm


var GlobalNodeInfoPlaceHolder = framework.NewNodeInfo()

type NodeSlices struct {
Member:

Please move this code to pkg/framework/api/nodeinfo_hashslice.go.

The data structure is not bound to the CommonStore mechanism.

Author:

OK, I have moved it to pkg/framework/api/nodeinfo_hashslice.go.

@binacs (Member) commented Oct 30, 2024

lgtm, please squash the commits

cc @rookie0080

@NickrenREN merged commit cb8907b into kubewharf:dev/pod-affinity on Oct 31, 2024
1 check passed