Support interpodaffinity and podtopologyspread #61
Conversation
Force-pushed from 373d7da to 743b782
pkg/binder/framework/plugins/interpodaffinity/interpodaffinity.go (outdated review thread, resolved)
pkg/binder/framework/plugins/podtopologyspread/podtopologyspread.go (outdated review thread, resolved)
// `maxSkew-1` is added to the score so that differences between topology
// domains get watered down, controlling the tolerance of the score to skews.
func scoreForCount(cnt int64, maxSkew int32, tpWeight float64) float64 {
	return float64(cnt)*tpWeight + float64(maxSkew-1)
What are the parameters `maxSkew` and `tpWeight` for? What would be the difference if these two parameters were removed (i.e. score = cnt)?
This is the original calculation method in k8s. I am not entirely sure why it is calculated this way, but I can try to explain it as follows:
- `maxSkew`: the maximum imbalance allowed by this constraint, maxSkew = (maximum number of pods in a topology domain) - (minimum number of pods in a topology domain)
- `tpWeight`: a weight derived from the number of filtered nodes, tpWeight = log(len(filterNodes) + 2)
- `cnt`: the number of matched pods
In this formula, `float64(cnt)*tpWeight` means that when the number of matched pods is the same, scheduling leans toward the topology domain with more nodes, while `+float64(maxSkew-1)` is a small additive term that only reflects the difference between constraints to a certain extent; it also makes scheduling more stable at the beginning, when very few pods have been scheduled.
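To make the two terms concrete, here is a small self-contained sketch (the node and pod counts are made up for illustration; they are not from this PR or its benchmarks) that evaluates the same formula:

package main

import (
	"fmt"
	"math"
)

// scoreForCount mirrors the helper quoted above.
func scoreForCount(cnt int64, maxSkew int32, tpWeight float64) float64 {
	return float64(cnt)*tpWeight + float64(maxSkew-1)
}

func main() {
	// tpWeight = log(numFilteredNodes + 2), following the explanation above.
	wFew := math.Log(float64(10 + 2))   // ~2.48 for 10 nodes
	wMany := math.Log(float64(100 + 2)) // ~4.62 for 100 nodes

	// Same matched-pod count (3) and the default maxSkew of 1: the
	// (maxSkew-1) term is zero, so only tpWeight separates the scores.
	fmt.Printf("%.2f %.2f\n", scoreForCount(3, 1, wFew), scoreForCount(3, 1, wMany)) // 7.45 13.87

	// With maxSkew=5 the constant (maxSkew-1)=4 is added to both values,
	// which shrinks their relative difference, i.e. the score becomes more
	// tolerant of skews between topology domains.
	fmt.Printf("%.2f %.2f\n", scoreForCount(3, 5, wFew), scoreForCount(3, 5, wMany)) // 11.45 17.87
}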
pkg/binder/framework/plugins/interpodaffinity/interpodaffinity.go (three outdated review threads, resolved)
I manually tested the InterPodAffinity and PodTopologySpread scheduling plugins, and both passed verification. The document (Chinese version) can be viewed here: https://xeh61jru4u.feishu.cn/docx/NwrMdx3Hjoi47fxXJf4cXNNOnof
		return framework.NewStatus(framework.Error, err.Error())
	}

	state := utils.GetPreFilterState(pod, nodeInfos, constraints)
There is an uncommon case: the pod can be scheduled only after preempting victims. In this case, the removal of the victims should be reflected in the state, just as the AddPod and RemovePod functions of the InterPodAffinity/PodTopologySpread plugins do.
Does this mean I need to add a RemoveVictim function? How does this differ from the existing RemovePod function?
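For illustration only (the type and helper names below are hypothetical, not the plugins' real API), the idea of making the state aware of the victims is roughly to subtract each victim's contribution from the precomputed counters before re-running the check, which is the same kind of adjustment a RemovePod-style hook performs:

package topologystate

// topologyState is a stand-in for the precomputed pre-filter state: how many
// matching pods each topology domain (e.g. "zone=az-1") currently holds.
type topologyState struct {
	countsByDomain map[string]int64
}

// removeVictims decrements the counters for the domains of pods that will be
// preempted, so the subsequent check sees the cluster as it will look after
// the victims are gone.
func (s *topologyState) removeVictims(victimDomains []string) {
	for _, domain := range victimDomains {
		if s.countsByDomain[domain] > 0 {
			s.countsByDomain[domain]--
		}
	}
}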
Please also attach the benchmark testing doc for other reviewers.
I used kind + kwok to simulate 5,000 nodes and tested the performance of the two scheduling plugins, InterPodAffinity and PodTopologySpread. Here is the detailed test document (Chinese version): https://xeh61jru4u.feishu.cn/docx/JhNIdmyiZo19Ibxns2ucin0OnZd
The test results are as follows:
[test result images]
func (pl *PodTopologySpreadCheck) getTopologyCondition(pod *v1.Pod) (*TopologySpreadCondition, error) {
	constraints := []podtopologyspreadScheduler.TopologySpreadConstraint{}
	allNodes, err := pl.frameworkHandle.SharedInformerFactory().Core().V1().Nodes().Lister().List(labels.Everything())
Now I know that a node can be an NMNode, which means it can be managed by the node manager. So when getting all nodes here, we should get nodeInfo instead of only the nodes managed by kubelet. But I am not sure how to get all nodeInfo through frameworkHandle; I have not found any relevant code so far. If you can tell me how to get it, that would be very helpful.
Ohhh~ I found that I can get NMNodes via BinderFrameworkHandle.CRDClientSet. I will fix the related code in the binder.
pkg/binder/framework/plugins/interpodaffinity/interpodaffinity.go (six outdated review threads, resolved)
Force-pushed from a6196c9 to d1a9442
Great work, but there are some comments.
pkg/binder/cache/cache.go
Outdated
@@ -34,6 +34,7 @@ import (
 	commoncache "github.com/kubewharf/godel-scheduler/pkg/common/cache"
 	commonstore "github.com/kubewharf/godel-scheduler/pkg/common/store"
 	framework "github.com/kubewharf/godel-scheduler/pkg/framework/api"
+	schedulerCache "github.com/kubewharf/godel-scheduler/pkg/scheduler/cache"
We shouldn't import scheduler cache in binder.
Move NodeSlice to util pkg if needed.
Ok, I have moved NodeSlice to the public pkg/common/store package.
@@ -217,3 +227,7 @@ func (cache *binderCache) FindStore(storeName commonstore.StoreName) commonstore
 	defer cache.mu.RUnlock()
 	return cache.CommonStoresSwitch.Find(storeName)
 }
+
+func (cache *binderCache) List() []framework.NodeInfo {
+	return append(cache.nodeSlices.InPartitionNodeSlice.Nodes(), cache.nodeSlices.OutOfPartitionNodeSlice.Nodes()...)
Need lock.
BTW it's dangerous to expose all of the nodes.
Yes, I have added a lock.
Because the binder needs all of the nodeInfo when performing topology checks for InterPodAffinity and PodTopologySpread, I added this function. Is there a safer method?
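A minimal sketch of what the locked version might look like, assuming it reuses the cache.mu mutex visible in the FindStore hunk above and copies the results into a fresh slice so callers cannot append into the cache's internal backing arrays (this is an assumption about the fix, not the exact code in the PR):

func (cache *binderCache) List() []framework.NodeInfo {
	cache.mu.RLock()
	defer cache.mu.RUnlock()

	in := cache.nodeSlices.InPartitionNodeSlice.Nodes()
	out := cache.nodeSlices.OutOfPartitionNodeSlice.Nodes()

	// Return a copy rather than the internal slices themselves.
	nodes := make([]framework.NodeInfo, 0, len(in)+len(out))
	nodes = append(nodes, in...)
	nodes = append(nodes, out...)
	return nodes
}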
@@ -67,7 +69,10 @@ func DefaultUnitQueueSortFunc() framework.UnitLessFunc {
 func NewBasePlugins(victimsCheckingPlugins []*framework.VictimCheckingPluginCollectionSpec) *apis.BinderPluginCollection {
 	// TODO add some default plugins later
 	basicPlugins := apis.BinderPluginCollection{
-		CheckTopology: []string{},
+		CheckTopology: []string{
nit: add these plugins on demand. We don't need to prepare data when the incoming Pod doesn't have cross-node constraints.
Because of the InterPodAffinity constraint, the anti-affinity of existing pods can affect the pod being checked, so we need to run the InterPodAffinity check even for pods without their own affinity terms, and it is difficult to determine in advance whether it is needed. The approach here is to run InterPodAffinity and PodTopologySpread by default, but within each check we first verify whether the pod being checked actually has the corresponding constraint and return quickly if it does not.
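A rough sketch of that fast path for the PodTopologySpread case (the names here are illustrative, not godel's actual plugin types; for InterPodAffinity the pre-check would also have to consider existing pods' anti-affinity, as explained above):

package topologycheck

import (
	v1 "k8s.io/api/core/v1"
)

// needsSpreadCheck reports whether the incoming pod declares any topology
// spread constraints at all.
func needsSpreadCheck(pod *v1.Pod) bool {
	return len(pod.Spec.TopologySpreadConstraints) > 0
}

// checkTopologySpread is a stand-in for the plugin's check entry point.
func checkTopologySpread(pod *v1.Pod) error {
	if !needsSpreadCheck(pod) {
		// Fast path: the pod has no spread constraints, nothing to validate,
		// so enabling the plugin by default stays cheap.
		return nil
	}
	// ... the full per-domain skew check would run here ...
	return nil
}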
	if status != nil {
		return status
	}
	// topologyLabels := nodeInfo.GetNodeLabels(podLauncher)
Cleanup
Done
	}, nil
}

// func (pl *InterPodAffinity) getNodesWithSameTopologyLabels(topologyLabels map[string]string) ([]framework.NodeInfo, error) {
Cleanup
Done
@@ -311,7 +311,7 @@ func (m *PodInfoMaintainer) ReservedPodsNum() int {
 func (m *PodInfoMaintainer) GetPods() []*PodInfo {
 	pods := make([]*PodInfo, 0, m.bePodsMayBePreempted.Len()+m.gtPodsMayBePreempted.Len()+len(m.neverBePreempted))
 	rangeSplay := func(s splay.Splay) {
-		s.Range(func(so splay.StoredObj) {
+		s.RangeNoOrder(func(so splay.StoredObj) {
Do not modify the underlying basic implementation for specific plugins.
Ok, I have reverted this commit.
The reason I modified this part was that the InterPodAffinity and PodTopologySpread plugins need to obtain the pods of a nodeInfo frequently, and the original design collected all pods via a DFS traversal, which performed poorly because of the frequent recursion.
Maybe in the future I can try a more elegant method to optimize the performance of GetPods in another PR.
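Purely as a hypothetical illustration of such a method (this is not code from the PR or from PodInfoMaintainer): one option would be to maintain a flat index next to the ordered structure, so GetPods becomes a plain copy with no recursion, at the cost of a little bookkeeping in Add/Delete:

package podindex

import v1 "k8s.io/api/core/v1"

// flatPodIndex keeps an unordered slice of pods alongside whatever ordered
// structure the maintainer already uses. GetPods is then an O(n) copy instead
// of a recursive tree traversal.
type flatPodIndex struct {
	pods     []*v1.Pod
	keys     []string       // keys[i] is the key of pods[i]
	position map[string]int // pod key -> index in pods/keys
}

func newFlatPodIndex() *flatPodIndex {
	return &flatPodIndex{position: map[string]int{}}
}

func (idx *flatPodIndex) Add(key string, pod *v1.Pod) {
	if i, ok := idx.position[key]; ok {
		idx.pods[i] = pod // update in place
		return
	}
	idx.position[key] = len(idx.pods)
	idx.pods = append(idx.pods, pod)
	idx.keys = append(idx.keys, key)
}

func (idx *flatPodIndex) Delete(key string) {
	i, ok := idx.position[key]
	if !ok {
		return
	}
	last := len(idx.pods) - 1
	// Swap-delete: move the last entry into the freed slot.
	idx.pods[i], idx.keys[i] = idx.pods[last], idx.keys[last]
	idx.position[idx.keys[i]] = i
	idx.pods, idx.keys = idx.pods[:last], idx.keys[:last]
	delete(idx.position, key)
}

// GetPods returns a copy in no particular order.
func (idx *flatPodIndex) GetPods() []*v1.Pod {
	out := make([]*v1.Pod, len(idx.pods))
	copy(out, idx.pods)
	return out
}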
@@ -38,126 +36,6 @@ func makePriority(priority int32) *int32 {
 	return &priority
 }

-func TestPodInfoMaintainer_GetPods(t *testing.T) {
Why delete this?
I have reverted this commit.
The reason I deleted this test is that I changed GetPods to be unordered for higher performance, and this test requires GetPods to be ordered.
pkg/plugins/helper/node_info.go
Outdated
	}

	for _, node := range allV1Nodes {
		nodeInfo := frameworkHandle.GetNodeInfo(node.Name)
NodeInfo may be nil
This function is no longer used. I have deleted it.
pkg/plugins/helper/node_info.go
Outdated
	nodeInfoMap := map[string]framework.NodeInfo{}
	for _, podLauncher := range podutil.PodLanucherTypes {
		if podLauncher == podutil.Kubelet && frameworkHandle.SharedInformerFactory() != nil {
			allV1Nodes, err := frameworkHandle.SharedInformerFactory().Core().V1().Nodes().Lister().List(labels.Everything())
I don't understand: why not schedule pods based on the Cache?
cc @rookie0080
Ohhh~ This is my oversight. Actually, this was the previous implementation and the function is no longer used. I have now changed it to use the cache, and I will delete this function.
This function is no longer used. I have deleted it.
Just a comment. Otherwise lgtm
pkg/common/store/nodeslices.go
Outdated
var GlobalNodeInfoPlaceHolder = framework.NewNodeInfo()

type NodeSlices struct {
Please move this code to pkg/framework/api/nodeinfo_hashslice.go. The data structure is not bound to the CommonStore mechanism.
Ok, I have moved it to pkg/framework/api/nodeinfo_hashslice.go.
lgtm, please squash the commits cc @rookie0080
I am a participant in the godel-scheduler project in OSPP. I have added the interpodaffinity and podtopologyspread plugins and related test files to godel-scheduler as required.
Considering the overall structure of the godel-scheduler project:
I have completed testing the basic functions locally. If there are any areas that need improvement, I would appreciate your suggestions.