
Conversation

@qi-min (Contributor) commented Dec 11, 2025

What type of PR is this?

feature

What this PR does / why we need it:

Add a new scheduler that executes fast-path scheduling for workloads such as AI agents. The scheduler focuses on fast, sharding-based scheduling. This PR includes the basic functionality to schedule agent workloads.
For more details please refer to: proposal | design doc

Which issue(s) this PR fixes:

Fixes #4722

Special notes for your reviewer:

Does this PR introduce a user-facing change?


@volcano-sh-bot volcano-sh-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 11, 2025
@volcano-sh-bot volcano-sh-bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Dec 11, 2025
@gemini-code-assist

Summary of Changes

Hello @qi-min, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a specialized vc-agent-scheduler to Volcano, aiming to optimize the scheduling of high-throughput AI agent workloads through a fast-path, sharding-based approach. It integrates a new NodeShard Custom Resource Definition for managing dedicated node groups and refines the scheduling queue with advanced priority handling. The changes span across build configurations, Helm chart integration, and a new plugin-based architecture, significantly extending Volcano's capabilities for diverse scheduling demands.

Highlights

  • New Agent Scheduler Introduction: A new vc-agent-scheduler component has been introduced, specifically designed for fast-path scheduling of workloads like AI agents, focusing on sharding-based scheduling.
  • Build System and Deployment Integration: The Makefile has been updated to include targets for building the new vc-agent-scheduler binary and its Docker image. The Volcano Helm chart now includes templates and configurations to deploy this new scheduler.
  • New Custom Resource Definition (CRD): A NodeShard CRD (nodeshards.shard.volcano.sh) has been added to manage collections of nodes dedicated to specific schedulers, enabling more granular control over resource allocation (a hedged sketch of what such a type could look like follows this list).
  • Enhanced Scheduling Queue: The agent scheduler incorporates a modified Kubernetes-derived scheduling queue, featuring priority handling for binding conflicts and removal of NominatedNode logic for streamlined operation.
  • Plugin Framework for Agent Scheduler: A dedicated plugin framework has been established for the agent scheduler, including predicates and nodeorder plugins, allowing for extensible and customizable scheduling policies.
  • Node Information and Snapshot Improvements: The NodeInfo structure in pkg/scheduler/api now includes Generation and BindGeneration fields to facilitate conflict checking during binding. The snapshot mechanism has been enhanced to include Volcano's NodeInfo and support incremental updates.
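
The NodeShard API shape itself is not shown in this conversation. As a rough orientation only, here is a minimal sketch of what the Go types for such a CRD might look like; every field name is an illustrative assumption, not the actual nodeshards.shard.volcano.sh schema:

package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// NodeShard is a hypothetical sketch of a CRD that dedicates a set of nodes
// to a particular scheduler instance. All fields are assumptions.
type NodeShard struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   NodeShardSpec   `json:"spec,omitempty"`
	Status NodeShardStatus `json:"status,omitempty"`
}

// NodeShardSpec describes which scheduler owns the shard and how its nodes are selected.
type NodeShardSpec struct {
	SchedulerName string            `json:"schedulerName,omitempty"`
	NodeSelector  map[string]string `json:"nodeSelector,omitempty"`
}

// NodeShardStatus reports the node names currently assigned to the shard.
type NodeShardStatus struct {
	Nodes []string `json:"nodes,omitempty"`
}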

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a new agent scheduler designed for fast-path scheduling of workloads like AI agents. This is a significant feature addition that includes a new scheduler binary, associated CRDs like NodeShard, and a conflict-aware binding mechanism to improve scheduling reliability. The implementation thoughtfully reuses and adapts components from the existing Volcano scheduler framework. Additionally, the PR refactors the predicates and nodeorder plugins to better align with the upstream Kubernetes scheduler framework, which is a commendable improvement for long-term maintainability. My review identified a couple of minor issues, including an ungraceful panic on initialization error and a minor copy-paste error in a file header. Overall, this is a well-structured and substantial contribution towards enhancing Volcano's scheduling capabilities.

Comment on lines 25 to 49
"volcano.sh/apis/pkg/apis/helpers"
"volcano.sh/volcano/cmd/agent-scheduler/app/options"
scheduler "volcano.sh/volcano/pkg/agentscheduler"
"volcano.sh/volcano/pkg/agentscheduler/metrics"
"volcano.sh/volcano/pkg/kube"
"volcano.sh/volcano/pkg/signals"
commonutil "volcano.sh/volcano/pkg/util"

v1 "k8s.io/api/core/v1"
"k8s.io/apimachinery/pkg/util/uuid"
clientset "k8s.io/client-go/kubernetes"
"k8s.io/client-go/kubernetes/scheme"
corev1 "k8s.io/client-go/kubernetes/typed/core/v1"
"k8s.io/klog/v2"

utilfeature "k8s.io/apiserver/pkg/util/feature"

restclient "k8s.io/client-go/rest"
"k8s.io/client-go/tools/leaderelection"
"k8s.io/client-go/tools/leaderelection/resourcelock"
"k8s.io/client-go/tools/record"
basecompatibility "k8s.io/component-base/compatibility"

// Register rest client metrics
_ "k8s.io/component-base/metrics/prometheus/restclient"
Member

nit: group the Go standard library imports first, third-party packages second, and the volcano packages last

Contributor Author

sure, I changed the order

)

type nodeOrderPlugin struct {
type NodeOrderPlugin struct {
Member

Is this just a refactor, or are there other changes? Can you point them out?

Contributor Author

The changes under scheduler/plugins are just a refactor, such as abstracting functions so that they can be used by both the batch scheduler and the agent scheduler.
The modification on line 69 is to allow this plugin to be reused in the agent-scheduler package.

Member

This seems to be a refactor; if there are any other changes, please point them out to simplify review.

Contributor Author

The change in predicates is a refactor, to allow the functions to be used in both the agent scheduler and the batch scheduler.

node, found := nodeMap[nodeName]
if !found {
klog.Errorf("predicates, update pod %s/%s allocate from NOT EXIST node [%s]", pod.Namespace, pod.Name, nodeName)
node, err := pp.Handle.SnapshotSharedLister().NodeInfos().Get(nodeName)
Member

I am concerned about this change: previously it got nodes from the session, now it gets them from the informer list.

Will it bring some caveats?

}
}

cache := schedcache.New(config, opt.SchedulerNames, opt.NodeSelector, opt.NodeWorkerThreads, opt.ResyncPeriod)
Member

We don't need multiple SchedulerNames.

Contributor Author

OK, changed it to use a single name.

worker := &Worker{}
worker.framework = framework.NewFramework(sched.actions, sched.tiers, sched.cache, sched.configurations)
index := i
go wait.Until(func() { worker.runOnce(index) }, 0, stopCh)
Member

When there are no pods to schedule, wouldn't it run into a dead loop?

Contributor Author

When there is no pod to schedule, runOnce blocks while popping a pod from the queue until a new pending pod is enqueued.
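
A minimal sketch of that blocking behavior, with a toy queue standing in for the real scheduling queue (all names here are illustrative, not the PR's actual types):

package main

import (
	"context"
	"fmt"
)

// queue is a stand-in for the worker's pending-pod queue: Pop blocks until a pod arrives.
type queue struct {
	pods chan string
}

// Pop waits on the channel, so an idle worker parks here instead of spinning in a busy loop.
func (q *queue) Pop(ctx context.Context) (string, error) {
	select {
	case p := <-q.pods:
		return p, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// runOnce performs one scheduling cycle; with no pending pod it simply blocks in Pop.
func runOnce(ctx context.Context, q *queue) {
	pod, err := q.Pop(ctx)
	if err != nil {
		return
	}
	fmt.Println("scheduling", pod)
}

func main() {
	q := &queue{pods: make(chan string, 1)}
	q.pods <- "default/agent-pod-0"
	runOnce(context.Background(), q)
}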

Member

You mean the queue is a blocking read; that's OK.

task := scheduleResult.SchedCtx.Task
// 1. Check conflict
node := binder.FindNonConflictingNode(scheduleResult)
if node == nil {
Member

How many pods are scheduled at once, and when could node == nil?

Contributor Author

One pod is scheduled by each worker at a time. The scheduleResult (candidate nodes, 3 by default) is sent from the worker to the binder. FindNonConflictingNode checks whether these nodes conflict with previous bindings: if the node's version has not been used in a previous binding, the node is used for binding; otherwise it is rejected. Nil is returned if all candidate nodes are rejected.

E.g. the candidate nodes are n1(v1), n2(v1), n3(v2). If nodes with these versions were used in previous bindings, all candidates are dropped and nil is returned. If the previous binding on n3 was at version v1, then n3 is used.
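
A minimal sketch of that version-based conflict check; the types and the function below are illustrative assumptions about the idea, not the PR's actual FindNonConflictingNode:

package main

import "fmt"

// candidate pairs a node name with the node generation (version) the scheduling decision saw.
type candidate struct {
	name       string
	generation int64
}

// findNonConflictingNode returns the first candidate whose observed generation is newer than
// the generation already consumed by a previous binding on that node; it returns "" when every
// candidate is stale, which corresponds to the nil case discussed above.
func findNonConflictingNode(cands []candidate, lastBoundGeneration map[string]int64) string {
	for _, c := range cands {
		if g, ok := lastBoundGeneration[c.name]; ok && g >= c.generation {
			continue // a binding already used this (or a newer) view of the node: reject
		}
		return c.name
	}
	return ""
}

func main() {
	// Previous bindings consumed n1, n2 and n3 at version v1.
	bound := map[string]int64{"n1": 1, "n2": 1, "n3": 1}
	// Candidates n1(v1) and n2(v1) are stale; n3(v2) was scheduled on a newer view and is accepted.
	cands := []candidate{{"n1", 1}, {"n2", 1}, {"n3", 2}}
	fmt.Println(findNonConflictingNode(cands, bound)) // n3
}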

default:
}
if len(binder.BindCheckChannel) == 0 {
break
Member

I am also wondering about the CPU usage when there is no pod that needs to be bound.

Contributor Author

The function is updated; it now blocks if no result is sent to the channel.

return
}

binder.schedulingQueue.Done(task.Pod.UID) // Mark previous processing as done to clean in-flight records, avoid memory leak
Member

Done should be called paired with Pop; otherwise it is very easy to miss calling Done.

Member

Although retries based on events are not yet implemented, if we call Done immediately after Pop and scheduling later fails, the pod will no longer be in the in-flight list:

inFlightPod, ok := aq.inFlightPods[pInfo.Pod.UID]
if !ok {
	return nil, fmt.Errorf("in flight Pod isn't found in the scheduling queue. If you see this error log, it's likely a bug in the scheduler")
}
var events []*clusterEvent
for event := inFlightPod.Next(); event != nil; event = event.Next() {
	e, ok := event.Value.(*clusterEvent)
	if !ok {
		// Must be another in-flight Pod (*v1.Pod). Can be ignored.
		continue
	}
	events = append(events, e)
}
return events, nil

Event changes that occur during the scheduling of this pod will prevent the pod from retrying quickly.
Currently, after Pop, Done is handled without omission: when scheduling fails, Done is called in AddUnschedulableIfNotPresent, and Done is also called when CheckAndBind is reached after successful scheduling, so we didn't miss it.
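
A minimal sketch of the Pop/Done pairing being described, with a toy queue; the two completion paths are stand-ins for AddUnschedulableIfNotPresent and CheckAndBind, not the actual code:

package main

import "fmt"

// trackingQueue records in-flight pods between Pop and Done, like the in-flight list above.
type trackingQueue struct {
	inFlight map[string]bool
}

func (q *trackingQueue) Pop(uid string)  { q.inFlight[uid] = true }
func (q *trackingQueue) Done(uid string) { delete(q.inFlight, uid) }

// scheduleOne shows that every Pop ends in exactly one Done, on either outcome.
func scheduleOne(q *trackingQueue, uid string, ok bool) {
	q.Pop(uid)
	if !ok {
		q.Done(uid) // failure path: mirrors the Done call inside AddUnschedulableIfNotPresent
		return
	}
	q.Done(uid) // success path: mirrors the Done call reached via CheckAndBind
}

func main() {
	q := &trackingQueue{inFlight: map[string]bool{}}
	scheduleOne(q, "pod-a", true)
	scheduleOne(q, "pod-b", false)
	fmt.Println("in-flight records left:", len(q.inFlight)) // 0, nothing leaks
}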

Member

What is the difference with pkg/scheduler/cache/cache.go? Document it and add a TODO: can we merge them or abstract the common parts?

Contributor Author

In the agent scheduler we reuse informers, nodeinfo, nodeList, and Binder from the scheduler cache, and also add/modify several fields and functions, like the conflict-aware binder and the scheduling queue. I will add a TODO to abstract the common parts of the agent/batch schedulers.

_ "volcano.sh/volcano/pkg/agentscheduler/plugins"

// init assert
_ "volcano.sh/volcano/pkg/scheduler/util/assert"
Member

Remove; this should only occur in tests.

Contributor Author

done

)

// ServerOption is the main context object for the controller manager.
type ServerOption struct {
Member

This is mostly copied from the scheduler, and RegisterOptions has some coupling with the scheduler (it sets ServerOpts.EnableCSIStorage).

Could we eliminate that?

Contributor Author

The plugins in the batch scheduler depend on options.ServerOpts, but ServerOpts is a global variable at the cmd/scheduler/app/options level, so it is not initialized with flag values when the agent scheduler starts up. RegisterOptions passes options to options.ServerOpts so that it can pick up the values from the agent scheduler's flags.
In the next PR for sharing, the logic is changed to pass all options of the agent scheduler to the batch scheduler, so there is no need to specially set EnableCSIStorage:
voptions.ServerOpts = ServerOpts.ServerOption


commonutil.LeaderElectionDefault(&s.LeaderElection)
s.LeaderElection.ResourceName = commonutil.GenerateComponentName(s.SchedulerNames)
componentbaseoptions.BindLeaderElectionFlags(&s.LeaderElection, fs)
Member

I would prefer moving the leader-election-related code to app.Run.

}

// Align default feature-gates with the connected cluster's version.
if err := setupComponentGlobals(config); err != nil {
Member

What does this do?

Contributor Author

It seems to be used to discover the kube server version and set feature-gate defaults to match the cluster.
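
For orientation, a minimal sketch of what "discover the server version" could look like; the Discovery call is real client-go API, but the feature-gate alignment is only noted in a comment because the PR's actual mechanism isn't shown here:

package app

import (
	"fmt"

	"k8s.io/client-go/kubernetes"
	restclient "k8s.io/client-go/rest"
)

// setupComponentGlobals queries the connected cluster's version; aligning feature-gate
// defaults to that version (via component-base compatibility helpers) is omitted here.
func setupComponentGlobals(config *restclient.Config) error {
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		return err
	}
	info, err := client.Discovery().ServerVersion()
	if err != nil {
		return err
	}
	fmt.Printf("connected to kube-apiserver %s.%s\n", info.Major, info.Minor)
	return nil
}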

return err
}

metrics.InitKubeSchedulerRelatedMetrics()
Member

The backoff queue depends on the metrics, but I don't think we ever update them.

Contributor Author

Yes, the scheduling queue we imported under the third-party path depends on the metrics; a nil error is reported if the metrics are not initialized. I think we need a TODO to decouple the kube metrics and record metrics.

Contributor Author

Just found that the batch scheduler also started calling InitKubeSchedulerRelatedMetrics to init metrics a few weeks ago, for #4731.

// Register registers or updates a binder for the given plugin name . The plugin can be such as preBinder or postBinder.
// It always overwrites the existing binder map to support plugin configuration updates
// during runtime, as plugins may be reconfigured without restarting the scheduler.
func (r *BinderRegistry) Register(name string, binder interface{}) {
Member

Note: this is not called, so volume binding check is not supported

Contributor Author

It is called in OnPluginInit in the predicates plugin.
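
A minimal sketch of how such a registration from plugin init could look; the registry matches the snippet above, while the plugin and binder names are hypothetical:

package plugins

// BinderRegistry matches the snippet above: it maps a plugin name to its binder.
type BinderRegistry struct {
	binders map[string]interface{}
}

// Register registers or updates the binder for a plugin, overwriting any existing entry.
func (r *BinderRegistry) Register(name string, binder interface{}) {
	r.binders[name] = binder
}

// volumePreBinder is a hypothetical pre-binder that would perform volume binding checks.
type volumePreBinder struct{}

// predicatesPlugin is a hypothetical plugin showing where Register is called: during plugin init.
type predicatesPlugin struct{}

func (pp *predicatesPlugin) OnPluginInit(registry *BinderRegistry) {
	registry.Register("predicates", &volumePreBinder{})
}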

Comment on lines +32 to +35
OnActionInit(configurations []conf.Configuration)

// Initialize initializes the allocator plugins.
Initialize()
Member

can merge

Contributor Author

@hzxuzhonghu what need be merged?

Member

I mean OnActionInit and Initialize can be merged.

conf.EnabledActionMap[action.Name()] = true
}

schedCtx, err := worker.generateNextSchedulingContext()
Member

You can add a for loop here instead of depending on wait.Until(func() { worker.runOnce(index) }, 0, stopCh) to do it at function granularity, which can be less efficient and produce confusing logs.

Contributor Author

good suggestion, I will change it
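
A minimal sketch contrasting the two startup styles the reviewer is referring to (the worker type and stop handling are illustrative, not the PR's exact code):

package app

import (
	"k8s.io/apimachinery/pkg/util/wait"
)

type worker struct{}

// runOnce performs a single scheduling cycle: pop a pod, schedule it, hand it to the binder.
func (w *worker) runOnce(index int) {}

// startWithWaitUntil re-enters runOnce from the outside after every cycle.
func startWithWaitUntil(w *worker, index int, stopCh <-chan struct{}) {
	go wait.Until(func() { w.runOnce(index) }, 0, stopCh)
}

// startWithLoop keeps the loop inside one goroutine, as suggested, so stop handling is
// explicit and there is no per-cycle re-entry through wait.Until.
func startWithLoop(w *worker, index int, stopCh <-chan struct{}) {
	go func() {
		for {
			select {
			case <-stopCh:
				return
			default:
			}
			w.runOnce(index)
		}
	}()
}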


SchedCtx: schedCtx,
BindContext: alloc.CreateBindContext(schedCtx),
}
alloc.EnqueueSchedulerResultForTask(result)
Member

Seems we need a clearer name, like sendToBinder.

Contributor Author

done

@qi-min qi-min changed the title [WIP]Create a new Agent scheduler for Agent workload fast scheduling Create a new Agent scheduler for Agent workload fast scheduling Dec 17, 2025
@volcano-sh-bot volcano-sh-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Dec 17, 2025
agentapi "volcano.sh/volcano/pkg/agentscheduler/api"
"volcano.sh/volcano/pkg/agentscheduler/framework"
"volcano.sh/volcano/pkg/scheduler/api"
vcache "volcano.sh/volcano/pkg/scheduler/cache"
Member

The agentscheduler replicates the implementations of cache and framework, but it also references the cache and framework under the scheduler directory here. What are the principles for distinguishing between the cache and framework components in these two directories?

Contributor Author

Some structure definitions are reused from the batch scheduler, which is why scheduler/cache is also imported. We need to abstract the common parts of the cache for the batch/agent schedulers, so that only the common cache is referenced in the future. I will add a TODO.

Member

It is recommended to open a separate issue to track this optimization.


PrePredicateFns map[string]api.PrePredicateFn
NodeOrderFns map[string]api.NodeOrderFn
BatchNodeOrderFns map[string]api.BatchNodeOrderFn
NodeMapFns map[string]api.NodeMapFn
Member

Do the two callback functions NodeMapFns and NodeReduceFns have specific scenario requirements in the agent-scheduler context, or are they merely intended to align with the existing scheduling logic?

Contributor Author

Just to align with current scheduling logic so that the existing plugins can be reused


// Framework manages the scheduler plugins and their execution points.
type Framework struct {
*k8sutil.Framework // Embedding Framework to implement k8sframework.Handle interface
Member

The k8sutil package belongs to the common directory under scheduler/plugin, and it is currently intended to be referenced by multiple algorithm plugins. Is it appropriate for the current agentscheduler to reference it directly?

Contributor Author

*k8sutil.Framework defines some new fields and functions that are also needed in the agent scheduler; that's why it is referenced.
snapshot *Snapshot

Member

This is not a good way to depend on each other. Normally, plugins should be able to reference the cache and framework packages, but the framework should not depend on the plugin packages in return. It is recommended to register a separate issue for rectification and optimization.


sched.tiers = tiers
sched.configurations = configurations
sched.metricsConf = metricsConf
sched.mutex.Unlock()
Member

Would it be better to use defer sched.mutex.Unlock() here, to avoid the risk of deadlock as more code is added in the future?

Contributor Author

OK
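
A minimal sketch of the agreed pattern, with a reduced struct (field types are placeholders, not the PR's actual Scheduler type):

package app

import "sync"

type scheduler struct {
	mutex          sync.Mutex
	tiers          []string
	configurations []string
}

// loadConf uses defer so the mutex is released even if code added later returns early or panics.
func (s *scheduler) loadConf(tiers, configurations []string) {
	s.mutex.Lock()
	defer s.mutex.Unlock()

	s.tiers = tiers
	s.configurations = configurations
}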

0YHR0 and others added 7 commits December 18, 2025 16:52
Signed-off-by: Haoran <97868579@qq.com>
Signed-off-by: handan-yxh <yxh953167434@163.com>
… deploy

Signed-off-by: qi-min <qim_34@163.com>
Signed-off-by: JesseStutler <chenzicong4@huawei.com>
Signed-off-by: handan-yxh <yxh953167434@163.com>
Signed-off-by: qi-min <qim_34@163.com>
@qi-min qi-min force-pushed the agent-scheduler-development branch from 394f15d to f14eca0 Compare December 18, 2025 08:54
@qi-min (Contributor Author) commented Dec 18, 2025

@hzxuzhonghu @wangyang0616 review comments have been handled, would you please review again, thanks.

@qi-min qi-min force-pushed the agent-scheduler-development branch from f14eca0 to 74efa74 Compare December 19, 2025 01:51
Signed-off-by: JesseStutler <chenzicong4@huawei.com>
@JesseStutler JesseStutler force-pushed the agent-scheduler-development branch from 74efa74 to 6363d12 Compare December 19, 2025 08:35
@wangyang0616 (Member)

/lgtm

@volcano-sh-bot volcano-sh-bot added the lgtm Indicates that a PR is ready to be merged. label Dec 19, 2025
@wangyang0616 (Member) commented Dec 19, 2025

Please confirm the closure status of the review comments. @hzxuzhonghu

@volcano-sh-bot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: hzxuzhonghu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@volcano-sh-bot volcano-sh-bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Dec 22, 2025
@volcano-sh-bot volcano-sh-bot merged commit 49be4a2 into volcano-sh:master Dec 22, 2025
22 of 23 checks passed