profile handler ProcessResult returns additional return value #1013

nirrozenbaum · 2025-06-19T11:46:15Z

This PR makes a minor change to the scheduler and ProfileHandler: it delegates to the ProfileHandler the decision on whether the Schedule call should fail when a single profile run fails.
More background about this PR:
In llm-d we have two configured profiles, Prefill and Decode.
If the decode profile fails, we would like to fail the request. On the other hand, if the prefill profile run fails, we would like to serve the request by decode only. This could happen if no prefill pod is available at the time the request is sent.

More generally, since profiles and plugins are extensible and pluggable, it makes sense to delegate to ProcessResults the decision on whether a failing profile should fail the whole request.

When a profile fails, its profile result will be nil.
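
To make the prefill/decode case concrete, here is a minimal sketch of how a profile handler might apply that policy. The types, field names, profile names, and the `processResults` function are simplified, hypothetical stand-ins for illustration; they are not the actual llm-d or GIE code.

```go
package main

import (
	"errors"
	"fmt"
)

// ProfileRunResult and SchedulingResult are simplified, hypothetical stand-ins
// for the scheduling types; the real definitions may differ.
type ProfileRunResult struct {
	TargetPod string
}

type SchedulingResult struct {
	ProfileResults     map[string]*ProfileRunResult
	PrimaryProfileName string
}

const (
	prefillProfile = "prefill" // hypothetical profile names
	decodeProfile  = "decode"
)

// processResults decides whether a failed profile run fails the whole request.
// A nil entry in profileResults means that profile failed to run.
func processResults(profileResults map[string]*ProfileRunResult) (*SchedulingResult, error) {
	decode := profileResults[decodeProfile]
	if decode == nil {
		// Decode is mandatory: without it the request cannot be served.
		return nil, errors.New("decode profile failed to run")
	}
	// Prefill is optional: drop it when it failed and serve by decode only.
	results := map[string]*ProfileRunResult{decodeProfile: decode}
	if prefill := profileResults[prefillProfile]; prefill != nil {
		results[prefillProfile] = prefill
	}
	return &SchedulingResult{ProfileResults: results, PrimaryProfileName: decodeProfile}, nil
}

func main() {
	// Prefill failed (nil entry), decode succeeded: the request is still served.
	res, err := processResults(map[string]*ProfileRunResult{
		prefillProfile: nil,
		decodeProfile:  {TargetPod: "decode-pod-0"},
	})
	fmt.Println(res.PrimaryProfileName, len(res.ProfileResults), err)

	// Decode failed: the whole request fails.
	_, err = processResults(map[string]*ProfileRunResult{
		prefillProfile: {TargetPod: "prefill-pod-0"},
		decodeProfile:  nil,
	})
	fmt.Println(err)
}
```

This sketch filters the failed prefill profile out of the result; a handler could equally leave the nil entry in place.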

…alue Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

netlify · 2025-06-19T11:46:20Z

✅ Deploy Preview for gateway-api-inference-extension ready!

Name	Link
🔨 Latest commit	`fa2a92b`
🔍 Latest deploy log	https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68541a180016c700081e5927
😎 Deploy Preview	https://deploy-preview-1013--gateway-api-inference-extension.netlify.app


nirrozenbaum · 2025-06-19T11:54:21Z

cc @kfswain


elevran · 2025-06-19T12:15:52Z

pkg/epp/scheduling/framework/plugins.go

 	// It may aggregate results, log test profile outputs, or apply custom logic. It specifies in the SchedulingResult the
 	// key of the primary profile that should be used to get the request selected destination.
-	ProcessResults(ctx context.Context, request *types.LLMRequest, profileResults map[string]*types.ProfileRunResult) *types.SchedulingResult
+	// When a profile run fails, its result in the profileResults map is nil.


Confirming: it will be an empty entry (key + nil), not a missing entry (no key at all)?

Right. If prefill, for example, fails, the map will contain the entry "prefill" -> nil.
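
The distinction between a nil entry and a missing key can be checked with Go's comma-ok map lookup. The following is a small sketch; `ProfileRunResult` is a simplified stand-in and `profileStatus` is a hypothetical helper, not part of the PR.

```go
package main

import "fmt"

// ProfileRunResult is a simplified, hypothetical stand-in for the real type.
type ProfileRunResult struct {
	TargetPod string
}

// profileStatus reports how a profile appears in the results map:
// absent key, key with a nil result (failed run), or key with a result.
func profileStatus(results map[string]*ProfileRunResult, name string) string {
	result, ok := results[name]
	switch {
	case !ok:
		return "not run"
	case result == nil:
		return "failed"
	default:
		return "succeeded"
	}
}

func main() {
	results := map[string]*ProfileRunResult{
		"prefill": nil, // the profile ran and failed
		"decode":  {TargetPod: "decode-pod-0"},
	}
	for _, name := range []string{"prefill", "decode", "other"} {
		fmt.Printf("%s: %s\n", name, profileStatus(results, name))
	}
}
```

A plain `results[name] == nil` check cannot tell the first two cases apart, which is why the key is kept in the map for failed profiles.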

Do we want SchedulingResult to carry an error so that erroring out is explicit?

That depends on the plugin implementation. Whoever implements ProfileHandler can decide to filter the failed profiles out of the SchedulingResult, or alternatively leave them as nil to be explicit. Both are possible.

In GIE there is a single profile, so a failure means we cannot schedule the request. In other scenarios, like llm-d PD, that is not always the case (see the description in the PR intro).

Sorry, I meant changing ProfileRunResult to include the error parameter instead of assuming that there is an error when it is nil.

As part of the new scheduling design, one of the first changes was to remove the error return value from all extensions of a profile. That is, none of filter, scorer and picker returns an error.

The only case where an error is returned from a Profile run is when no pods are left available after the filter phase.
See here:

gateway-api-inference-extension/pkg/epp/scheduling/framework/scheduler_profile.go

Lines 109 to 122 in e29fa4b

func (p *SchedulerProfile) Run(ctx context.Context, request *types.LLMRequest, cycleState *types.CycleState, podsSnapshot []types.Pod) (*types.ProfileRunResult, error) {
	pods := p.runFilterPlugins(ctx, request, cycleState, podsSnapshot)
	if len(pods) == 0 {
		return nil, errutil.Error{Code: errutil.Internal, Msg: "no pods available for the given request"}
	}
	// if we got here, there is at least one pod to score
	weightedScorePerPod := p.runScorerPlugins(ctx, request, cycleState, pods)

	result := p.runPickerPlugin(ctx, cycleState, weightedScorePerPod)

	p.runPostCyclePlugins(ctx, cycleState, result)

	return result, nil
}

Additionally, note that when that happens, the returned ProfileRunResult is nil.

In the scheduler itself, the code in this PR includes the nil result when an error happens.
See here:

gateway-api-inference-extension/pkg/epp/scheduling/scheduler.go

Lines 122 to 130 in e29fa4b

for name, profile := range profiles {
	// run the selected profiles and collect results (current code runs all profiles)
	profileRunResult, err := profile.Run(ctx, request, cycleState, podsSnapshot)
	if err != nil {
		loggerDebug.Info("failed to run scheduler profile", "profile", name, "error", err.Error())
	}

	profileRunResults[name] = profileRunResult // if profile failed to run, the run result is nil
}

I feel comfortable leaving it as is, but if you have a strong opinion about adding an explicit error field to ProfileRunResult, we can do that. I just don't think it's necessary, at least not at the moment.

Sounds good

elevran · 2025-06-19T12:19:25Z

pkg/epp/scheduling/framework/plugins/profile/single_profile_handler.go

 		break
 	}

+	if profileResults[singleProfileName] == nil { // there was an error while running the profile


IIUC, it seems that you are special-casing the first profile, even when there are multiple.
Consider raising the error only when there's a single profile result and it is nil (i.e., special-case on if len(profileResults) == 1 { ... })?

This profile handler is intended to be used with a single profile, as the name SingleProfileHandler suggests.
Added validation that it includes a single profile.
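
A rough sketch of what such validation could look like follows. The types and the `processSingleProfile` function are simplified, hypothetical stand-ins, and the error messages are invented for illustration; the actual SingleProfileHandler code in the PR may differ.

```go
package main

import "fmt"

// Simplified, hypothetical stand-ins for the real scheduling types.
type ProfileRunResult struct {
	TargetPod string
}

type SchedulingResult struct {
	ProfileResults     map[string]*ProfileRunResult
	PrimaryProfileName string
}

// processSingleProfile rejects anything other than exactly one profile result,
// then fails the request if that single profile failed to run (nil result).
func processSingleProfile(profileResults map[string]*ProfileRunResult) (*SchedulingResult, error) {
	if len(profileResults) != 1 {
		return nil, fmt.Errorf("single profile handler expects exactly one profile result, got %d", len(profileResults))
	}

	// The map holds exactly one key; take it.
	var singleProfileName string
	for name := range profileResults {
		singleProfileName = name
		break
	}

	if profileResults[singleProfileName] == nil { // there was an error while running the profile
		return nil, fmt.Errorf("failed to run scheduler profile %q", singleProfileName)
	}

	return &SchedulingResult{
		ProfileResults:     profileResults,
		PrimaryProfileName: singleProfileName,
	}, nil
}

func main() {
	res, err := processSingleProfile(map[string]*ProfileRunResult{
		"default": {TargetPod: "pod-0"},
	})
	fmt.Println(res.PrimaryProfileName, err)

	_, err = processSingleProfile(map[string]*ProfileRunResult{"default": nil})
	fmt.Println(err)
}
```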


ahg-g · 2025-06-19T18:29:53Z

/approve
/lgtm
/hold

Holding in case we want to place the error inside SchedulingResult.

k8s-ci-robot · 2025-06-19T18:30:00Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [ahg-g]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

nirrozenbaum · 2025-06-19T18:48:11Z

Holding in case we want to place the error inside SchedulingResult.

After this PR, ProcessResults returns (*SchedulingResult, error).
I think this is more aligned with how errors are usually returned in Go, and more specifically in k8s code, than returning the error inside SchedulingResult.
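
For illustration, the sketch below shows how a caller might consume the two return values and propagate the error in the usual Go way. Everything in it (`schedule`, `runFunc`, `processFunc`, `strictProcess`, and the simplified types) is hypothetical and only approximates the flow discussed in this PR.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified, hypothetical stand-ins for the real scheduling types.
type ProfileRunResult struct {
	TargetPod string
}

type SchedulingResult struct {
	ProfileResults map[string]*ProfileRunResult
}

// runFunc stands in for a profile's Run method.
type runFunc func() (*ProfileRunResult, error)

// processFunc stands in for ProfileHandler.ProcessResults after this PR.
type processFunc func(map[string]*ProfileRunResult) (*SchedulingResult, error)

// schedule runs every profile, records nil for failed runs, and lets the
// handler decide whether the failures should fail the whole call.
func schedule(profiles map[string]runFunc, process processFunc) (*SchedulingResult, error) {
	results := make(map[string]*ProfileRunResult, len(profiles))
	for name, run := range profiles {
		result, err := run()
		if err != nil {
			fmt.Printf("failed to run scheduler profile %q: %v\n", name, err)
		}
		results[name] = result // nil when the profile failed to run
	}

	schedulingResult, err := process(results)
	if err != nil {
		return nil, fmt.Errorf("failed to process profile results: %w", err)
	}
	return schedulingResult, nil
}

// Example profile runs and a handler that treats any failed profile as fatal.
func failingRun() (*ProfileRunResult, error) {
	return nil, errors.New("no pods available for the given request")
}

func succeedingRun() (*ProfileRunResult, error) {
	return &ProfileRunResult{TargetPod: "pod-0"}, nil
}

func strictProcess(results map[string]*ProfileRunResult) (*SchedulingResult, error) {
	for name, result := range results {
		if result == nil {
			return nil, fmt.Errorf("profile %q failed to run", name)
		}
	}
	return &SchedulingResult{ProfileResults: results}, nil
}

func main() {
	_, err := schedule(map[string]runFunc{"default": failingRun}, strictProcess)
	fmt.Println("error:", err)

	res, err := schedule(map[string]runFunc{"default": succeedingRun}, strictProcess)
	fmt.Println(res.ProfileResults["default"].TargetPod, err)
}
```

Returning the error as a second value keeps the failure path visible at the call site, instead of requiring callers to inspect a field inside SchedulingResult.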

/unhold


profile handler ProcessResult returns an error as additional return v…

603f718

…alue Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 19, 2025

k8s-ci-robot requested review from danehans and Jeffwan June 19, 2025 11:46

k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 19, 2025

minor update

e968ab6

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

fixed log

96b3fd6

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

elevran reviewed Jun 19, 2025


validate that single profile handler process only single profile

fa2a92b

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2025

k8s-ci-robot assigned ahg-g Jun 19, 2025

k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2025

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2025

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2025

k8s-ci-robot merged commit e29fa4b into kubernetes-sigs:main Jun 19, 2025
9 checks passed

nirrozenbaum deleted the profile-error-handling branch June 19, 2025 19:46
