Skip to content

Conversation

nirrozenbaum
Copy link
Contributor

@nirrozenbaum nirrozenbaum commented Jun 19, 2025

This PR make minor change to scheduler and ProfileHandler and delegates the decision on whether the Schedule call should fail or not when a single profile run fails.
more background about this PR:
in llm-d we have two configured profiles - Prefill and Decode.
if decode profile fails, we would like to fail the request. on the other hand, if prefill profile run fails, we would like to serve the request by decode only. this could happen if there is no available prefill pod at the time the request is sent.

More generally, since profiles and plugins are extensible and pluggable, it makes sense to delegate the decision on whether failing profile should fail the whole request or not to ProcessResults.

in the case where profile fails, it's profile result will be nil.

…alue

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
Copy link

netlify bot commented Jun 19, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit fa2a92b
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68541a180016c700081e5927
😎 Deploy Preview https://deploy-preview-1013--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot requested review from danehans and Jeffwan June 19, 2025 11:46
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 19, 2025
Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
@nirrozenbaum
Copy link
Contributor Author

cc @kfswain

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
// It may aggregate results, log test profile outputs, or apply custom logic. It specifies in the SchedulingResult the
// key of the primary profile that should be used to get the request selected destination.
ProcessResults(ctx context.Context, request *types.LLMRequest, profileResults map[string]*types.ProfileRunResult) *types.SchedulingResult
// When a profile run fails, its result in the profileResults map is nil.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

confirming: it will be an empty entry (key + nil), not no entry (no key at all)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

right. if prefill for example fail we will have an entry in the map "prefill" -> nil

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want SchedulingResult to carry an error so that erroring out is explicit?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that depends on the plugin implementation. whoever implements ProfileHandler can decide to filter out from the SchedulingResult the profiles that failed, or alternatively leave them with nil for being explicit. both are possible.

In GIE there is a single profile. having a failure means we cannot schedule the request. in different scenarios like llm-d PD it’s not always the case (see description in the PR intro)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, I meant changing ProfileRunResult to include the error parameter instead of assuming that there is an error when it is nil.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

as part of the scheduling new design, one of the first changes that were made was to remove the error return value from all extensions of a profile. that is - filter, scorer and picker - none of them returns an error.

the only case where an error is returned from a Profile run is when no pods left available after the filter phase.
see here:

func (p *SchedulerProfile) Run(ctx context.Context, request *types.LLMRequest, cycleState *types.CycleState, podsSnapshot []types.Pod) (*types.ProfileRunResult, error) {
pods := p.runFilterPlugins(ctx, request, cycleState, podsSnapshot)
if len(pods) == 0 {
return nil, errutil.Error{Code: errutil.Internal, Msg: "no pods available for the given request"}
}
// if we got here, there is at least one pod to score
weightedScorePerPod := p.runScorerPlugins(ctx, request, cycleState, pods)
result := p.runPickerPlugin(ctx, cycleState, weightedScorePerPod)
p.runPostCyclePlugins(ctx, cycleState, result)
return result, nil
}

additionally, we can notice that when that happens, the returned ProfileRunResult is nil.

in the scheduler itself, the code in this PR is implemented to include the nil result when error happens.
see here:

for name, profile := range profiles {
// run the selected profiles and collect results (current code runs all profiles)
profileRunResult, err := profile.Run(ctx, request, cycleState, podsSnapshot)
if err != nil {
loggerDebug.Info("failed to run scheduler profile", "profile", name, "error", err.Error())
}
profileRunResults[name] = profileRunResult // if profile failed to run, the run result is nil
}

I feel comfortable with leaving it as is, but if you have a strong opinion about adding explicit error field to ProfileRunResult we can do that. just don't think it's necessary, not atm at least.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good

break
}

if profileResults[singleProfileName] == nil { // there was an error while running the profile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, it seems that you are special casing the first profile, even when there are multiple.
Consider raising the error only when there's a single profile result and it is nil (i.e., special case on if len(profileResults) == 1 { ... })?

Copy link
Contributor Author

@nirrozenbaum nirrozenbaum Jun 19, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this profile handler is intended to be used for a single profile, as the name suggests SingleProfileHandler.
added validation that it includes a single profile.

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
@ahg-g
Copy link
Contributor

ahg-g commented Jun 19, 2025

/approve
/lgtm
/hold

Holding in case we want to place the error inside SchedulingResult.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 19, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahg-g, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 19, 2025
@nirrozenbaum
Copy link
Contributor Author

Holding in case we want to place the error inside SchedulingResult.

after this PR - ProcessResult returns (*SchedulingResult, error).
I think this is more aligned with how error is usually returned in go and more specifically in k8s code rather than returning error inside SchedulingResult.

/unhold

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot merged commit e29fa4b into kubernetes-sigs:main Jun 19, 2025
9 checks passed
@nirrozenbaum nirrozenbaum deleted the profile-error-handling branch June 19, 2025 19:46
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Jun 26, 2025
…etes-sigs#1013)

* profile handler ProcessResult returns an error as additional return value

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* minor update

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* fixed log

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* validate that single profile handler process only single profile

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

---------

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
rlakhtakia pushed a commit to rlakhtakia/gateway-api-inference-extension that referenced this pull request Jun 26, 2025
…etes-sigs#1013)

* profile handler ProcessResult returns an error as additional return value

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* minor update

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* fixed log

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* validate that single profile handler process only single profile

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

---------

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
EyalPazz pushed a commit to EyalPazz/gateway-api-inference-extension that referenced this pull request Jul 9, 2025
…etes-sigs#1013)

* profile handler ProcessResult returns an error as additional return value

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* minor update

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* fixed log

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

* validate that single profile handler process only single profile

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>

---------

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants