Merged
4 changes: 2 additions & 2 deletions pkg/epp/common/config/configloader_test.go
@@ -515,8 +515,8 @@ func (p *testProfileHandler) Pick(ctx context.Context, request *types.LLMRequest
 	return nil
 }
 
-func (p *testProfileHandler) ProcessResults(ctx context.Context, request *types.LLMRequest, profileResults map[string]*types.ProfileRunResult) *types.SchedulingResult {
-	return nil
+func (p *testProfileHandler) ProcessResults(ctx context.Context, request *types.LLMRequest, profileResults map[string]*types.ProfileRunResult) (*types.SchedulingResult, error) {
+	return nil, nil
 }
 
 func registerTestPlugins() {
5 changes: 3 additions & 2 deletions pkg/epp/scheduling/framework/plugins.go
@@ -40,10 +40,11 @@ type ProfileHandler interface {
 	// and the previously executed SchedluderProfile cycles along with their results.
 	Pick(ctx context.Context, request *types.LLMRequest, profiles map[string]*SchedulerProfile, profileResults map[string]*types.ProfileRunResult) map[string]*SchedulerProfile
 
-	// ProcessResults handles the outcome of the profile runs after all profiles ran succuessfully.
+	// ProcessResults handles the outcome of the profile runs after all profiles ran.
 	// It may aggregate results, log test profile outputs, or apply custom logic. It specifies in the SchedulingResult the
 	// key of the primary profile that should be used to get the request selected destination.
-	ProcessResults(ctx context.Context, request *types.LLMRequest, profileResults map[string]*types.ProfileRunResult) *types.SchedulingResult
+	// When a profile run fails, its result in the profileResults map is nil.
Contributor:
Confirming: it will be an entry with a nil value (key + nil), not a missing entry (no key at all)?

Contributor Author:
Right. If, for example, the prefill profile fails, we will have an entry in the map: "prefill" -> nil.

Contributor:
Do we want SchedulingResult to carry an error so that erroring out is explicit?

Contributor Author:
That depends on the plugin implementation. Whoever implements ProfileHandler can decide to filter the failed profiles out of the SchedulingResult, or alternatively leave them as nil to be explicit. Both are possible.

In GIE there is a single profile, so a failure means we cannot schedule the request. In different scenarios, like llm-d PD, that is not always the case (see the description in the PR intro).
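
For illustration only, here is a minimal sketch of what such a multi-profile handler could look like; the pdProfileHandler type, the "prefill"/"decode" profile names, the fallback policy, and the types import path are assumptions made for the sketch and are not part of this PR:

// Sketch only: a hypothetical P/D-style handler that tolerates a failed prefill run.
package example

import (
	"context"
	"errors"

	// assumed import path for the scheduling types package
	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/types"
)

type pdProfileHandler struct{} // hypothetical handler, not defined in this PR

func (h *pdProfileHandler) ProcessResults(_ context.Context, _ *types.LLMRequest,
	profileResults map[string]*types.ProfileRunResult) (*types.SchedulingResult, error) {
	if profileResults["decode"] == nil { // decode is mandatory; a nil entry means its run failed
		return nil, errors.New("decode profile failed, cannot schedule the request")
	}
	if profileResults["prefill"] == nil { // prefill is optional; drop the failed entry
		delete(profileResults, "prefill")
	}
	return &types.SchedulingResult{
		ProfileResults:     profileResults,
		PrimaryProfileName: "decode",
	}, nil
}

Under that assumed policy, a nil "prefill" entry is simply dropped, while a nil "decode" entry fails the request, which mirrors the behavior discussed above.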

Contributor:
Sorry, I meant changing ProfileRunResult to include an error field, instead of assuming there is an error when the result is nil.

Contributor Author:
As part of the new scheduling design, one of the first changes made was to remove the error return value from all extensions of a profile: filter, scorer, and picker no longer return an error.

The only case where an error is returned from a profile run is when no pods are left available after the filter phase. See here:

func (p *SchedulerProfile) Run(ctx context.Context, request *types.LLMRequest, cycleState *types.CycleState, podsSnapshot []types.Pod) (*types.ProfileRunResult, error) {
	pods := p.runFilterPlugins(ctx, request, cycleState, podsSnapshot)
	if len(pods) == 0 {
		return nil, errutil.Error{Code: errutil.Internal, Msg: "no pods available for the given request"}
	}
	// if we got here, there is at least one pod to score
	weightedScorePerPod := p.runScorerPlugins(ctx, request, cycleState, pods)
	result := p.runPickerPlugin(ctx, cycleState, weightedScorePerPod)
	p.runPostCyclePlugins(ctx, cycleState, result)
	return result, nil
}

Additionally, note that when that happens, the returned ProfileRunResult is nil.

In the scheduler itself, the code in this PR includes the nil result when an error happens. See here:

for name, profile := range profiles {
	// run the selected profiles and collect results (current code runs all profiles)
	profileRunResult, err := profile.Run(ctx, request, cycleState, podsSnapshot)
	if err != nil {
		loggerDebug.Info("failed to run scheduler profile", "profile", name, "error", err.Error())
	}
	profileRunResults[name] = profileRunResult // if profile failed to run, the run result is nil
}

I feel comfortable leaving it as is, but if you have a strong opinion about adding an explicit error field to ProfileRunResult, we can do that. I just don't think it's necessary, at least not at the moment.

Contributor:
Sounds good

+	ProcessResults(ctx context.Context, request *types.LLMRequest, profileResults map[string]*types.ProfileRunResult) (*types.SchedulingResult, error)
 }
 
 // Filter defines the interface for filtering a list of pods based on context.
@@ -19,6 +19,8 @@ package profile
 import (
 	"context"
 	"encoding/json"
+	"errors"
+	"fmt"
 
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/plugins"
 	"sigs.k8s.io/gateway-api-inference-extension/pkg/epp/scheduling/framework"
@@ -58,15 +60,28 @@ func (h *SingleProfileHandler) Pick(_ context.Context, request *types.LLMRequest
 	return profiles
 }
 
-func (h *SingleProfileHandler) ProcessResults(_ context.Context, _ *types.LLMRequest, profileResults map[string]*types.ProfileRunResult) *types.SchedulingResult {
-	var firstKey string
-	for key := range profileResults {
-		firstKey = key
+// ProcessResults handles the outcome of the profile runs after all profiles ran.
+// It may aggregate results, log test profile outputs, or apply custom logic. It specifies in the SchedulingResult the
+// key of the primary profile that should be used to get the request selected destination.
+// When a profile run fails, its result in the profileResults map is nil.
+func (h *SingleProfileHandler) ProcessResults(_ context.Context, _ *types.LLMRequest,
+	profileResults map[string]*types.ProfileRunResult) (*types.SchedulingResult, error) {
+	if len(profileResults) != 1 {
+		return nil, errors.New("single profile handler is intended to be used with a single profile, failed to process multiple profiles")
+	}
+
+	var singleProfileName string
+	for profileName := range profileResults {
+		singleProfileName = profileName
 		break
 	}
+
+	if profileResults[singleProfileName] == nil { // there was an error while running the profile
Contributor:
IIUC, it seems that you are special casing the first profile, even when there are multiple.
Consider raising the error only when there's a single profile result and it is nil (i.e., special case on if len(profileResults) == 1 { ... })?

Contributor Author (@nirrozenbaum, Jun 19, 2025):
This profile handler is intended to be used with a single profile, as the name SingleProfileHandler suggests. I added validation that it receives exactly one profile result.

+		return nil, fmt.Errorf("failed to run scheduler profile '%s'", singleProfileName)
+	}
+
 	return &types.SchedulingResult{
 		ProfileResults:     profileResults,
-		PrimaryProfileName: firstKey,
-	}
+		PrimaryProfileName: singleProfileName,
+	}, nil
 }
8 changes: 4 additions & 4 deletions pkg/epp/scheduling/scheduler.go
@@ -123,10 +123,10 @@ func (s *Scheduler) Schedule(ctx context.Context, request *types.LLMRequest) (*t
 			// run the selected profiles and collect results (current code runs all profiles)
 			profileRunResult, err := profile.Run(ctx, request, cycleState, podsSnapshot)
 			if err != nil {
-				return nil, fmt.Errorf("failed to run all required scheduling profiles - %w", err)
+				loggerDebug.Info("failed to run scheduler profile", "profile", name, "error", err.Error())
 			}
 
-			profileRunResults[name] = profileRunResult
+			profileRunResults[name] = profileRunResult // if profile failed to run, the run result is nil
 		}
 	}

@@ -135,8 +135,8 @@ func (s *Scheduler) Schedule(ctx context.Context, request *types.LLMRequest) (*t
 	}
 
 	before := time.Now()
-	result := s.profileHandler.ProcessResults(ctx, request, profileRunResults)
+	result, err := s.profileHandler.ProcessResults(ctx, request, profileRunResults)
 	metrics.RecordSchedulerPluginProcessingLatency(framework.ProcessProfilesResultsType, s.profileHandler.Name(), time.Since(before))
 
-	return result, nil
+	return result, err
 }