Skip to content

Conversation

nirrozenbaum
Copy link
Contributor

This PR adds the PreRequest extension point to requestcontrol layer.
the registered PreRequest plugins will be invoked after a successful result was received from the scheduling layer (that is, a successful SchedulingResult). the extension allows wiring up multi profile results using the request properties (e.g., Headers that will be later added to the request using the generateHeaders helper func).

More specifically, this is the enabler for clean coding of the Prefill/Decode wiring in llm-d.

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 17, 2025
@k8s-ci-robot k8s-ci-robot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jun 17, 2025
Copy link

netlify bot commented Jun 17, 2025

Deploy Preview for gateway-api-inference-extension ready!

Name Link
🔨 Latest commit 0182be4
🔍 Latest deploy log https://app.netlify.com/projects/gateway-api-inference-extension/deploys/68527ccba532720008b93698
😎 Deploy Preview https://deploy-preview-1004--gateway-api-inference-extension.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify project configuration.

@nirrozenbaum
Copy link
Contributor Author

nirrozenbaum commented Jun 17, 2025

cc @kfswain @liu-cong @kfirtoledo

targetPort := int(pool.Spec.TargetPortNumber)

endpoint := targetPod.Address + ":" + strconv.Itoa(int(pool.Spec.TargetPortNumber))
endpoint := net.JoinHostPort(targetPod.Address, strconv.Itoa(targetPort))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

}

// PostDispatch populates the RequestContext based on scheduling results.
func (d *Director) PostDispatch(ctx context.Context, reqCtx *handlers.RequestContext, result *schedulingtypes.SchedulingResult) (*handlers.RequestContext, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We discussed renaming PostDispatch -> PreRequest, this PR seems reasonable to do that.

Copy link
Contributor Author

@nirrozenbaum nirrozenbaum Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the function names including comments.
the use of the term Dispatch is now removed completely.
we discussed it in the past, that this terminology might be confusing for a reader cause the request is not dispatched at this point in the code.

the new function names are:

  • admitRequest - handles admission control to decide whether or not to accept the request
    based on the request criticality and system saturation state.
  • no more Dispatch function. instead just call Scheduler.Schedule (there was no logic in Dispatch other than that).
  • prepareRequest - populates the RequestContext and calls the registered PreRequest plugins
    for allowing plugging customized logic based on the scheduling results.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM!

@kfswain
Copy link
Collaborator

kfswain commented Jun 17, 2025

Overall LGTM, just some naming changes that I think belong in this PR. Thanks!

// before a request is sent to the selected model server.
type PreRequest interface {
plugins.Plugin
PreRequest(ctx context.Context, request *types.LLMRequest, schedulingResult *types.SchedulingResult, targetPort int)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we pass targetPod as an argument instead of schedulingResult. Given the multi-profile scheduler architecture, it's unclear if schedulingResult is the result of one profile or end result.

Copy link
Contributor Author

@nirrozenbaum nirrozenbaum Jun 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

with the new scheduler design, there are two different results under the scheduling package.
we have:

  • ProfileRunResult - which represents a single profile run result.
  • SchedulingResult - which is a map from profile name to it's ProfileRunResult + a field that specifies the primary profile that should be used in the destination header.

to your question - passing SchedulingResult and not the targetPod is intentional. PreRequest extension point is exactly the place where we can make sense of the multi profile results.
For example, in llm-d and PD use case, this is the place where we wire prefill selected endpoint(s) in a dedicated header when returning the decode selection. This is the way we wire P + D selected endpoints.
example can be found here:
https://github.com/llm-d/llm-d-inference-scheduler/blob/0c49737834fc9f2b5213447437ac4815b1d5a98c/pkg/plugins/pre-request/pd_prerequest.go#L33-L37

to summarize, we need to keep SchedulingResult here and not only targetPod. Otherwise, there is no place to make sense of the results of the profiles other than the primary one.
if one wants to get the targetPod, it can be done same as was done here:

targetPod := result.ProfileResults[result.PrimaryProfileName].TargetPod.GetPod()

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, this makes sense

Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
Signed-off-by: Nir Rozenbaum <nirro@il.ibm.com>
@k8s-ci-robot k8s-ci-robot added size/L Denotes a PR that changes 100-499 lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jun 18, 2025
@liu-cong
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jun 18, 2025
@kfswain
Copy link
Collaborator

kfswain commented Jun 18, 2025

/approve

@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: kfswain, nirrozenbaum

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 18, 2025
@k8s-ci-robot k8s-ci-robot merged commit f66be2d into kubernetes-sigs:main Jun 18, 2025
9 checks passed
@nirrozenbaum nirrozenbaum deleted the pre-request-plugin branch June 19, 2025 14:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants