
Conversation

@kalantar (Contributor) commented Jun 12, 2025

Replace generation of P/D Deployments with LeaderWorkerSets.

Includes sample msvc and baseconfig files.

Signed-off-by: Michael Kalantar <kalantar@us.ibm.com>
@kalantar kalantar self-assigned this Jun 12, 2025
@kalantar kalantar marked this pull request as draft June 12, 2025 12:57
@kalantar (Contributor, Author) commented:

This PR provides the capability to use a LeaderWorkerSet as an alternative to a Deployment for the P/D pods. It supports simple expression of tensor and data parallelism. Currently, only a data-local parallelism of 1 is supported.

The base LWS configuration comes from https://github.com/tlrmchlsmth/vllm-dp-lws/blob/main/lws.yaml. It was slightly modified to (a) create an explicit leaderTemplate (a copy of the existing workerTemplate), so that the sidecar is added only to the leader pod, and (b) use modelservice template variables for ports and parallelism.
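
For orientation, a minimal sketch of the LeaderWorkerSet shape this baseconfig produces. This is not the actual content of lws-base.yaml; the name, container names, images, and port are placeholders, and the positions where the real baseconfig would use modelservice template variables are marked in comments:

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: example-pd                    # placeholder name
    spec:
      replicas: 1                         # number of P/D nodes
      leaderWorkerTemplate:
        size: 2                           # pods per P/D node; a modelservice template variable for parallelism in practice
        leaderTemplate:                   # explicit copy of the worker template ...
          spec:
            containers:
            - name: vllm
              image: <vllm-image>         # placeholder
              ports:
              - containerPort: 8000       # a modelservice template variable for the port in practice
            - name: routing-sidecar       # ... plus the sidecar, added only to the leader pod
              image: <sidecar-image>      # placeholder
        workerTemplate:
          spec:
            containers:
            - name: vllm                  # worker pods run only the serving container
              image: <vllm-image>         # placeholder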

@kalantar (Contributor, Author) commented:

samples/deepseek/lws-base.yaml is the baseconfig.
samples/deepseek/deepseek-1t1d.yaml contains the modelservice manifest.
samples/deepseek/deepseek-1t1d-manifest.yaml is the resulting manifest to deploy, created by:

go run main.go \
--epp-cluster-role pod-read generate \
-m samples/deepseek/deepseek-1t1d.yaml \
-b samples/deepseek/lws-base.yaml \
| sed 's/^[a-zA-Z]*:/  ---/' \
| sed 's/^  //' \
> samples/deepseek/deepseek-1t1d-manifest.yaml
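
The generated manifest can then be applied to the cluster directly (equivalent to applying the raw GitHub URL used in the instructions below):

kubectl apply -f samples/deepseek/deepseek-1t1d-manifest.yaml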

Signed-off-by: Michael Kalantar <kalantar@us.ibm.com>
@kalantar (Contributor, Author) commented Jun 12, 2025

Serving inference requests in llm-d where each P/D node is deployed over multiple pods.

A project that shows how to host a model with multiple pods per P/D node: https://github.com/tlrmchlsmth/vllm-dp-lws/tree/main

To do this with llm-d, we show the steps to deploy the llm-d inference scheduler. The instructions below use kgateway.

  1. If not already installed, install the Kubernetes Gateway API and Gateway API Inference Extension CRDs:

    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml
  2. Install kgateway
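
    One way to do this is via the kgateway Helm charts. The following is a sketch; verify the chart locations, version, and the inferenceExtension value against the kgateway documentation for your release:

    helm upgrade -i --create-namespace --namespace kgateway-system --version v2.0.0 \
      kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
    helm upgrade -i --namespace kgateway-system --version v2.0.0 \
      kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
      --set inferenceExtension.enabled=true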

  3. Define a ClusterRole for the endpoint picker (llm-d-inference-scheduler, https://github.com/llm-d/llm-d-inference-scheduler):

    cat <<EOF | kubectl apply -f -
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: pod-read
    rules:
    - apiGroups: ["inference.networking.x-k8s.io"]
      resources: ["inferencemodels"]
      verbs: ["get", "watch", "list"]
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "watch", "list"]
    - apiGroups: ["inference.networking.x-k8s.io"]
      resources: ["inferencepools"]
      verbs: ["get", "watch", "list"]
    - apiGroups: ["discovery.k8s.io"]
      resources: ["endpointslices"]
      verbs: ["get", "watch", "list"]
    - apiGroups:
      - authentication.k8s.io
      resources:
      - tokenreviews
      verbs:
      - create
    - apiGroups:
      - authorization.k8s.io
      resources:
      - subjectaccessreviews
      verbs:
      - create
    EOF
    
  4. Create an inference gateway in the target namespace:

    cat <<EOF | kubectl apply -f -
    apiVersion: gateway.kgateway.dev/v1alpha1
    kind: GatewayParameters
    metadata:
      name: inference-gateway-params
    spec:
      kube:
        service:
          type: ClusterIP
    EOF
    
    cat <<EOF | kubectl apply -f -
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: kgateway
      infrastructure:
        parametersRef:
          group: gateway.kgateway.dev
          kind: GatewayParameters
          name: inference-gateway-params
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    EOF
    

    An inference gateway pod should start.
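
    To verify (standard kubectl checks; the gateway pod's exact name and labels depend on kgateway):

    kubectl get gateway inference-gateway
    kubectl get pods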

  5. Define a Secret containing your Hugging Face token:

    kubectl create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN
    
  6. Create a ConfigMap containing the script used by the pods to install/configure vLLM (an alternative to providing a suitable vLLM image):

    kubectl create configmap vllm-init-scripts-config --from-file=init-vllm.sh
  7. Deploy the deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct model:

  • With tensor parallelism 1 and data parallelism 1 (i.e., an LWS with 1 pod and 1 H200 GPU per pod):

    kubectl apply -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-1t1d-manifest.yaml
  • With tensor parallelism 1 and data parallelism 2 (i.e., an LWS with 2 pods and 1 H200 GPU per pod):

    kubectl apply -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-1t2d-manifest.yaml
  • With tensor parallelism 2 and data parallelism 2 (i.e., an LWS with 2 pods and 2 H200 GPUs per pod):

    kubectl apply -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-2t2d-manifest.yaml

    In each case, only a single P/D node is deployed. kubectl scale can be used to change this.
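
    For example, to scale to two P/D nodes (the LeaderWorkerSet name below is a placeholder; take the actual name from the deployed manifest):

    kubectl get leaderworkersets
    kubectl scale leaderworkerset <lws-name> --replicas=2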

  8. Test:

    kubectl port-forward svc/inference-gateway 8080:80
    curl -vvv localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
      "n": 1,
      "prompt": "In a land far, far away,"
    }'
    curl -vvv localhost:8080/v1/chat/completions  \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
      "n": 1,
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assitant", "content": "2020 World Series was won by the Los Angeles Dodgers."},
        {"role": "user", "content": "How many times have the Dodgers won?"}
      ]
    }'
  9. Cleanup: remove the model objects created in step 7.
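
    For example, for the 1t1d variant, delete the same manifest applied in step 7:

    kubectl delete -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-1t1d-manifest.yaml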
