
Conversation

@kalantar (Contributor) commented Jun 12, 2025

Replace generation of P/D Deployments with LeaderWorkerSets.

Includes sample msvc and baseconfig files.

Signed-off-by: Michael Kalantar <kalantar@us.ibm.com>
@kalantar kalantar self-assigned this Jun 12, 2025
@kalantar kalantar marked this pull request as draft June 12, 2025 12:57
@kalantar (Contributor, Author) commented:

This PR provides the capability to use a LeaderWorkerSet as an alternative to a Deployment for the P/D pods. It supports simple expression of tensor and data parallelism. Currently, only a data-local parallelism of 1 is supported.

The base LWS configuration comes from https://github.com/tlrmchlsmth/vllm-dp-lws/blob/main/lws.yaml. It was slightly modified to (a) create an explicit leaderTemplate (a copy of the existing workerTemplate), so that the sidecar is added only to the leader pod, and (b) use modelservice template variables for ports and parallelism.
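
For orientation, a minimal sketch of the LeaderWorkerSet shape this baseconfig produces. This is not the actual content of lws-base.yaml; the name, container names, images, and port are placeholders, and the positions where the real baseconfig would use modelservice template variables are marked in comments:

    apiVersion: leaderworkerset.x-k8s.io/v1
    kind: LeaderWorkerSet
    metadata:
      name: example-pd                    # placeholder name
    spec:
      replicas: 1                         # number of P/D nodes
      leaderWorkerTemplate:
        size: 2                           # pods per P/D node; a modelservice template variable for parallelism in practice
        leaderTemplate:                   # explicit copy of the worker template ...
          spec:
            containers:
            - name: vllm
              image: <vllm-image>         # placeholder
              ports:
              - containerPort: 8000       # a modelservice template variable for the port in practice
            - name: routing-sidecar       # ... plus the sidecar, added only to the leader pod
              image: <sidecar-image>      # placeholder
        workerTemplate:
          spec:
            containers:
            - name: vllm                  # worker pods run only the serving container
              image: <vllm-image>         # placeholder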

@kalantar (Contributor, Author) commented:

samples/deepseek/lws-base.yaml is the baseconfig.
samples/deepseek/deepseek-1t1d.yaml contains the modelservice manifest.
samples/deepseek/deepseek-1t1d-manifest.yaml is the resulting manifest to deploy, created by:

go run main.go \
--epp-cluster-role pod-read generate \
-m samples/deepseek/deepseek-1t1d.yaml \
-b samples/deepseek/lws-base.yaml \
| sed 's/^[a-zA-Z]*:/  ---/' \
| sed 's/^  //' \
> samples/deepseek/deepseek-1t1d-manifest.yaml
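
The generated manifest can then be applied to the cluster directly (equivalent to applying the raw GitHub URL used in the instructions below):

kubectl apply -f samples/deepseek/deepseek-1t1d-manifest.yaml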

Signed-off-by: Michael Kalantar <kalantar@us.ibm.com>
@kalantar (Contributor, Author) commented Jun 12, 2025

Serving inference requests in llm-d where each P/D node is deployed over multiple pods.

A project that shows how to host a model with multiple pods per P/D node: https://github.com/tlrmchlsmth/vllm-dp-lws/tree/main

To do this with llm-d, we show the steps to deploy the llm-d inference scheduler. The instructions below use kgateway.

  1. If not already installed, install the Kubernetes Gateway API and Gateway API Inference Extension CRDs:

    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api/releases/download/v1.3.0/standard-install.yaml
    kubectl apply -f https://github.com/kubernetes-sigs/gateway-api-inference-extension/releases/download/v0.3.0/manifests.yaml
  2. Install kgateway
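
    One way to do this is via the kgateway Helm charts. The following is a sketch; verify the chart locations, version, and the inferenceExtension value against the kgateway documentation for your release:

    helm upgrade -i --create-namespace --namespace kgateway-system --version v2.0.0 \
      kgateway-crds oci://cr.kgateway.dev/kgateway-dev/charts/kgateway-crds
    helm upgrade -i --namespace kgateway-system --version v2.0.0 \
      kgateway oci://cr.kgateway.dev/kgateway-dev/charts/kgateway \
      --set inferenceExtension.enabled=true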

  3. Define a ClusterRole for the endpoint picker (llm-d-inference-scheduler, https://github.com/llm-d/llm-d-inference-scheduler):

    cat <<EOF | kubectl apply -f -
    kind: ClusterRole
    apiVersion: rbac.authorization.k8s.io/v1
    metadata:
      name: pod-read
    rules:
    - apiGroups: ["inference.networking.x-k8s.io"]
      resources: ["inferencemodels"]
      verbs: ["get", "watch", "list"]
    - apiGroups: [""]
      resources: ["pods"]
      verbs: ["get", "watch", "list"]
    - apiGroups: ["inference.networking.x-k8s.io"]
      resources: ["inferencepools"]
      verbs: ["get", "watch", "list"]
    - apiGroups: ["discovery.k8s.io"]
      resources: ["endpointslices"]
      verbs: ["get", "watch", "list"]
    - apiGroups:
      - authentication.k8s.io
      resources:
      - tokenreviews
      verbs:
      - create
    - apiGroups:
      - authorization.k8s.io
      resources:
      - subjectaccessreviews
      verbs:
      - create
    EOF
    
  4. Create an inference gateway in the target namespace:

    cat <<EOF | kubectl apply -f -
    apiVersion: gateway.kgateway.dev/v1alpha1
    kind: GatewayParameters
    metadata:
      name: inference-gateway-params
    spec:
      kube:
        service:
          type: ClusterIP
    EOF
    
    cat <<EOF | kubectl apply -f -
    apiVersion: gateway.networking.k8s.io/v1
    kind: Gateway
    metadata:
      name: inference-gateway
    spec:
      gatewayClassName: kgateway
      infrastructure:
        parametersRef:
          group: gateway.kgateway.dev
          kind: GatewayParameters
          name: inference-gateway-params
      listeners:
      - allowedRoutes:
          namespaces:
            from: Same
        name: http
        port: 80
        protocol: HTTP
    EOF
    

    An inference gateway pod should start.
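
    To verify (standard kubectl checks; the gateway pod's exact name and labels depend on kgateway):

    kubectl get gateway inference-gateway
    kubectl get pods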

  5. Define a Secret containing your Hugging Face token:

    kubectl create secret generic hf-secret --from-literal=HF_TOKEN=$HF_TOKEN
    
  6. Create a ConfigMap containing the script used by the pods to install/configure vLLM (an alternative to providing a suitable vLLM image):

    kubectl create configmap vllm-init-scripts-config --from-file=init-vllm.sh
  7. Deploy the deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct model:

  • With tensor parallelism 1 and data parallelism 1 (i.e., an LWS with 1 pod and 1 H200 GPU per pod):

    kubectl apply -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-1t1d-manifest.yaml
  • With tensor parallelism 1 and data parallelism 2 (i.e., an LWS with 2 pods and 1 H200 GPU per pod):

    kubectl apply -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-1t2d-manifest.yaml
  • With tensor parallelism 2 and data parallelism 2 (i.e., an LWS with 2 pods and 2 H200 GPUs per pod):

    kubectl apply -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-2t2d-manifest.yaml

    In each case, only a single P/D node is deployed. kubectl scale can be used to change this.
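
    For example, to scale to two P/D nodes (the LeaderWorkerSet name below is a placeholder; take the actual name from the deployed manifest):

    kubectl get leaderworkersets
    kubectl scale leaderworkerset <lws-name> --replicas=2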

  8. Test:

    kubectl port-forward svc/inference-gateway 8080:80
    curl -vvv localhost:8080/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
      "n": 1,
      "prompt": "In a land far, far away,"
    }'
    curl -vvv localhost:8080/v1/chat/completions  \
    -H "Content-Type: application/json" \
    -d '{
      "model": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
      "n": 1,
      "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
        {"role": "assitant", "content": "2020 World Series was won by the Los Angeles Dodgers."},
        {"role": "user", "content": "How many times have the Dodgers won?"}
      ]
    }'
  9. Cleanup: remove the model objects created in step 7.
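
    For example, for the 1t1d variant, delete the same manifest applied in step 7:

    kubectl delete -f https://raw.githubusercontent.com/llm-d/llm-d-model-service/9272d1ab4ec7a3feca83b3eaa4127dd531a37f90/samples/deepseek/deepseek-1t1d-manifest.yaml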
