Utilities and workflow helpers for managing LLM-D deployments in Kubernetes. Clone with submodules:

```bash
git clone --recursive https://github.com/LucasWilkinson/llm-d-utils
```
- [Prerequisites](#prerequisites)
- [Initial Setup](#initial-setup)
- [Everyday Commands](#everyday-commands)
- [Benchmark Configuration](#benchmark-configuration)
- [Building Custom vLLM Images](#building-custom-vllm-images)
- [Troubleshooting](#troubleshooting)
## Prerequisites

Make sure the following tools are installed and available in your `PATH`:
- `just` for running the recipes in this repo
- `kubectl` configured for the target cluster
- `helm`
- `stern` for streaming pod logs
- `watch`
- Optional: `fzf` for the nicer interactive pod pickers used by several recipes
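A quick way to confirm everything is on your `PATH` (a shell sketch; `fzf` is optional, so a missing entry there is fine):

```bash
# Report any missing prerequisite; fzf is optional.
for t in just kubectl helm stern watch fzf; do
  command -v "$t" >/dev/null || echo "missing: $t"
done
```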
## Initial Setup

### 1. Create a `.env` file

The Justfile loads environment variables via `set dotenv-load`. Create a `.env` file in the project root with your configuration and secrets:

```
USER_NAME=your-username
HF_TOKEN=your-huggingface-token
GH_TOKEN=your-github-token
QUAY_REPO=your-quay-username
QUAY_ROBOT=buildbot
QUAY_PASSWORD=your-robot-account-token
```
`USER_NAME` is used to generate your namespace: `USER_NAME + "-llm-d-wide-ep"` (defaults to your system username if not set).
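For illustration, the derived namespace behaves roughly like this (a shell sketch; the real logic lives in the Justfile):

```bash
# Sketch of the namespace derivation, not the actual Justfile code.
USER_NAME="${USER_NAME:-$(whoami)}"    # falls back to your system username
NAMESPACE="${USER_NAME}-llm-d-wide-ep"
echo "$NAMESPACE"                      # e.g. alice-llm-d-wide-ep
```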
To get quay.io credentials:

1. Log into quay.io (via SSO)
2. Go to Account Settings → Robot Accounts
3. Create a new robot account (e.g., `buildbot`)
4. Copy the token and use it as `QUAY_PASSWORD`

`QUAY_REPO` should be your quay.io username (not the robot account name). The full robot account name will be constructed as `QUAY_REPO+QUAY_ROBOT`.
**IMPORTANT:** Before building, you must also:

1. Create the repository `llm-d-cuda-dev` in quay.io (can be public or private)
2. Go to the repository → Settings → User and Robot Permissions
3. Add your robot account (`QUAY_REPO+QUAY_ROBOT`) with Write permission

These values are required for the secret creation step below.
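To sanity-check the credentials before wiring them into Kubernetes, you can try a registry login locally (a sketch, assuming `podman` is installed; `docker login` works the same way):

```bash
# Load the .env values, then attempt a quay.io login with the robot account.
set -a; source .env; set +a
podman login -u "${QUAY_REPO}+${QUAY_ROBOT}" -p "${QUAY_PASSWORD}" quay.io
```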
### 2. Point kubectl at your token file

Export the kubeconfig path you received from the platform (example path shown below):

```bash
export KUBECONFIG=~/kubectl-token.txt
```
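Before creating anything, confirm kubectl is pointed at the right cluster:

```bash
kubectl config current-context   # should name the target cluster
kubectl cluster-info             # verifies connectivity and credentials
```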
### 3. Create Kubernetes secrets

Run:

```bash
just create-secrets
```

This will create (or update) the `llm-d-hf-token`, `gh-token-secret`, and `registry-auth` secrets in your namespace using the values from `.env`.
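To confirm the secrets landed, list them in your namespace (the namespace name below is illustrative; yours follows the `USER_NAME + "-llm-d-wide-ep"` pattern):

```bash
# Illustrative namespace; substitute your own.
kubectl get secrets -n your-username-llm-d-wide-ep \
  llm-d-hf-token gh-token-secret registry-auth
```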
### 4. (Optional) Set your kubectl namespace

To avoid specifying `-n {{NAMESPACE}}` manually, update your context with:

```bash
just set-namespace
```
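This is roughly equivalent to the following kubectl invocation (a sketch; the recipe substitutes your derived namespace):

```bash
# Illustrative namespace; yours is USER_NAME + "-llm-d-wide-ep".
kubectl config set-context --current --namespace=your-username-llm-d-wide-ep
```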
### 5. Deploy the workload

Launch the deployment using Kustomize and Helm:

```bash
just start
```

This will:

- Deploy model servers using `kubectl apply -k` (CoreWeave variant)
- Install the InferencePool via Helm (with Istio gateway)
- Deploy the Istio gateway and HTTPRoute

To tear it back down, run `just stop`. This removes the Helm release, model server manifests, and gateway resources.

The deployment uses manifests from `llm-d/guides/wide-ep-lws/manifests/` and values from `llm-d/guides/wide-ep-lws/inferencepool.values.yaml`.

The benchmarking helpers (e.g. `just run-bench`) default to the deployment's model (`deepseek-ai/DeepSeek-R1-0528`). If you change the model, update the `MODEL` variable near the top of the `Justfile` so the generated remote Justfile targets the right endpoint.
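After `just start` completes, you can sanity-check the pieces it deployed (the resource kinds follow from the list above; exact names will vary by cluster):

```bash
kubectl get pods         # model server and gateway pods
helm list                # the InferencePool release
kubectl get httproute    # the HTTPRoute in front of the pool
```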
## Everyday Commands

- `just start`: Deploy the full stack (model servers, InferencePool, gateway) using Kustomize and Helm.
- `just stop`: Tear down the deployment (removes Helm release, model server manifests, and gateway).
- `just restart`: Stop and start the deployment (`just stop && just start`).
- `just update-image TAG`: Update the `decode.yaml` and `prefill.yaml` manifests to use a custom image with the specified tag. Example: `just update-image test-latest-main`
- `just get-pods`: List all pods in the configured namespace.
- `just status`: Watch pod status in real time using `watch -n 2 kubectl get pods`.
- `just describe [name=pod-name]`: Describe a pod. If `name` is omitted, you'll get an interactive picker. Requires `fzf` for fuzzy selection, otherwise falls back to shell `select`.
- `just stern [name=pod-name] [-- <stern flags>]`: Stream logs from pods using stern. With no `name`, you get the interactive picker. Flags after `--` are forwarded to stern (e.g., `just stern -- -c vllm-worker`).
- `just print-gpus`: Show GPU allocation across all cluster nodes, grouped by node and namespace.
- `just cks-nodes`: Display CoreWeave node information (type, link speed, IB speed, reliability, etc.).
- `just start-bench`: Create the benchmark-interactive pod for running benchmarks.
- `just stop-bench`: Delete the benchmark-interactive pod.
- `just restart-bench`: Stop and start the benchmark pod (`just stop-bench && just start-bench`).
- `just interact-bench`: Open an interactive shell in the benchmark pod with the Justfile and scripts copied in.
- `just run-bench NAME [IN_TOKENS] [OUT_TOKENS] [NUM_PROMPTS] [CONCURRENCY_LEVELS]`: Run a benchmark with the specified name and parameters. Parameters are positional. Example: `just run-bench run1 256 1024 8192`. See "Benchmark Configuration" below for details.
- `just cp-results`: Copy the most recent benchmark results from the benchmark pod to `results/<timestamp>` locally.
- `just start-build-pod`: Create the buildah build pod for building custom vLLM images.
- `just stop-build-pod`: Delete the buildah build pod.
- `just build-image VLLM_COMMIT TAG [use_sccache]`: Build a custom vLLM image with the specified commit SHA and tag. `use_sccache` defaults to `true`. Example: `just build-image abc123def my-custom-tag false`
- `just set-namespace`: Update your kubectl context to default to the configured namespace.
- `just create-secrets`: Create or update Kubernetes secrets (HF token, GH token, registry auth) from the `.env` file.
- `just create-registry-auth`: Create or update only the registry authentication secret.
- `just print-results DIR STR`: Grep for a string in benchmark result logs and print sorted results.
- `just print-throughput DIR`: Print output token throughput from benchmark results in a directory.
- `just print-tpot DIR`: Print median time-per-output-token (TPOT) from benchmark results in a directory.
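A typical post-run flow chains the results helpers together (the timestamped directory name below is illustrative; `just cp-results` prints the real one):

```bash
just run-bench run1 256 1024 8192      # run a benchmark
just cp-results                        # copies results to results/<timestamp>
just print-throughput results/2025-01-01_12-00-00
just print-tpot results/2025-01-01_12-00-00
```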
## Benchmark Configuration

`just run-bench` accepts parameters to tune the benchmark payload. Parameters can be passed either positionally or as named arguments.

Positional (recommended):

```bash
just run-bench run1 256 1024 8192
```

Named arguments:

```bash
just run-bench name=run1 in_tokens=256 out_tokens=1024 num_prompts=8192
```

- `name` (required): Benchmark run name for organizing results
- `in_tokens` (default `128`): Prompt length fed to `vllm bench`
- `out_tokens` (default `2048`): Target completion length
- `num_prompts` (default `16384`): Total requests per concurrency level
- `concurrency_levels` (default `'8192 16384 32768'`): Space-separated list of concurrency levels to sweep
These values are forwarded to the benchmark pod as environment variables. You can also invoke the benchmark manually:

```bash
kubectl exec -n NAMESPACE benchmark-interactive -- \
  env INPUT_TOKENS=256 OUTPUT_TOKENS=1024 NUM_PROMPTS=8192 \
  bash /app/run.sh
```
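For example, to sweep only two concurrency levels with a shorter completion target, the named-argument form covers the one parameter the examples above don't show (values here are illustrative):

```bash
just run-bench name=quick in_tokens=256 out_tokens=512 num_prompts=4096 \
  concurrency_levels='4096 8192'
```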
## Building Custom vLLM Images

To build a custom vLLM image with a specific commit:

1. Start the build pod:

   ```bash
   just start-build-pod
   ```

2. Build and push the image:

   ```bash
   just build-image VLLM_COMMIT_SHA TAG
   # Example: just build-image 8ce5d3198d00631a76e1aa02a57947b46bc7218c mtp-enabled
   ```

   This will:

   - Clone the llm-d repository
   - Update the Dockerfile with your specified vLLM commit
   - Build the image using buildah
   - Push to `quay.io/QUAY_REPO/llm-d-cuda-dev:TAG`

3. Update the manifests. Edit `llm-d/guides/wide-ep-lws/manifests/modelserver/base/decode.yaml` and `prefill.yaml` to use your custom image (or see the sed sketch after these steps):

   ```yaml
   image: quay.io/your-repo/llm-d-cuda-dev:your-tag
   ```

4. Clean up the build pod:

   ```bash
   just stop-build-pod
   ```
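If you'd rather not edit by hand, `just update-image TAG` (see Everyday Commands) rewrites both manifests for you; a manual equivalent might look like this (a sketch; the match pattern is assumed from the manifest snippet above):

```bash
# Point both manifests at the custom image (GNU sed in-place edit).
sed -i 's|image: quay.io/.*/llm-d-cuda-dev:.*|image: quay.io/your-repo/llm-d-cuda-dev:your-tag|' \
  llm-d/guides/wide-ep-lws/manifests/modelserver/base/decode.yaml \
  llm-d/guides/wide-ep-lws/manifests/modelserver/base/prefill.yaml
```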
**Note:** The build takes 30-60+ minutes. Monitor progress with:

```bash
kubectl logs -f buildah-build -n your-namespace
```
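Once the push completes, you can confirm the tag exists in the registry without pulling it (a sketch, assuming `skopeo` is installed; add `--creds` for a private repo):

```bash
# Repo and tag are illustrative; substitute your own.
skopeo inspect docker://quay.io/your-quay-username/llm-d-cuda-dev:my-custom-tag
```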
## Troubleshooting

- If `just` reports missing environment variables, double-check your `.env` file and ensure you're running commands from the repository root.
- Kubernetes errors such as `CreateContainerConfigError` usually indicate a missing or misnamed secret; re-run `just create-secrets` after updating `.env`, or inspect the pod events via `just describe name=...`.
- For log streaming issues, ensure `stern` is installed and your kubeconfig points to the correct cluster.
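A typical first-pass debugging loop with the recipes above (the pod name is hypothetical):

```bash
just get-pods                  # find the failing pod
just describe name=decode-0    # check events for secret or image-pull errors
just stern name=decode-0       # stream logs once the container starts
```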
With the setup above you should be able to deploy, inspect, and debug the LLM-D workloads quickly using the provided Just recipes.