[Doc]: Add deploying_with_k8s guide #8451
@@ -0,0 +1,173 @@
.. _deploying_with_k8s:

Deploying with Kubernetes
==========================

Deploying vLLM on Kubernetes is a scalable way to serve machine learning models on GPU nodes. This guide walks through the prerequisites, the deployment steps, and a test request against the deployed server.

Prerequisites
-------------
Before you begin, ensure that you have the following:

- A running Kubernetes cluster
- The NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``), available at https://github.com/NVIDIA/k8s-device-plugin/
- Available GPU resources in your cluster

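One way to check the GPU-related prerequisites is to list how many ``nvidia.com/gpu`` resources each node advertises. The sketch below is not part of the original guide: ``list_gpu_nodes`` is a hypothetical helper name, and it assumes ``kubectl`` is already configured for the target cluster.

```shell
# Hypothetical helper: print each node with its allocatable NVIDIA GPU count.
# Nodes that do not advertise the resource show <none> in the GPU column.
list_gpu_nodes() {
  kubectl get nodes \
    -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
}

# Usage (requires cluster access):
#   list_gpu_nodes
```

If no node reports a GPU count, the device plugin is probably not running on the GPU nodes yet.
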
Deployment Steps
----------------

1. **Create a PVC, Secret and Deployment for vLLM**

Create a Kubernetes manifest that defines a PersistentVolumeClaim (PVC) for the model cache, a Secret holding the Hugging Face token used to pull the model, and the Deployment itself. Here is an example deployment file:

.. code-block:: yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mistral-7b
      namespace: llm
Review comment: Let's just use the default namespace, or create a block of namespace for vLLM.
Reply: Done
    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
      namespace: llm
    type: Opaque
    data:
      token: *******
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: llm
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      revisionHistoryLimit: 1
Review comment: This is extra and we don't need it; let's remove it.
Reply: Done
      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
Review comment: Add a comment saying this is needed by vLLM.
Reply: Done
          containers:
          - name: mistral-7b
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            - name: VLLM_NO_USAGE_STATS
              value: "1"
            - name: DO_NOT_TRACK
              value: "1"
Review comment: Can you comment this out? These are optional, and they really do help us prioritize or deprecate features 🥹
Reply: Removed
            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 6G
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5
    ---
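
Values under a Secret's ``data`` field must be base64-encoded, so the masked ``token`` in the manifest stands for an encoded string rather than the raw Hugging Face token. A minimal sketch of producing that value, assuming a POSIX shell with the ``base64`` utility available; ``encode_token`` is a hypothetical helper name and the token shown is a placeholder:

```shell
# Hypothetical helper: base64-encode a token for a Secret's `data` field.
encode_token() {
  # printf avoids the trailing newline that echo would append to the token;
  # tr strips the line wrapping that some base64 implementations add.
  printf '%s' "$1" | base64 | tr -d '\n'
}

# Placeholder value for illustration only, not a real token.
encode_token "hf_placeholder_token"
```

Alternatively, ``kubectl create secret generic hf-token-secret --from-literal=token=<value> --namespace llm`` creates the Secret and performs the encoding itself.
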

2. **Create a Kubernetes Service for vLLM**

Next, create a Kubernetes Service file to expose the ``mistral-7b`` deployment:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: mistral-7b
      namespace: llm
    spec:
      internalTrafficPolicy: Cluster
      ipFamilies:
      - IPv4
      ipFamilyPolicy: SingleStack
Review comment: Remove these so they are left to default.
      ports:
      - name: http-mistral-7b
        port: 80
        protocol: TCP
        targetPort: 8000
      selector:
        app: mistral-7b
      sessionAffinity: None
Review comment: Add a note saying this is useful for the prefix caching feature.
Reply: Done
      type: ClusterIP

3. **Deploy and Test**

Apply the deployment and service configurations using ``kubectl apply -f <filename>``:

.. code-block:: console

    kubectl apply -f deployment.yaml
    kubectl apply -f service.yaml

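The readiness probe in the Deployment waits 60 seconds before its first check, and the image and model weights must be downloaded first, so the pod can take several minutes to become ready. The following sketch, which is not part of the original guide, waits for the rollout to finish before testing. ``wait_for_vllm`` is a hypothetical helper name, the deployment name and namespace in the usage line match the manifests above, and the 600-second default timeout is an arbitrary assumption.

```shell
# Hypothetical helper: block until the Deployment's pods report ready.
# $1 = deployment name, $2 = namespace, $3 = optional timeout (default 600s).
wait_for_vllm() {
  kubectl rollout status "deployment/$1" \
    --namespace "$2" \
    --timeout="${3:-600s}"
}

# Usage (requires cluster access):
#   wait_for_vllm mistral-7b llm
```
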
Review comment: Do we need port-forward here?
Reply: There is no need to use port forwarding because I am accessing the LLM in the Kubernetes cluster via

To test the deployment, run the following ``curl`` command from inside the cluster, where the Service DNS name resolves:

.. code-block:: console

    curl http://mistral-7b.llm.svc.cluster.local/v1/completions \
        -H "Content-Type: application/json" \
        -d '{
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
        }'

If the service is correctly deployed, you should receive a completion response from the vLLM server.

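The hostname in the test request follows the Kubernetes Service DNS pattern ``<service>.<namespace>.svc.<cluster-domain>``. As a minimal sketch, the helper below builds that base URL. ``service_url`` is a hypothetical name, and the default cluster domain ``cluster.local`` is an assumption that may not hold on every cluster.

```shell
# Hypothetical helper: build the in-cluster base URL for a Service.
# $1 = service name, $2 = namespace, $3 = optional cluster domain.
service_url() {
  printf 'http://%s.%s.svc.%s' "$1" "$2" "${3:-cluster.local}"
}

# Usage from a pod inside the cluster: list the model names the server exposes,
# which are the valid values for the "model" field of a completion request.
#   curl "$(service_url mistral-7b llm)/v1/models"
```

No port is needed in the URL because the Service listens on port 80 and forwards to the container's port 8000.
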
Conclusion
----------
Deploying vLLM with Kubernetes lets you scale and manage GPU-backed model serving. By following the steps above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.
Review comment: Done, update the docs for the PVC & Secret.