[Doc]: Add deploying_with_k8s guide #8451

Merged
merged 2 commits into from
Oct 7, 2024
Changes from 1 commit
1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -79,6 +79,7 @@ Documentation

serving/openai_compatible_server
serving/deploying_with_docker
serving/deploying_with_k8s
serving/distributed_serving
serving/metrics
serving/env_vars
173 changes: 173 additions & 0 deletions docs/source/serving/deploying_with_k8s.rst
@@ -0,0 +1,173 @@
.. _deploying_with_k8s:

Deploying with Kubernetes
==========================

Using Kubernetes to deploy vLLM is a scalable and efficient way to serve machine learning models. This guide will walk you through the process of deploying vLLM with Kubernetes, including the necessary prerequisites, steps for deployment, and testing.

Prerequisites
-------------
Before you begin, ensure that you have the following:

- A running Kubernetes cluster
- NVIDIA Kubernetes Device Plugin (``k8s-device-plugin``), available at https://github.com/NVIDIA/k8s-device-plugin/
- Available GPU resources in your cluster

Deployment Steps
----------------

1. **Create a PVC, Secret, and Deployment for vLLM**

Create a Kubernetes manifest that defines a PersistentVolumeClaim (PVC) for the Hugging Face model cache, a Secret that holds the Hugging Face token, and the Deployment itself. Here is an example deployment file:
Review comment (Collaborator):

  • PVC should be optional? After all, the model can be downloaded. Please say it is optional but recommended.
  • Secret is optional as well. Only for models that are gated.

Reply (Contributor Author):

Done, updated the docs for the PVC & Secret.


.. code-block:: yaml

    apiVersion: v1
    kind: PersistentVolumeClaim
    metadata:
      name: mistral-7b
      namespace: llm
Review comment (Collaborator):

Let's just use the default namespace, or create a block of namespace for vLLM

Reply (Contributor Author):

Done

    spec:
      accessModes:
      - ReadWriteOnce
      resources:
        requests:
          storage: 50Gi
      storageClassName: default
      volumeMode: Filesystem
    ---
    apiVersion: v1
    kind: Secret
    metadata:
      name: hf-token-secret
      namespace: llm
    type: Opaque
    data:
      token: *******
    ---
    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: mistral-7b
      namespace: llm
      labels:
        app: mistral-7b
    spec:
      replicas: 1
      revisionHistoryLimit: 1
Review comment (Collaborator):

Suggested change
revisionHistoryLimit: 1

this is extra and we don't need this, let's remove this

Reply (Contributor Author):

Done

      selector:
        matchLabels:
          app: mistral-7b
      template:
        metadata:
          labels:
            app: mistral-7b
        spec:
          volumes:
          - name: cache-volume
            persistentVolumeClaim:
              claimName: mistral-7b
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "2Gi"
Review comment (Collaborator):

Add a comment saying this is needed by vLLM

Reply (Contributor Author):

Done

          containers:
          - name: mistral-7b
            image: vllm/vllm-openai:latest
            command: ["/bin/sh", "-c"]
            args: [
              "vllm serve mistralai/Mistral-7B-Instruct-v0.3 --trust-remote-code --enable-chunked-prefill --max_num_batched_tokens 1024"
            ]
            env:
            - name: HUGGING_FACE_HUB_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-token-secret
                  key: token
            - name: VLLM_NO_USAGE_STATS
              value: "1"
            - name: DO_NOT_TRACK
              value: "1"
Review comment (Collaborator):

can you comment this out, these are optional, since they really do help us prioritize or deprecate features 🥹

Reply (Contributor Author):

Removed

            ports:
            - containerPort: 8000
            resources:
              limits:
                cpu: "10"
                memory: 20G
                nvidia.com/gpu: "1"
              requests:
                cpu: "2"
                memory: 6G
                nvidia.com/gpu: "1"
            volumeMounts:
            - mountPath: /root/.cache/huggingface
              name: cache-volume
            - name: shm
              mountPath: /dev/shm
            livenessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 10
            readinessProbe:
              httpGet:
                path: /health
                port: 8000
              initialDelaySeconds: 60
              periodSeconds: 5
    ---
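
Values under a Secret's ``data`` key must be base64-encoded (plain text can instead go under ``stringData``). As a minimal sketch of producing the value for the ``token`` key, assuming a helper named ``encode_secret_value`` and a made-up placeholder token, neither of which is part of vLLM or Kubernetes:

```python
import base64


def encode_secret_value(raw_token: str) -> str:
    """Return the base64 text that Kubernetes expects under a Secret's `data` key."""
    return base64.b64encode(raw_token.encode("utf-8")).decode("ascii")


if __name__ == "__main__":
    # "hf_example_token" is a hypothetical placeholder, not a real token.
    print(encode_secret_value("hf_example_token"))
```

Paste the printed string as the ``token`` value in the Secret. Base64 is an encoding, not encryption, so the manifest should still be treated as sensitive.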

2. **Create a Kubernetes Service for vLLM**

Next, create a Kubernetes Service file to expose the ``mistral-7b`` deployment:

.. code-block:: yaml

    apiVersion: v1
    kind: Service
    metadata:
      name: mistral-7b
      namespace: llm
    spec:
      internalTrafficPolicy: Cluster
      ipFamilies:
      - IPv4
      ipFamilyPolicy: SingleStack
Review comment (Collaborator):

remove these so they are left to default

      ports:
      - name: http-mistral-7b
        port: 80
        protocol: TCP
        targetPort: 8000
      selector:
        app: mistral-7b
      sessionAffinity: None
Review comment (Collaborator):

add a note saying this is useful for prefix caching feature

Reply (Contributor Author):

Done

      type: ClusterIP
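
A ClusterIP Service is reachable from inside the cluster under a DNS name built from its name and namespace. The sketch below shows how that name is assembled. It assumes the default cluster domain ``cluster.local`` (clusters can be configured with a different one), and ``service_url`` is a hypothetical helper written for this guide:

```python
def service_url(name: str, namespace: str, port: int = 80, path: str = "/") -> str:
    """Build the in-cluster URL for a Service.

    Assumes the default cluster domain `cluster.local`.
    """
    host = f"{name}.{namespace}.svc.cluster.local"
    # Port 80 is the HTTP default, so it can be left out of the URL.
    netloc = host if port == 80 else f"{host}:{port}"
    return f"http://{netloc}{path}"
```

For the Service above, ``service_url("mistral-7b", "llm", path="/v1/completions")`` addresses Service port 80, which Kubernetes forwards to container port 8000 through ``targetPort``.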

3. **Deploy and Test**

Apply the deployment and service configurations using ``kubectl apply -f <filename>``:

.. code-block:: console

    kubectl apply -f deployment.yaml
    kubectl apply -f service.yaml
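
The pod can take a while to become ready, since the probes above only start after 60 seconds and the model has to load first. One way to wait from a client inside the cluster is to poll the same ``/health`` endpoint the probes use. This is a sketch only: ``http_ok`` and ``wait_until_healthy`` are hypothetical helpers, and the attempt count and delay are arbitrary assumptions.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, or a non-2xx status.
        return False


def wait_until_healthy(check: Callable[[], bool], attempts: int = 30, delay: float = 10.0) -> bool:
    """Call `check` up to `attempts` times, sleeping `delay` seconds between tries."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


# Example call, to be run from a pod inside the cluster:
# wait_until_healthy(lambda: http_ok("http://mistral-7b.llm.svc.cluster.local/health"))
```

Passing the check as a callable keeps the retry loop independent of the HTTP client, so it can be exercised without a running cluster.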

Review comment (Collaborator):

do we need port-forward here?

Reply (Contributor Author):

There is no need to use port forwarding because I am accessing the LLM in the Kubernetes cluster via mistral-7b.llm.svc.cluster.local

To test the deployment, run the following ``curl`` command:

.. code-block:: console

    curl http://mistral-7b.llm.svc.cluster.local/v1/completions \
      -H "Content-Type: application/json" \
      -d '{
            "model": "mistralai/Mistral-7B-Instruct-v0.3",
            "prompt": "San Francisco is a",
            "max_tokens": 7,
            "temperature": 0
          }'

If the service is correctly deployed, you should receive a response from the vLLM model.
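
The same request can be built from Python with only the standard library. This is a rough sketch, not an official vLLM client: ``build_completion_request`` and ``first_completion_text`` are hypothetical helper names, and the ``model`` value is assumed to match the model passed to ``vllm serve`` in the Deployment.

```python
import json
import urllib.request


def build_completion_request(base_url: str, model: str, prompt: str,
                             max_tokens: int = 7) -> urllib.request.Request:
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def first_completion_text(response_body: bytes) -> str:
    """Extract the generated text from an OpenAI-style completions response."""
    return json.loads(response_body)["choices"][0]["text"]


# Example call, to be run from a pod inside the cluster:
# req = build_completion_request(
#     "http://mistral-7b.llm.svc.cluster.local",
#     "mistralai/Mistral-7B-Instruct-v0.3",
#     "San Francisco is a",
# )
# with urllib.request.urlopen(req) as resp:
#     print(first_completion_text(resp.read()))
```

Separating request construction from sending makes it easy to inspect the payload before it reaches the server.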

Conclusion
----------
Deploying vLLM with Kubernetes allows for efficient scaling and management of ML models leveraging GPU resources. By following the steps outlined above, you should be able to set up and test a vLLM deployment within your Kubernetes cluster. If you encounter any issues or have suggestions, please feel free to contribute to the documentation.