AI inference: demonstrate in-cluster storage of models #575

justinsb wants to merge 1 commit into kubernetes:master
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: justinsb. The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files. Approvers can indicate their approval by writing `/approve` in a comment.
Force-pushed from 287ae85 to 97a98d7
This example demonstrates how we can serve models from inside the cluster, without needing to bake them into the container images or to rely on pulling them from services like Hugging Face. In the future we may also want to support storing models in GCS or S3, but this example focuses on storing models without cloud dependencies. We may also want to investigate serving models from container images, particularly given the upcoming support for mounting container images as volumes, but the approach here works today and allows for more dynamic model loading (e.g. loading new models without restarting pods). Moreover, a container image server is backed by a blob server, as introduced here.
Force-pushed from 97a98d7 to e7a7cac
Heavily inspired by @seans3's work in the vllm-deployment example! And now with a README (with similar inspiration). It looks like we aren't checking copyright headers, so I will look into adding that in a separate PR.

/assign
on AI-conformant kubernetes clusters.

We (aspirationally) aim to demonstrate the capabilities of the AI-conformance
profile. Where we cannot achieve production-grade inference, we hope to
nit: remove "profile" everywhere. We were advised not to use the term "profile" for Kubernetes AI conformance, given that there was historically an effort to define subsets (not supersets) of Kubernetes Conformance with this term.
def get_image_prefix():
    """Constructs the image prefix for a container image."""
    project_id = get_gcp_project()
    return f"gcr.io/{project_id}/"
nit: adopt the same change as in kubernetes-sigs/agent-sandbox#13, e.g. supporting an IMAGE_PREFIX env var.
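A minimal sketch of what that could look like, assuming the `get_gcp_project()` helper from the diff above; the `IMAGE_PREFIX` handling shown here is illustrative, not the agent-sandbox implementation:

```python
import os

def get_image_prefix():
    """Constructs the image prefix for a container image.

    Prefers an IMAGE_PREFIX environment variable; otherwise falls back
    to a gcr.io prefix derived from the active GCP project.
    """
    prefix = os.environ.get("IMAGE_PREFIX")
    if prefix:
        # Normalize so callers can always append an image name directly.
        return prefix if prefix.endswith("/") else prefix + "/"
    project_id = get_gcp_project()
    return f"gcr.io/{project_id}/"
```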
# gemma3-6cf4765df9-c4nmt gemma3 DEBUG 09-08 14:57:56 [__init__.py:99] CUDA platform is not available because: NVML Shared Library Not Found

# FROM vllm/vllm-openai:v0.10.0
nit: remove the commented-out part if it's not needed
1. `blob-server`, a statefulset with a persistent volume to hold the model blobs (files)
1. `gemma3`, a deployment running vLLM, with a frontend go process that will download the model from `blob-server`.
Do we need to merge this with https://github.com/kubernetes/examples/tree/master/AI/vllm-deployment? The other one doesn't provide persistent model storage.
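As a usage sketch for the two components above: once both are running, the vLLM deployment should expose an OpenAI-compatible API. The port and model name below are assumptions, not values taken from this PR:

```bash
# Forward the gemma3 deployment locally (vLLM's OpenAI-compatible
# server defaults to port 8000), then issue a test completion.
kubectl port-forward deployment/gemma3 8000:8000 &
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "gemma3", "prompt": "Hello", "max_tokens": 16}'
```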
```bash
kubectl delete deployment gemma3
kubectl delete statefulset blob-server
```
Need to delete the PVC as well for full cleanup
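A sketch of that extra step; the claim name follows the usual `<volumeClaimTemplate>-<statefulset>-<ordinal>` convention and is an assumption here, not taken from this PR:

```bash
# PVCs created from volumeClaimTemplates outlive the statefulset,
# so they must be removed explicitly. Verify the name first:
kubectl get pvc
# Then delete the claim (hypothetical name):
kubectl delete pvc data-blob-server-0
```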
selector:
  matchLabels:
    app: blob-server
#serviceName: blob-server
nit: remove it, given that it's not needed