Commit a52315f

Support TGI as another backendruntime
Signed-off-by: kerthcet <kerthcet@gmail.com>
1 parent 8841b24 commit a52315f

10 files changed: 121 additions and 5 deletions


README.md

Lines changed: 1 addition & 1 deletion

@@ -27,7 +27,7 @@ Easy, advanced inference platform for large language models on Kubernetes
 ## Features Overview

 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
-- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
+- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
 - **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
 - **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
 - **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.

chart/templates/backends/tgi.yaml

Lines changed: 29 additions & 0 deletions

@@ -0,0 +1,29 @@
+{{- if .Values.backendRuntime.install -}}
+apiVersion: inference.llmaz.io/v1alpha1
+kind: BackendRuntime
+metadata:
+  labels:
+    app.kubernetes.io/name: backendruntime
+    app.kubernetes.io/part-of: llmaz
+    app.kubernetes.io/created-by: llmaz
+  name: tgi
+spec:
+  image: ghcr.io/huggingface/text-generation-inference
+  version: 2.3.1
+  # Do not edit the preset argument name unless you know what you're doing.
+  # Free to add more arguments with your requirements.
+  args:
+    - name: default
+      flags:
+        - --model-id
+        - "{{`{{ .ModelPath }}`}}"
+        - --port
+        - "8080"
+  resources:
+    requests:
+      cpu: 4
+      memory: 8Gi
+    limits:
+      cpu: 4
+      memory: 8Gi
+{{- end }}
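
A note on the templating above: the backtick-wrapped braces are Helm's idiom for emitting a literal Go-template string, so after `helm template` the flag value survives as `{{ .ModelPath }}` and is only substituted by llmaz when the backend is launched for a concrete model. A minimal sketch of the rendered args (assuming `.Values.backendRuntime.install` is true; illustrative, not the chart's verbatim output):

# Sketch: BackendRuntime args as rendered by `helm template` (illustrative).
# Helm strips the outer {{` ... `}} escaping; the inner {{ .ModelPath }} is
# left intact for llmaz to replace with the resolved model path at serving time.
args:
  - name: default
    flags:
      - --model-id
      - "{{ .ModelPath }}"
      - --port
      - "8080"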

chart/values.yaml

Lines changed: 2 additions & 2 deletions

@@ -32,8 +32,8 @@ controllerManager:
       drop:
       - ALL
     image:
-      repository: inftyai/llmaz
-      tag: v0.0.7
+      repository: inftyai/llmaz-test
+      tag: 1010-03
     resources:
       limits:
         cpu: 500m

config/manager/kustomization.yaml

Lines changed: 2 additions & 2 deletions

@@ -4,5 +4,5 @@ apiVersion: kustomize.config.k8s.io/v1beta1
 kind: Kustomization
 images:
 - name: controller
-  newName: inftyai/llmaz
-  newTag: v0.0.7
+  newName: inftyai/llmaz-test
+  newTag: 1010-03

docs/examples/README.md

Lines changed: 5 additions & 0 deletions

@@ -9,6 +9,7 @@ We provide a set of examples to help you serve large language models, by default
 - [Deploy models from ObjectStore](#deploy-models-from-objectstore)
 - [Deploy models via SGLang](#deploy-models-via-sglang)
 - [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
+- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
 - [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)

 ### Deploy models from Huggingface

@@ -41,6 +42,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference

 [llama.cpp](https://github.com/ggerganov/llama.cpp) can serve models on a wide variety of hardwares, such as CPU, see [example](./llamacpp/) here.

+### Deploy models via TGI
+
+[text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints. See the [example](./tgi/) here.
+
 ### Speculative Decoding with vLLM

 [Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.

docs/examples/tgi/model.yaml

Lines changed: 13 additions & 0 deletions

@@ -0,0 +1,13 @@
+apiVersion: llmaz.io/v1alpha1
+kind: OpenModel
+metadata:
+  name: qwen2-0--5b
+spec:
+  familyName: qwen2
+  source:
+    modelHub:
+      modelID: Qwen/Qwen2-0.5B-Instruct
+  inferenceFlavors:
+  - name: t4 # GPU type
+    requests:
+      nvidia.com/gpu: 1

docs/examples/tgi/playground.yaml

Lines changed: 10 additions & 0 deletions

@@ -0,0 +1,10 @@
+apiVersion: inference.llmaz.io/v1alpha1
+kind: Playground
+metadata:
+  name: qwen2-0--5b
+spec:
+  replicas: 1
+  modelClaim:
+    modelName: qwen2-0--5b
+  backendRuntimeConfig:
+    name: tgi
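
The Playground above leans entirely on the `tgi` preset. The integration test added later in this commit also exercises per-Playground overrides (version, args, env, and resources), presumably surfaced under the same `backendRuntimeConfig` block. A hedged sketch of such an override is below; the field names are assumptions inferred from the test's wrapper helpers (BackendRuntimeVersion, BackendRuntimeArgs, BackendRuntimeEnv, BackendRuntimeRequest/Limit), not taken from the API reference:

# Illustrative only: overriding parts of the tgi preset from a Playground.
# Field names below are assumptions inferred from the integration test wrappers.
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    name: tgi
    version: "2.3.1"       # pin a specific TGI image tag (assumed field)
    args:                  # extra flags appended to the preset "default" args (assumed field)
      - --max-total-tokens
      - "2048"
    resources:             # per-Playground resource override (assumed field)
      requests:
        cpu: "4"
        memory: 8Gi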

docs/support-backends.md

Lines changed: 4 additions & 0 deletions

@@ -8,6 +8,10 @@

 [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.

+## Text-Generation-Inference
+
+[text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference. It is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.
+
 ## vLLM

 [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs

test/config/backends/tgi.yaml

Lines changed: 27 additions & 0 deletions

@@ -0,0 +1,27 @@
+apiVersion: inference.llmaz.io/v1alpha1
+kind: BackendRuntime
+metadata:
+  labels:
+    app.kubernetes.io/name: backendruntime
+    app.kubernetes.io/part-of: llmaz
+    app.kubernetes.io/created-by: llmaz
+  name: tgi
+spec:
+  image: ghcr.io/huggingface/text-generation-inference
+  version: 2.3.1
+  # Do not edit the preset argument name unless you know what you're doing.
+  # Free to add more arguments with your requirements.
+  args:
+    - name: default
+      flags:
+        - --model-id
+        - "{{`{{ .ModelPath }}`}}"
+        - --port
+        - "8080"
+  resources:
+    requests:
+      cpu: 4
+      memory: 8Gi
+    limits:
+      cpu: 4
+      memory: 8Gi

test/integration/controller/inference/playground_test.go

Lines changed: 28 additions & 0 deletions

@@ -236,6 +236,34 @@ var _ = ginkgo.Describe("playground controller test", func() {
                 },
             },
         }),
+        ginkgo.Entry("advance configured Playground with tgi", &testValidatingCase{
+            makePlayground: func() *inferenceapi.Playground {
+                return wrapper.MakePlayground("playground", ns.Name).ModelClaim(model.Name).Label(coreapi.ModelNameLabelKey, model.Name).
+                    BackendRuntime("tgi").BackendRuntimeVersion("main").BackendRuntimeArgs([]string{"--model-id", "Qwen/Qwen2-0.5B-Instruct"}).BackendRuntimeEnv("FOO", "BAR").
+                    BackendRuntimeRequest("cpu", "1").BackendRuntimeLimit("cpu", "10").
+                    Obj()
+            },
+            updates: []*update{
+                {
+                    updateFunc: func(playground *inferenceapi.Playground) {
+                        gomega.Expect(k8sClient.Create(ctx, playground)).To(gomega.Succeed())
+                    },
+                    checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
+                        validation.ValidatePlayground(ctx, k8sClient, playground)
+                        validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundProgressing, "Pending", metav1.ConditionTrue)
+                    },
+                },
+                {
+                    updateFunc: func(playground *inferenceapi.Playground) {
+                        util.UpdateLwsToReady(ctx, k8sClient, playground.Name, playground.Namespace)
+                    },
+                    checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
+                        validation.ValidatePlayground(ctx, k8sClient, playground)
+                        validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundAvailable, "PlaygroundReady", metav1.ConditionTrue)
+                    },
+                },
+            },
+        }),
         ginkgo.Entry("playground is created when service exists with the same name", &testValidatingCase{
             makePlayground: func() *inferenceapi.Playground {
                 return util.MockASamplePlayground(ns.Name)