
Commit e658a4e

Merge pull request #182 from kerthcet/feat/support-tgi
Support TGI as another backendRuntime
2 parents 8841b24 + b5d8563 commit e658a4e

8 files changed: +119 −1 lines changed

README.md

Lines changed: 1 addition & 1 deletion
@@ -27,7 +27,7 @@ Easy, advanced inference platform for large language models on Kubernetes
 ## Features Overview

 - **Easy of Use**: People can quick deploy a LLM service with minimal configurations.
-- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
+- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
 - **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
 - **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
 - **SOTA Inference**: llmaz supports the latest cutting-edge researches like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677)(WIP) to run on Kubernetes.

chart/templates/backends/tgi.yaml

Lines changed: 29 additions & 0 deletions
{{- if .Values.backendRuntime.install -}}
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tgi
spec:
  image: ghcr.io/huggingface/text-generation-inference
  version: 2.3.1
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  args:
    - name: default
      flags:
        - --model-id
        - "{{`{{ .ModelPath }}`}}"
        - --port
        - "8080"
  resources:
    requests:
      cpu: 4
      memory: 8Gi
    limits:
      cpu: 4
      memory: 8Gi
{{- end }}

docs/examples/README.md

Lines changed: 5 additions & 0 deletions
@@ -9,6 +9,7 @@ We provide a set of examples to help you serve large language models, by default
 - [Deploy models from ObjectStore](#deploy-models-from-objectstore)
 - [Deploy models via SGLang](#deploy-models-via-sglang)
 - [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
+- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
 - [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)

 ### Deploy models from Huggingface

@@ -41,6 +42,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference

 [llama.cpp](https://github.com/ggerganov/llama.cpp) can serve models on a wide variety of hardwares, such as CPU, see [example](./llamacpp/) here.

+### Deploy models via text-generation-inference
+
+[text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints; see [example](./tgi/) here.
+
 ### Speculative Decoding with vLLM

 [Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently, see [example](./speculative-decoding/vllm/) here.

docs/examples/tgi/model.yaml

Lines changed: 13 additions & 0 deletions
apiVersion: llmaz.io/v1alpha1
kind: OpenModel
metadata:
  name: qwen2-0--5b
spec:
  familyName: qwen2
  source:
    modelHub:
      modelID: Qwen/Qwen2-0.5B-Instruct
  inferenceFlavors:
    - name: t4 # GPU type
      requests:
        nvidia.com/gpu: 1

docs/examples/tgi/playground.yaml

Lines changed: 10 additions & 0 deletions
apiVersion: inference.llmaz.io/v1alpha1
kind: Playground
metadata:
  name: qwen2-0--5b
spec:
  replicas: 1
  modelClaim:
    modelName: qwen2-0--5b
  backendRuntimeConfig:
    name: tgi

docs/support-backends.md

Lines changed: 6 additions & 0 deletions
@@ -1,5 +1,7 @@
 # All Kinds of Supported Inference Backends

+If you want to integrate more backends into llmaz, please refer to this [PR](https://github.com/InftyAI/llmaz/pull/182). Contributions are always welcome.
+
 ## llama.cpp

 [llama.cpp](https://github.com/ggerganov/llama.cpp) is to enable LLM inference with minimal setup and state-of-the-art performance on a wide variety of hardware - locally and in the cloud.

@@ -8,6 +10,10 @@

 [SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.

+## Text-Generation-Inference
+
+[text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference, used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.
+
 ## vLLM

 [vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs
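As that note suggests, integrating another backend largely comes down to contributing a BackendRuntime manifest shaped like the tgi one in this commit. A minimal sketch for a hypothetical backend; the name, image, and flags below are placeholders rather than a supported integration:

apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  name: my-backend                # hypothetical backend name
spec:
  image: example.com/my-backend   # placeholder serving image
  version: v0.1.0
  args:
    - name: default               # keep the preset name llmaz expects
      flags:
        - --model
        - "{{ .ModelPath }}"      # llmaz renders this with the claimed model's path
        - --port
        - "8080"
  resources:
    requests:
      cpu: 2
      memory: 4Gi
    limits:
      cpu: 2
      memory: 4Gi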

test/config/backends/tgi.yaml

Lines changed: 27 additions & 0 deletions
apiVersion: inference.llmaz.io/v1alpha1
kind: BackendRuntime
metadata:
  labels:
    app.kubernetes.io/name: backendruntime
    app.kubernetes.io/part-of: llmaz
    app.kubernetes.io/created-by: llmaz
  name: tgi
spec:
  image: ghcr.io/huggingface/text-generation-inference
  version: 2.3.1
  # Do not edit the preset argument name unless you know what you're doing.
  # Free to add more arguments with your requirements.
  args:
    - name: default
      flags:
        - --model-id
        - "{{`{{ .ModelPath }}`}}"
        - --port
        - "8080"
  resources:
    requests:
      cpu: 4
      memory: 8Gi
    limits:
      cpu: 4
      memory: 8Gi

test/integration/controller/inference/playground_test.go

Lines changed: 28 additions & 0 deletions
@@ -236,6 +236,34 @@ var _ = ginkgo.Describe("playground controller test", func() {
 				},
 			},
 		}),
+		ginkgo.Entry("advance configured Playground with tgi", &testValidatingCase{
+			makePlayground: func() *inferenceapi.Playground {
+				return wrapper.MakePlayground("playground", ns.Name).ModelClaim(model.Name).Label(coreapi.ModelNameLabelKey, model.Name).
+					BackendRuntime("tgi").BackendRuntimeVersion("main").BackendRuntimeArgs([]string{"--model-id", "Qwen/Qwen2-0.5B-Instruct"}).BackendRuntimeEnv("FOO", "BAR").
+					BackendRuntimeRequest("cpu", "1").BackendRuntimeLimit("cpu", "10").
+					Obj()
+			},
+			updates: []*update{
+				{
+					updateFunc: func(playground *inferenceapi.Playground) {
+						gomega.Expect(k8sClient.Create(ctx, playground)).To(gomega.Succeed())
+					},
+					checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
+						validation.ValidatePlayground(ctx, k8sClient, playground)
+						validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundProgressing, "Pending", metav1.ConditionTrue)
+					},
+				},
+				{
+					updateFunc: func(playground *inferenceapi.Playground) {
+						util.UpdateLwsToReady(ctx, k8sClient, playground.Name, playground.Namespace)
+					},
+					checkFunc: func(ctx context.Context, k8sClient client.Client, playground *inferenceapi.Playground) {
+						validation.ValidatePlayground(ctx, k8sClient, playground)
+						validation.ValidatePlaygroundStatusEqualTo(ctx, k8sClient, playground, inferenceapi.PlaygroundAvailable, "PlaygroundReady", metav1.ConditionTrue)
+					},
+				},
+			},
+		}),
 		ginkgo.Entry("playground is created when service exists with the same name", &testValidatingCase{
 			makePlayground: func() *inferenceapi.Playground {
 				return util.MockASamplePlayground(ns.Name)
