README.md (1 addition, 1 deletion)
@@ -27,7 +27,7 @@ Easy, advanced inference platform for large language models on Kubernetes
## Features Overview
- **Ease of Use**: People can quickly deploy an LLM service with minimal configuration.
-- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
+- **Broad Backend Support**: llmaz supports a wide range of advanced inference backends for different scenarios, like [vLLM](https://github.com/vllm-project/vllm), [Text-Generation-Inference](https://github.com/huggingface/text-generation-inference), [SGLang](https://github.com/sgl-project/sglang), [llama.cpp](https://github.com/ggerganov/llama.cpp). Find the full list of supported backends [here](./docs/support-backends.md).
- **Scaling Efficiency (WIP)**: llmaz works smoothly with autoscaling components like [Cluster-Autoscaler](https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler) or [Karpenter](https://github.com/kubernetes-sigs/karpenter) to support elastic scenarios.
- **Accelerator Fungibility (WIP)**: llmaz supports serving the same LLM with various accelerators to optimize cost and performance.
- **SOTA Inference**: llmaz supports the latest cutting-edge research like [Speculative Decoding](https://arxiv.org/abs/2211.17192) or [Splitwise](https://arxiv.org/abs/2311.18677) (WIP) on Kubernetes.
docs/examples/README.md (5 additions, 0 deletions)
@@ -9,6 +9,7 @@ We provide a set of examples to help you serve large language models, by default
- [Deploy models from ObjectStore](#deploy-models-from-objectstore)
- [Deploy models via SGLang](#deploy-models-via-sglang)
- [Deploy models via llama.cpp](#deploy-models-via-llamacpp)
+- [Deploy models via text-generation-inference](#deploy-models-via-tgi)
- [Speculative Decoding with vLLM](#speculative-decoding-with-vllm)

### Deploy models from Huggingface
@@ -41,6 +42,10 @@ By default, we use [vLLM](https://github.com/vllm-project/vllm) as the inference
[llama.cpp](https://github.com/ggerganov/llama.cpp) can serve models on a wide variety of hardware, such as CPUs; see the [example](./llamacpp/) here.

+### Deploy models via TGI
+
+[text-generation-inference](https://github.com/huggingface/text-generation-inference) is used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints; see the [example](./tgi/) here.
+
### Speculative Decoding with vLLM
[Speculative Decoding](https://arxiv.org/abs/2211.17192) can improve inference performance efficiently; see the [example](./speculative-decoding/vllm/) here.
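For readers who want to try the new TGI example, here is a minimal sketch of calling the resulting service once it is running; the Service name `tgi-service`, port `8080`, and the prompt are placeholder assumptions, not values taken from the example manifests:

```python
import requests

# Placeholder in-cluster URL; substitute the Service name and port
# created by the ./tgi/ example in your cluster.
TGI_URL = "http://tgi-service:8080"

# text-generation-inference exposes a /generate endpoint that accepts a prompt
# ("inputs") plus generation parameters and returns the generated text as JSON.
resp = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What is Kubernetes?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```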
docs/support-backends.md (4 additions, 0 deletions)
@@ -8,6 +8,10 @@
[SGLang](https://github.com/sgl-project/sglang) is yet another fast serving framework for large language models and vision language models.
+## Text-Generation-Inference
+
+[text-generation-inference](https://github.com/huggingface/text-generation-inference) is a Rust, Python and gRPC server for text generation inference, used in production at Hugging Face to power Hugging Chat, the Inference API and Inference Endpoints.
+
## vLLM
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput and memory-efficient inference and serving engine for LLMs.
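As a rough companion to the backend descriptions above, this sketch shows how a client might query a vLLM-backed deployment through vLLM's OpenAI-compatible HTTP API; the Service address and model id are placeholder assumptions, not values defined by llmaz:

```python
import requests

# Placeholder endpoint; replace with the Service exposed by your vLLM deployment.
VLLM_URL = "http://vllm-service:8000"

# vLLM's OpenAI-compatible server accepts /v1/completions
# (and /v1/chat/completions) requests.
resp = requests.post(
    f"{VLLM_URL}/v1/completions",
    json={
        "model": "facebook/opt-125m",  # placeholder: use the model id you actually deployed
        "prompt": "Kubernetes is",
        "max_tokens": 32,
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["text"])
```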