Commit 1806583

zucchini-nlp, hmellor, and stevhliu authored
[docs] Create page on inference servers with transformers backend (#39550)
* draft docs on inference servers * Update docs/source/en/_toctree.yml Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> * update * dic build failed * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/_toctree.yml Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu 
<59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Update docs/source/en/transformers_as_backend.md Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com> * apply last suggestions --------- Co-authored-by: Harry Mellor <19981378+hmellor@users.noreply.github.com> Co-authored-by: Steven Liu <59462357+stevhliu@users.noreply.github.com>
1 parent cd98c1f commit 1806583

3 files changed

+260
-49
lines changed

docs/source/en/_toctree.yml

Lines changed: 4 additions & 2 deletions
@@ -72,8 +72,6 @@
7272
title: Caching
7373
- local: kv_cache
7474
title: KV cache strategies
75-
- local: serving
76-
title: Serving
7775
- local: llm_tutorial_optimization
7876
title: Getting the most out of LLMs
7977
- local: perplexity
@@ -105,6 +103,10 @@
105103
title: Agents
106104
- local: tools
107105
title: Tools
106+
- local: serving
107+
title: Serving
108+
- local: transformers_as_backend
109+
title: Inference server backends
108110
title: Inference
109111
- isExpanded: false
110112
sections:

docs/source/en/serving.md

Lines changed: 2 additions & 47 deletions
@@ -16,54 +16,9 @@ rendered properly in your Markdown viewer.
1616

1717
# Serving
1818

19-
Transformer models can be efficiently deployed using libraries such as vLLM, Text Generation Inference (TGI), and others. These libraries are designed for production-grade user-facing services, and can scale to multiple servers and millions of concurrent users.
19+
Transformer models can be efficiently deployed using libraries such as vLLM, Text Generation Inference (TGI), and others. These libraries are designed for production-grade user-facing services, and can scale to multiple servers and millions of concurrent users. Refer to [Transformers as Backend for Inference Servers](./transformers_as_backend) for usage examples.
2020

21-
You can also serve transformer models easily using the `transformers serve` CLI. This is ideal for experimentation purposes, or to run models locally for personal and private use.
22-
23-
## TGI
24-
25-
[TGI](https://huggingface.co/docs/text-generation-inference/index) can serve models that aren't [natively implemented](https://huggingface.co/docs/text-generation-inference/supported_models) by falling back on the Transformers implementation of the model. Some of TGI's high-performance features aren't available in the Transformers implementation, but other features like continuous batching and streaming are still supported.
26-
27-
> [!TIP]
28-
> Refer to the [Non-core model serving](https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models) guide for more details.
29-
30-
Serve a Transformers implementation the same way you'd serve a TGI model.
31-
32-
```docker
33-
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
34-
```
35-
36-
Add `--trust-remote-code` to the command to serve a custom Transformers model.
37-
38-
```docker
39-
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
40-
```
41-
42-
## vLLM
43-
44-
[vLLM](https://docs.vllm.ai/en/latest/index.html) can also serve a Transformers implementation of a model if it isn't [natively implemented](https://docs.vllm.ai/en/latest/models/supported_models.html#list-of-text-only-language-models) in vLLM.
45-
46-
Many features like quantization, LoRA adapters, and distributed inference and serving are supported for the Transformers implementation.
47-
48-
> [!TIP]
49-
> Refer to the [Transformers fallback](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers-fallback) section for more details.
50-
51-
By default, vLLM serves the native implementation and if it doesn't exist, it falls back on the Transformers implementation. But you can also set `--model-impl transformers` to explicitly use the Transformers model implementation.
52-
53-
```shell
54-
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
55-
--task generate \
56-
--model-impl transformers
57-
```
58-
59-
Add the `trust-remote-code` parameter to enable loading a remote code model.
60-
61-
```shell
62-
vllm serve Qwen/Qwen2.5-1.5B-Instruct \
63-
--task generate \
64-
--model-impl transformers \
65-
--trust-remote-code
66-
```
21+
You can also serve transformer models with the `transformers serve` CLI. This is ideal for experimentation, or for running models locally for personal and private use.
6722

6823
## Serve CLI
6924

docs/source/en/transformers_as_backend.md

Lines changed: 254 additions & 0 deletions
@@ -0,0 +1,254 @@
1+
<!--Copyright 2025 The HuggingFace Team. All rights reserved.
2+
3+
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
4+
the License. You may obtain a copy of the License at
5+
6+
http://www.apache.org/licenses/LICENSE-2.0
7+
8+
Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
9+
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
10+
specific language governing permissions and limitations under the License.
11+
12+
⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
13+
rendered properly in your Markdown viewer.
14+
15+
-->
16+
17+
# Inference server backends
18+
19+
Transformers' models are compatible with different inference servers like vLLM and SGLang. Instead of implementing a model for each inference server, you only need one model, which can be plugged into any inference server. It simplifies maintenance and makes it easy for users to use different inference servers for different use cases.
20+
21+
With Transformers as a backend, you can also serve any model - including custom and Hub-hosted models - without waiting for native support.
22+
23+
This guide shows how to use Transformers' models as a backend to some popular inference servers and how to build a model that supports all inference servers.
24+
25+
## vLLM
26+
27+
[vLLM](https://github.com/vllm-project/vllm) is a high-performance inference engine optimized for serving LLMs at scale. It supports many Transformers' models, including all decoder-only LLMs and several vision-language models (VLMs). VLMs currently support image inputs only, with video support planned.
28+
29+
vLLM automatically selects the best backend, and if a model isn’t natively supported, it falls back to the Transformers model. To explicitly use a Transformers' model, set `model_impl="transformers"`.
30+
31+
```python
32+
from vllm import LLM
33+
llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
34+
```
35+
Add `--model-impl transformers` to `vllm serve` to launch a server with a Transformers' model.
36+
37+
```bash
38+
vllm serve meta-llama/Llama-3.2-1B \
39+
--task generate \
40+
--model-impl transformers
41+
```
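
Once the server is running, it can be queried over HTTP. The sketch below is a minimal client, assuming the server listens on vLLM's default `http://localhost:8000` and exposes the OpenAI-compatible `/v1/completions` route. The helper names `build_completion_request` and `complete` are hypothetical and only used for illustration.

```python
import json
import urllib.request


def build_completion_request(prompt, model="meta-llama/Llama-3.2-1B",
                             base_url="http://localhost:8000", max_tokens=32):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    # base_url is an assumption: adjust it to wherever the server is listening.
    payload = {"model": model, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        url=f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["choices"][0]["text"]
```

With the server from the command above running, `complete("The capital of France is")` should return the model's continuation as a string.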
42+
43+
Refer to the [vLLM docs](https://docs.vllm.ai/en/latest/models/transformers_backend.html) for more usage examples and tips on using Transformers as the backend.
44+
45+
46+
## SGLang
47+
48+
[SGLang](https://github.com/sgl-project/sglang) is a high-performance, OpenAI-compatible server and runtime designed for chat-based LLMs. It offers fast inference, role-based conversation handling, and support for custom pipelines, making it great for building real-world LLM apps.
49+
50+
SGLang automatically falls back to the Transformers backend if a model isn’t natively supported. To explicitly use a Transformers' model, set `impl="transformers"`.
51+
52+
```python
53+
import sglang as sgl
54+
55+
llm = sgl.Engine("meta-llama/Llama-3.2-1B-Instruct", impl="transformers")
56+
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])
57+
```
58+
59+
Add `--impl transformers` to `sglang.launch_server` to launch a server with a Transformers' model.
60+
61+
62+
63+
64+
65+
66+
67+
```bash
68+
python3 -m sglang.launch_server \
69+
--model-path kyutai/helium-1-preview-2b \
70+
--impl transformers \
71+
--host 0.0.0.0 \
72+
--port 30000
73+
```
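
The launched server can then be queried over HTTP. The sketch below assumes SGLang's native `/generate` route, which takes a `text` field and a `sampling_params` object and returns a JSON body with a `text` field. The route, the payload shape, and the helper names `build_generate_request` and `generate` are assumptions for illustration, so check them against the SGLang docs for your version.

```python
import json
import urllib.request


def build_generate_request(text, base_url="http://localhost:30000", max_new_tokens=20):
    """Build a POST request for the (assumed) native /generate route."""
    # The port matches the --port value used in the launch command above.
    payload = {"text": text, "sampling_params": {"max_new_tokens": max_new_tokens}}
    return urllib.request.Request(
        url=f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(text, **kwargs):
    """Send the request and return the generated text."""
    request = build_generate_request(text, **kwargs)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["text"]
```

Because `kyutai/helium-1-preview-2b` is served here as a plain text-completion model, the sketch sends a raw prompt rather than a list of chat messages.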
74+
75+
Refer to the [SGLang docs](https://docs.sglang.ai/supported_models/transformers_fallback.html) for more usage examples and tips on using Transformers as the backend.
76+
77+
## TGI
78+
79+
[TGI](https://huggingface.co/docs/text-generation-inference/index) can serve models that aren't [natively implemented](https://huggingface.co/docs/text-generation-inference/supported_models) by falling back on the Transformers implementation of the model. Some of TGI's high-performance features aren't available in the Transformers implementation, but other features like continuous batching and streaming are still supported.
80+
81+
> [!TIP]
82+
> Refer to the [Non-core model serving](https://huggingface.co/docs/text-generation-inference/basic_tutorials/non_core_models) guide for more details.
83+
84+
Serve a Transformers implementation the same way you'd serve a TGI model.
85+
86+
```docker
87+
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id gpt2
88+
```
89+
90+
Add `--trust-remote-code` to the command to serve a custom Transformers model.
91+
92+
```docker
93+
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:latest --model-id <CUSTOM_MODEL_ID> --trust-remote-code
94+
```
95+
96+
## Building a compatible model backend
97+
98+
To ensure a model is compatible as a backend to any inference server, make sure it is compatible with Transformers and supports the [AttentionInterface](./attention_interface) class.
99+
100+
1. A model must be Transformers-compatible following the model [contribution guidelines](./add_new_model) or the [custom model contribution guidelines](./custom_models). Make sure the model has a valid `config.json` in its directory and a valid `auto_map` field pointing to the model class in the config.
101+
102+
2. A model's attention needs to be configurable with the [AttentionInterface](./attention_interface) to allow custom and optimized attention functions. This is important for enabling the performance features of the different inference servers.
103+
Use `ALL_ATTENTION_FUNCTIONS` when defining the attention layer and propagate `**kwargs` from the base `MyModel` class to the attention layers. Set `_supports_attention_backend` to `True` in [`PreTrainedModel`]. Expand the code below for an example.
104+
105+
<details>
106+
<summary>modeling_my_model.py</summary>
107+
108+
```python
from torch import nn

from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):
        ...
        # Look up the attention function selected in the config (for example "sdpa")
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        # Forward **kwargs so server-specific arguments reach the attention function
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...


class MyModel(PreTrainedModel):
    # Signals that the model's attention can be swapped through AttentionInterface
    _supports_attention_backend = True
```
130+
131+
</details>
132+
133+
3. This step is optional, but if you want to support tensor parallel and/or pipeline parallel features, add the following keys to the config.
134+
* `base_model_tp_plan` enables [tensor parallelism](./perf_infer_gpu_multi) by mapping fully qualified layer name patterns to tensor parallel styles. Only the `"colwise"` and `"rowwise"` partitioning strategies are currently supported.
135+
* `base_model_pp_plan` enables pipeline parallelism by mapping direct child layer names to tuples of lists of strings. The list in the first element of the tuple contains the names of the input arguments. The list in the last element of the tuple contains the names of the variables the layer outputs to in the modeling code.
136+
137+
Expand the code below for an example.
138+
139+
<details>
140+
<summary>configuration_my_model.py</summary>
141+
142+
```python
from transformers import PretrainedConfig


class MyConfig(PretrainedConfig):
    base_model_tp_plan = {
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }
```
161+
</details>
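
To make the pattern syntax in `base_model_tp_plan` more concrete, the sketch below shows one way a fully qualified parameter name could be matched against the plan: numeric layer indices are replaced with `*` before the lookup. This is a simplified illustration, not the matching code used by Transformers or by any inference server, and the helper name `tp_style_for` is hypothetical.

```python
TP_PLAN = {
    "layers.*.self_attn.k_proj": "colwise",
    "layers.*.self_attn.v_proj": "colwise",
    "layers.*.self_attn.o_proj": "rowwise",
    "layers.*.mlp.gate_proj": "colwise",
    "layers.*.mlp.up_proj": "colwise",
    "layers.*.mlp.down_proj": "rowwise",
}


def tp_style_for(param_name, tp_plan):
    """Return the tensor parallel style for a parameter, or None if it is not in the plan."""
    # Drop the trailing ".weight" or ".bias" so the name refers to the layer itself.
    layer_name = param_name.rsplit(".", 1)[0] if "." in param_name else param_name
    # Replace purely numeric path components (layer indices) with "*".
    generic_name = ".".join(
        "*" if part.isdigit() else part for part in layer_name.split(".")
    )
    return tp_plan.get(generic_name)


print(tp_style_for("layers.0.self_attn.k_proj.weight", TP_PLAN))
print(tp_style_for("layers.11.mlp.down_proj.weight", TP_PLAN))
print(tp_style_for("norm.weight", TP_PLAN))
```

Layers that don't appear in the plan, such as `norm` here, return `None` in this sketch, which stands for "not sharded".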
162+
163+
### Multimodal models
164+
165+
For multimodal models, you need to include a few more changes on top of the general recommendations. These rules ensure that your model integrates properly with multimodal data.
166+
167+
1. A multimodal model requires a base `MyMultimodalModel` class to handle multimodal fusion without a language modeling head and a separate generative class that adds a head.
168+
169+
The base model needs to implement the `get_image_features()` method to accept image pixel values and return encoded outputs. These are later merged with the language embeddings and don't require any postprocessing. The shape of the returned features must match the number of input images. If a vision encoder returns variable-length outputs (patch-based), return a list of 2D tensors of size `(image_seq_len, image_dim)` for each image.
170+
171+
Expand the code below for an example.
172+
173+
<details>
174+
<summary>modeling_my_multimodal_model.py</summary>
175+
176+
```python
from torch import nn

from transformers import AutoModel
from transformers.generation import GenerationMixin


class MyMultimodalModel(MyMultimodalPreTrainedModel):
    def __init__(self, config):
        super().__init__(config)
        self.language_model = AutoModel.from_config(config.text_config)
        self.vision_tower = AutoModel.from_config(config.vision_config)
        # vision_dim and text_dim stand for the encoder and decoder hidden sizes
        self.multimodal_projection = nn.Linear(vision_dim, text_dim)

    def get_image_features(self, pixel_values):
        # Encode the images; the output is merged with the text embeddings later
        return self.vision_tower(pixel_values).last_hidden_state

    def forward(self, input_ids, pixel_values, **kwargs):
        # process your inputs
        return MyModelOutputWithPast(
            last_hidden_state=last_hidden_state,
            image_hidden_states=image_features,
            # [...]
        )


class MyMultimodalModelForConditionalGeneration(MyMultimodalPreTrainedModel, GenerationMixin):
    def __init__(self, config):
        super().__init__(config)
        self.model = MyMultimodalModel(config)
        # The generative class only adds the language modeling head
        self.lm_head = nn.Linear(hidden_dim, vocab_size)
```
203+
</details>
204+
205+
206+
2. A multimodal model config must be nested with the following fields.
207+
* text_config: decoder language model config
208+
* vision_config: vision encoder config
209+
* image_token_id: ID of the image placeholder token used in the input to indicate image position
210+
211+
3. A multimodal model's processing class must have the `self.image_token` and `self.image_token_id` attributes. These are placeholder tokens used to indicate image positions in the input. The placeholder token is the same token used in the input prompt and as the mask when scattering image features into the embeddings.
212+
213+
The processing class also needs a `self._get_num_multimodal_tokens` method to compute the number of placeholder tokens needed for multimodal inputs with given sizes and to return a [`MultiModalData`] object. Placeholders for row and column tokens don't count as image placeholders. Only the tokens that are actually replaced by image features are counted.
214+
215+
Finally, when `return_mm_token_type_ids=True`, the class has to return `mm_token_type_ids` to indicate whether each position is a text token (`0`) or image placeholder token (`1`). Each image's token type IDs must be contiguous with no breaks between consecutive ones.
216+
217+
Expand the code below for an example.
218+
219+
<details>
220+
<summary>processing_my_multimodal_model.py</summary>
221+
222+
```python
import numpy as np

from transformers import BatchFeature
from transformers.processing_utils import MultiModalData, ProcessorMixin


class MyMultimodalProcessor(ProcessorMixin):

    def __call__(self, images=None, text=None, **kwargs):
        # input_ids, text_inputs, image_inputs, return_tensors, and return_mm_token_type_ids
        # are computed earlier in this method (omitted here)
        if return_mm_token_type_ids:
            # 1 marks image placeholder positions, 0 marks text positions
            mm_token_type_ids = np.zeros_like(input_ids)
            mm_token_type_ids[input_ids == self.image_token_id] = 1
            text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
        return BatchFeature(data={**text_inputs, **image_inputs}, tensor_type=return_tensors)

    def _get_num_multimodal_tokens(self, image_sizes=None, **kwargs):
        """
        Computes the number of placeholder tokens needed for multimodal inputs with the given sizes.
        Args:
            image_sizes (`list[list[int]]`, *optional*):
                The input sizes formatted as (height, width) per each image.
        Returns:
            `MultiModalData`: A `MultiModalData` object holding number of tokens per each of the provided
            input modalities, along with other useful data.
        """
        vision_data = {}
        if image_sizes is not None:
            num_image_tokens = [256] * len(image_sizes)  # 256 placeholder tokens for each image always
            num_image_patches = [1] * len(image_sizes)  # no patching, thus each image is processed as a single base image
            vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches})
        return MultiModalData(**vision_data)
```
249+
</details>
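
The contiguity requirement on `mm_token_type_ids` can be checked with plain Python. The sketch below is only an illustration: the token ID `32000` and the helpers `mm_token_type_ids`, `image_spans`, and `check_spans` are hypothetical and not part of Transformers. It marks placeholder positions, finds each unbroken run of placeholders, and compares the run lengths with the expected number of tokens per image.

```python
from itertools import groupby

IMAGE_TOKEN_ID = 32000  # hypothetical placeholder token ID


def mm_token_type_ids(input_ids, image_token_id=IMAGE_TOKEN_ID):
    """Return 1 for image placeholder positions and 0 for text positions."""
    return [1 if token == image_token_id else 0 for token in input_ids]


def image_spans(type_ids):
    """Return (start, length) for every contiguous run of image placeholders."""
    spans = []
    position = 0
    for value, run in groupby(type_ids):
        length = len(list(run))
        if value == 1:
            spans.append((position, length))
        position += length
    return spans


def check_spans(type_ids, num_image_tokens):
    """True if there is one unbroken run per image with the expected length.

    Two images placed back to back with no text token between them would be
    seen as a single run by this simplified check.
    """
    return [length for _, length in image_spans(type_ids)] == list(num_image_tokens)


# Two images with 3 placeholder tokens each, separated by text tokens
input_ids = [1, 32000, 32000, 32000, 5, 6, 32000, 32000, 32000, 2]
type_ids = mm_token_type_ids(input_ids)
print(type_ids)
print(image_spans(type_ids))
print(check_spans(type_ids, [3, 3]))
```

A sequence such as `[1, 32000, 7, 32000, 32000]` fails the check for a single 3-token image, because a text token splits the placeholders into two runs.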
250+
251+
## Resources
252+
253+
* Read the [Transformers backend integration in vLLM](https://blog.vllm.ai/2025/04/11/transformers-backend.html) blog post for more details about the Transformers backend in vLLM.
254+
* Read the [Transformers backend integration in SGLang](https://huggingface.co/blog/transformers-backend-sglang) blog post for more details about the Transformers backend in SGLang.
