11 changes: 9 additions & 2 deletions docs/source/en/_toctree.yml
@@ -123,8 +123,6 @@
title: Agents
- local: tools
title: Tools
- local: transformers_as_backend
title: Transformers as modeling backend
title: Inference
- isExpanded: false
sections:
@@ -224,6 +222,15 @@
- local: quantization/contribute
title: Contribute
title: Quantization
- isExpanded: false
sections:
- local: community_integrations/vllm
title: vLLM
- local: community_integrations/sglang
title: SGLang
- local: community_integrations/transformers_as_backend
title: Building a compatible model backend for inference
title: Community integrations
- isExpanded: false
sections:
- local: serialization
52 changes: 52 additions & 0 deletions docs/source/en/community_integrations/sglang.md
@@ -0,0 +1,52 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# SGLang
> **Member Author:** maybe @SunMarc can take a look at the SGLang section please 🙂


[SGLang](https://docs.sglang.ai) is a low-latency, high-throughput inference engine for large language models (LLMs). It also includes a frontend language for building agentic workflows.

Set `model_impl="transformers"` to load a model with the Transformers modeling backend.

```py
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", model_impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])
```

Pass `--model-impl transformers` to the `sglang.launch_server` command for online serving.

```bash
python3 -m sglang.launch_server \
--model-path meta-llama/Llama-3.2-1B-Instruct \
--model-impl transformers \
--host 0.0.0.0 \
--port 30000
```
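
Once the server is up, any OpenAI-compatible client can query it. The snippet below is a minimal sketch, assuming the `openai` Python package is installed and the server launched above is reachable on `localhost:30000`.

```py
from openai import OpenAI

# Assumes the sglang.launch_server command above is running locally on port 30000.
client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="meta-llama/Llama-3.2-1B-Instruct",
    messages=[{"role": "user", "content": "What is the capital of France?"}],
    max_tokens=20,
)
print(response.choices[0].message.content)
```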

Setting `model_impl="transformers"` tells SGLang to skip its native model matching and use the `TransformersModel` backend instead. [`PretrainedConfig.from_pretrained`] loads the config and [`AutoModel.from_config`] resolves the model class.

During loading, `_attn_implementation` is set to `"sglang"`, which routes attention calls through SGLang so that RadixAttention kernels replace the standard attention layers. SGLang's parallel linear classes also replace the linear layers to support tensor parallelism, so the model benefits from SGLang's optimizations.

> [!WARNING]
> Compatible models require `_supports_attention_backend=True` so SGLang can control attention execution. See the [Building a compatible model backend for inference](./transformers_as_backend#model-implementation) guide for details.

The [load_weights](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/transformers.py#L277) function populates the model with weights.

## Resources

- The [SGLang docs](https://docs.sglang.ai/supported_models/transformers_fallback.html) have more usage examples and tips for using Transformers as a backend.
- The [Transformers backend integration in SGLang](https://huggingface.co/blog/transformers-backend-sglang) blog post explains what this integration enables.
140 changes: 140 additions & 0 deletions docs/source/en/community_integrations/transformers_as_backend.md
@@ -0,0 +1,140 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Building a compatible model backend for inference

Transformers models are compatible with inference engines like [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://docs.sglang.ai). This lets you reuse the same Transformers model everywhere instead of reimplementing it from scratch for each inference engine, and it also covers Transformers models that neither engine supports natively.

This guide shows you how to implement a model in Transformers that works as a backend for any inference engine.

## Model implementation

1. Follow the model [contribution guidelines](./add_new_model) or the [custom model contribution guidelines](./custom_models). The model must have a valid `config.json` in its directory and a valid `auto_map` field pointing to the model class in the config.

2. Use the [`AttentionInterface`] class for custom and optimized attention functions. This interface unlocks each inference engine's performance features.

Use `ALL_ATTENTION_FUNCTIONS` when defining the attention layer and propagate `**kwargs` from the base `MyModel` class to the attention layers. Set `_supports_attention_backend` to `True` in [`PreTrainedModel`].

Expand the code below for an example.

<details>
<summary>modeling_my_model.py</summary>

```python
from torch import nn

from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):
        ...
        # Look up the attention function selected through `config._attn_implementation`
        # (for example "sdpa", "flash_attention_2", or an inference engine's own backend).
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...


class MyModel(PreTrainedModel):
    _supports_attention_backend = True
```

</details>

3. Enable optional tensor or pipeline parallelism by adding the following keys to [`PreTrainedConfig`]. A short loading sketch follows this list.

* `base_model_tp_plan` enables [tensor parallelism](./perf_infer_gpu_multi) by mapping fully qualified layer name patterns to tensor parallel styles. Only the `"colwise"` and `"rowwise"` partitioning strategies are supported.
* `base_model_pp_plan` enables pipeline parallelism by mapping direct child layer names to tuples of lists of strings. The first element of the tuple lists the names of the input arguments and the second element lists the variable names of the layer outputs in the modeling code.

Expand the code below for an example.

<details>
<summary>configuration_my_model.py</summary>

```python
from transformers import PreTrainedConfig


class MyConfig(PreTrainedConfig):
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }
```

</details>
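
With the attention interface and parallelism plans in place, the model can be loaded the same way an inference engine would consume it. The snippet below is a minimal sketch using a hypothetical `my-org/my-model` checkpoint and script name; `tp_plan="auto"` only takes effect in a distributed run, for example under `torchrun`.

```python
# Run with: torchrun --nproc-per-node 4 load_my_model.py
from transformers import AutoModelForCausalLM

# "my-org/my-model" is a hypothetical checkpoint used for illustration.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-model",
    attn_implementation="sdpa",  # or any attention backend registered with AttentionInterface
    tp_plan="auto",              # shards weights across the run's GPUs following base_model_tp_plan
)
print(model)
```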

## Multimodal models

Multimodal models require additional changes beyond the [vision language model contribution checklist](./contributing#vision-language-model-contribution-checklist). These changes ensure multimodal inputs are properly processed.

1. The [`ProcessorMixin`] class must include the `self.image_token` and `self.image_token_id` attributes. These are placeholder tokens that mark image positions in the input: the same token appears in the input prompt and is used in the modeling code to scatter image features.

2. The [`ProcessorMixin`] class must include a `self._get_num_multimodal_tokens` method. This method computes the number of placeholder tokens required for multimodal inputs with given sizes. It returns a [`MultiModalData`] object. Placeholders between `<image>` tokens, such as row or column tokens, don't count as image placeholders. Count only tokens replaced by image features later in the modeling code.

3. The [`ProcessorMixin`] class must check the value of `return_mm_token_type_ids` and return `mm_token_type_ids`, which indicates whether each position is a text token (`0`), an image placeholder token (`1`), or a video placeholder token (`2`). Each multimodal token type id sequence must be contiguous, with no breaks between consecutive placeholder tokens, so treat special begin, end, row, and column tokens as placeholders too.

Expand the code below for an example.

<details>
<summary>processing_my_multimodal_model.py</summary>

```python
import numpy as np

from transformers import BatchFeature, ProcessorMixin
from transformers.processing_utils import MultiModalData  # assumed import path


class MyMultimodalProcessor(ProcessorMixin):

    def __call__(self, images=None, text=None, return_mm_token_type_ids=False, return_tensors=None, **kwargs):
        # Tokenize `text` and preprocess `images` here to build `text_inputs`, `image_inputs`, and `input_ids`.
        ...
        if return_mm_token_type_ids:
            # Mark image placeholder positions with 1 and leave text positions at 0.
            mm_token_type_ids = np.zeros_like(input_ids)
            mm_token_type_ids[input_ids == self.image_token_id] = 1
            text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
        return BatchFeature(data={**text_inputs, **image_inputs}, tensor_type=return_tensors)

    def _get_num_multimodal_tokens(self, image_sizes=None, **kwargs):
        """
        Computes the number of placeholder tokens needed for multimodal inputs with the given sizes.

        Args:
            image_sizes (`list[list[int]]`, *optional*):
                The input sizes formatted as (height, width) per each image.

        Returns:
            `MultiModalData`: A `MultiModalData` object holding number of tokens per each of the provided
            input modalities, along with other useful data.
        """
        vision_data = {}
        if image_sizes is not None:
            num_image_tokens = [256] * len(image_sizes)  # always 256 placeholder tokens per image in this example
            num_image_patches = [1] * len(image_sizes)  # no patching, so each image is processed as a single base image
            vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches})
        return MultiModalData(**vision_data)
```

</details>
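
As a quick check, the processor's token accounting can be exercised directly. This is a sketch that reuses the hypothetical `MyMultimodalProcessor` above with a hypothetical checkpoint, and assumes [`MultiModalData`] exposes the fields it was built with.

```python
processor = MyMultimodalProcessor.from_pretrained("my-org/my-multimodal-model")  # hypothetical checkpoint

# Two images of different sizes; the sketch above always allocates 256 placeholders per image.
mm_data = processor._get_num_multimodal_tokens(image_sizes=[[336, 336], [672, 672]])
print(mm_data.num_image_tokens)   # [256, 256]
print(mm_data.num_image_patches)  # [1, 1]
```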

## Resources

* Read the [Transformers backend integration in vLLM](https://blog.vllm.ai/2025/04/11/transformers-backend.html) blog post for more details.
* Read the [Transformers backend integration in SGLang](https://huggingface.co/blog/transformers-backend-sglang) blog post for more details.
47 changes: 47 additions & 0 deletions docs/source/en/community_integrations/vllm.md
@@ -0,0 +1,47 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# vLLM
> **Member Author:** @hmellor, would you mind taking a look at the vLLM section please? 🙂


[vLLM](https://github.com/vllm-project/vllm) is a high-throughput inference engine for serving LLMs at scale. It continuously batches requests and keeps KV cache memory compact with PagedAttention.

Set `model_impl="transformers"` to load a model using the Transformers modeling backend.

```py
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
print(llm.generate(["The capital of France is"]))
```

Pass `--model-impl transformers` to the `vllm serve` command for online serving.

```bash
vllm serve meta-llama/Llama-3.2-1B \
--task generate \
--model-impl transformers
```
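
The server exposes an OpenAI-compatible API. The snippet below is a minimal sketch, assuming the `openai` Python package is installed and the `vllm serve` command above is running on the default port `8000`.

```py
from openai import OpenAI

# Assumes the vllm serve command above is running locally on port 8000.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="The capital of France is",
    max_tokens=20,
)
print(completion.choices[0].text)
```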

vLLM uses [`AutoConfig.from_pretrained`] to load a model's `config.json` file from the Hub or your Hugging Face cache. It checks the `architectures` field against its internal model registry to determine which vLLM model class to load. If the model isn't in the registry, vLLM calls [`AutoModel.from_config`] to load the Transformers model implementation.
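
You can inspect the field vLLM matches against directly on the loaded config.

```py
from transformers import AutoConfig

config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B")
# vLLM compares this list against its internal model registry, e.g. ['LlamaForCausalLM'].
print(config.architectures)
```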

Setting `model_impl="transformers"` bypasses the vLLM model registry and loads directly from Transformers. vLLM replaces most model modules (MoE, attention, linear, etc.) with its own optimized versions.

[`AutoTokenizer.from_pretrained`] loads tokenizer files. vLLM caches some tokenizer internals to reduce overhead during inference. Model weights are downloaded from the Hub in safetensors format.

## Resources

- The [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) have more usage examples and tips.
- [Integration with Hugging Face](https://docs.vllm.ai/en/latest/design/huggingface_integration/) explains how vLLM integrates with Transformers.