[docs] inference engines #42932
Open
stevhliu wants to merge 2 commits into huggingface:main from stevhliu:community-integrations
+248 −208
@@ -0,0 +1,52 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# SGLang

[SGLang](https://docs.sglang.ai) is a low-latency, high-throughput inference engine for large language models (LLMs). It also includes a frontend language for building agentic workflows.

Set `model_impl="transformers"` to load a Transformers modeling backend.

```py
import sglang as sgl

llm = sgl.Engine(model_path="meta-llama/Llama-3.2-1B-Instruct", model_impl="transformers")
print(llm.generate(["The capital of France is"], {"max_new_tokens": 20})[0])
```

Pass `--model-impl transformers` to the `sglang.launch_server` command for online serving.

```bash
python3 -m sglang.launch_server \
  --model-path meta-llama/Llama-3.2-1B-Instruct \
  --model-impl transformers \
  --host 0.0.0.0 \
  --port 30000
```
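
Once the server is up, you can send it requests over HTTP. The snippet below is a minimal sketch against SGLang's native `/generate` endpoint, assuming the host, port, and sampling parameters from the launch command above:

```py
import requests

# Query the native /generate endpoint of the server launched above.
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {"max_new_tokens": 20},
    },
)
print(response.json())
```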

Setting `model_impl="transformers"` tells SGLang to skip its native model matching and use the `TransformersModel` backend instead. [`PretrainedConfig.from_pretrained`] loads the config and [`AutoModel.from_config`] resolves the model class.

During loading, `_attn_implementation` is set to `"sglang"`, which routes attention calls through SGLang: RadixAttention kernels replace the standard attention layers, and SGLang's parallel linear classes replace linear layers to support tensor parallelism. As a result, the model benefits from all of SGLang's optimizations.

> [!WARNING]
> Compatible models require `_supports_attention_backend=True` so SGLang can control attention execution. See the [Building a compatible model backend for inference](./transformers_as_backend#model-implementation) guide for details.

The [load_weights](https://github.com/sgl-project/sglang/blob/main/python/sglang/srt/models/transformers.py#L277) function populates the model with weights.

## Resources

- [SGLang docs](https://docs.sglang.ai/supported_models/transformers_fallback.html) has more usage examples and tips for using Transformers as a backend.
- [Transformers backend integration in SGLang](https://huggingface.co/blog/transformers-backend-sglang) blog post explains what this integration enables.
docs/source/en/community_integrations/transformers_as_backend.md — 140 additions & 0 deletions

@@ -0,0 +1,140 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# Building a compatible model backend for inference

Transformers models are compatible with inference engines like [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://docs.sglang.ai). Use the same Transformers model anywhere and avoid reimplementing a model from scratch for each inference engine. Models in Transformers that aren't natively supported by either inference engine work too.

This guide shows you how to implement a model in Transformers that works as a backend for any inference engine.

## Model implementation

1. Follow the model [contribution guidelines](./add_new_model) or the [custom model contribution guidelines](./custom_models). The model must have a valid `config.json` in its directory and a valid `auto_map` field pointing to the model class in the config.
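
Expand the code below for a sketch of what the `auto_map` entry looks like. The file and class names are hypothetical; each entry maps an Auto class to `"<module_file>.<ClassName>"` inside the model repository.

<details>
<summary>config.json</summary>

```python
# Hypothetical excerpt of a custom model's config.json, shown here as a Python dict.
# Each auto_map entry points an Auto class at custom code inside the model repo.
config_json = {
    "model_type": "my_model",
    "architectures": ["MyModel"],
    "auto_map": {
        "AutoConfig": "configuration_my_model.MyConfig",
        "AutoModel": "modeling_my_model.MyModel",
        "AutoModelForCausalLM": "modeling_my_model.MyModelForCausalLM",
    },
}
```

</details>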

2. Use the [`AttentionInterface`] class for custom and optimized attention functions. This interface unlocks each inference engine's performance features.

Use `ALL_ATTENTION_FUNCTIONS` when defining the attention layer and propagate `**kwargs` from the base `MyModel` class to the attention layers. Set `_supports_attention_backend` to `True` in [`PreTrainedModel`].

Expand the code below for an example.

<details>
<summary>modeling_my_model.py</summary>

```python
from torch import nn
from transformers import PreTrainedModel
from transformers.modeling_utils import ALL_ATTENTION_FUNCTIONS


class MyAttention(nn.Module):

    def forward(self, hidden_states, **kwargs):
        ...
        # Look up the attention function selected at load time
        # (set by the inference engine via config._attn_implementation).
        attention_interface = ALL_ATTENTION_FUNCTIONS[self.config._attn_implementation]
        attn_output, attn_weights = attention_interface(
            self,
            query_states,
            key_states,
            value_states,
            **kwargs,
        )
        ...


class MyModel(PreTrainedModel):
    _supports_attention_backend = True
```

</details>

3. Enable optional tensor or pipeline parallelism by adding the following keys to [`PreTrainedConfig`].

* `base_model_tp_plan` enables [tensor parallelism](./perf_infer_gpu_multi) by mapping fully qualified layer name patterns to tensor parallel styles. Supports only the `"colwise"` and `"rowwise"` partitioning strategies.
* `base_model_pp_plan` enables pipeline parallelism by mapping direct child layer names to tuples of lists of strings. The first element of the tuple contains the names of the input arguments. The last element contains the variable names of the layer outputs in the modeling code.

Expand the code below for an example.

<details>
<summary>configuration_my_model.py</summary>

```python
from transformers import PreTrainedConfig


class MyConfig(PreTrainedConfig):
    base_model_tp_plan = {
        "layers.*.self_attn.q_proj": "colwise",
        "layers.*.self_attn.k_proj": "colwise",
        "layers.*.self_attn.v_proj": "colwise",
        "layers.*.self_attn.o_proj": "rowwise",
        "layers.*.mlp.gate_proj": "colwise",
        "layers.*.mlp.up_proj": "colwise",
        "layers.*.mlp.down_proj": "rowwise",
    }
    base_model_pp_plan = {
        "embed_tokens": (["input_ids"], ["inputs_embeds"]),
        "layers": (["hidden_states", "attention_mask"], ["hidden_states"]),
        "norm": (["hidden_states"], ["hidden_states"]),
    }
```

</details>
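
With these plans defined, Transformers itself can shard the model, and inference engines can reuse the same mapping. The snippet below is a minimal sketch assuming a hypothetical checkpoint published with the config above; `tp_plan="auto"` picks up `base_model_tp_plan` when the script is launched under `torchrun`.

```python
# Minimal sketch: run with `torchrun --nproc-per-node 2 infer_tp.py`.
import torch
from transformers import AutoModelForCausalLM

# "my-org/my-model" is a hypothetical repo that ships MyConfig/MyModel from above.
model = AutoModelForCausalLM.from_pretrained(
    "my-org/my-model",
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shard weights according to base_model_tp_plan
)
```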

## Multimodal models

Multimodal models require additional changes beyond the [vision language model contribution checklist](./contributing#vision-language-model-contribution-checklist). These changes ensure multimodal inputs are properly processed.

1. The [`ProcessorMixin`] class must include the `self.image_token` and `self.image_token_id` attributes. These placeholder tokens indicate image positions in the input. The same token appears in the input prompt for images and in the model code to scatter image features.

2. The [`ProcessorMixin`] class must include a `self._get_num_multimodal_tokens` method. This method computes the number of placeholder tokens required for multimodal inputs with given sizes and returns a [`MultiModalData`] object. Placeholders between `<image>` tokens, such as row or column tokens, don't count as image placeholders. Count only the tokens that are later replaced by image features in the modeling code.

3. The [`ProcessorMixin`] class must check the value of `return_mm_token_type_ids` and return `mm_token_type_ids`. This indicates whether each position is a text token (`0`), an image placeholder token (`1`), or a video placeholder token (`2`). Multimodal token type id sequences must be contiguous, with no breaks between consecutive tokens, so treat special tokens for beginning, ending, row, and column positions as placeholders too.

Expand the code below for an example.

<details>
<summary>processing_my_multimodal_model.py</summary>

```python
import numpy as np
from transformers import BatchFeature, ProcessorMixin
from transformers.processing_utils import MultiModalData


class MyMultimodalProcessor(ProcessorMixin):

    def __call__(self, images=None, text=None, **kwargs):
        ...
        # Mark which positions hold image placeholder tokens (1) vs. text (0).
        if return_mm_token_type_ids:
            mm_token_type_ids = np.zeros_like(input_ids)
            mm_token_type_ids[input_ids == self.image_token_id] = 1
            text_inputs["mm_token_type_ids"] = mm_token_type_ids.tolist()
        return BatchFeature(data={**text_inputs, **image_inputs}, tensor_type=return_tensors)

    def _get_num_multimodal_tokens(self, image_sizes=None, **kwargs):
        """
        Computes the number of placeholder tokens needed for multimodal inputs with the given sizes.

        Args:
            image_sizes (`list[list[int]]`, *optional*):
                The input sizes formatted as (height, width) per each image.

        Returns:
            `MultiModalData`: A `MultiModalData` object holding the number of tokens per each of the
            provided input modalities, along with other useful data.
        """
        vision_data = {}
        if image_sizes is not None:
            num_image_tokens = [256] * len(image_sizes)  # always 256 placeholder tokens per image
            num_image_patches = [1] * len(image_sizes)  # no patching, each image is processed as a single base image
            vision_data.update({"num_image_tokens": num_image_tokens, "num_image_patches": num_image_patches})
        return MultiModalData(**vision_data)
```

</details>
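
As a rough usage sketch (illustrative only, since the processor above is elided), calling the method with two image sizes returns the per-image token counts:

```python
# Illustrative only: assumes a fully implemented MyMultimodalProcessor instance.
mm_data = processor._get_num_multimodal_tokens(image_sizes=[(224, 224), (896, 448)])
print(mm_data.num_image_tokens)   # [256, 256] with the fixed-count scheme above
print(mm_data.num_image_patches)  # [1, 1] because no patching is applied
```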

## Resources

* Read the [Transformers backend integration in vLLM](https://blog.vllm.ai/2025/04/11/transformers-backend.html) blog post for more details.
* Read the [Transformers backend integration in SGLang](https://huggingface.co/blog/transformers-backend-sglang) blog post for more details.
@@ -0,0 +1,47 @@
<!--Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.

⚠️ Note that this file is in Markdown but contains specific syntax for our doc-builder (similar to MDX) that may not be
rendered properly in your Markdown viewer.

-->

# vLLM

> **Review comment (stevhliu, author):** @hmellor, would you mind taking a look at the vLLM section please? 🙂

[vLLM](https://github.com/vllm-project/vllm) is a high-throughput inference engine for serving LLMs at scale. It continuously batches requests and keeps KV cache memory compact with PagedAttention.

Set `model_impl="transformers"` to load a model using the Transformers modeling backend.

```py
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
print(llm.generate(["The capital of France is"]))
```
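
For more control over sampling, and to print only the completion text, pass a `SamplingParams` object to `generate` and read the text from the returned `RequestOutput` objects:

```py
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")
outputs = llm.generate(
    ["The capital of France is"],
    SamplingParams(max_tokens=20, temperature=0.8),
)
print(outputs[0].outputs[0].text)  # first completion of the first prompt
```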

Pass `--model-impl transformers` to the `vllm serve` command for online serving.

```bash
vllm serve meta-llama/Llama-3.2-1B \
  --task generate \
  --model-impl transformers
```
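
`vllm serve` exposes an OpenAI-compatible API, by default at `http://localhost:8000/v1`. The snippet below is a minimal client sketch; the placeholder API key is an assumption (it's unused unless the server is started with one).

```py
from openai import OpenAI

# Point the OpenAI client at the local vLLM server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.completions.create(
    model="meta-llama/Llama-3.2-1B",
    prompt="The capital of France is",
    max_tokens=20,
)
print(completion.choices[0].text)
```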

vLLM uses [`AutoConfig.from_pretrained`] to load a model's `config.json` file from the Hub or your Hugging Face cache. It checks the `architectures` field against its internal model registry to determine which vLLM model class to load. If the model isn't in the registry, vLLM calls [`AutoModel.from_config`] to load the Transformers model implementation.

Setting `model_impl="transformers"` bypasses the vLLM model registry and loads directly from Transformers. vLLM replaces most model modules (MoE, attention, linear, etc.) with its own optimized versions.

[`AutoTokenizer.from_pretrained`] loads the tokenizer files, and vLLM caches some tokenizer internals to reduce overhead during inference. Model weights are downloaded from the Hub in safetensors format.

## Resources

- [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) for more usage examples and tips.
- [Integration with Hugging Face](https://docs.vllm.ai/en/latest/design/huggingface_integration/) explains how vLLM integrates with Transformers.
> **Review comment:** maybe @SunMarc can take a look at the SGLang section please 🙂