
Commit cd558f1

feedback
1 parent 31e7fab commit cd558f1

File tree

  • docs/source/en/community_integrations/vllm.md (+6, -6 lines)

docs/source/en/community_integrations/vllm.md

Lines changed: 6 additions & 6 deletions
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
 
 [vLLM](https://github.com/vllm-project/vllm) is a high-throughput inference engine for serving LLMs at scale. It continuously batches requests and keeps KV cache memory compact with PagedAttention.
 
-Set `model_impl="transformers"` to load a Transformers modeling backend.
+Set `model_impl="transformers"` to load a model using the Transformers modeling backend.
 
 ```py
 from vllm import LLM
@@ -35,13 +35,13 @@ vllm serve meta-llama/Llama-3.2-1B \
 --model-impl transformers
 ```
 
-vLLM fetches `config.json` from the Hub or your Hugging Face cache and reads `model_type` to load the config class. If the model type isn't in vLLM's internal registry, it falls back to [`AutoConfig.from_pretrained`]. With `model_impl="transformers"`, vLLM passes the config to [`~AutoModelForCausalLM.from_pretrained`] to instantiate the model instead of using vLLM's native model classes.
+vLLM uses [`AutoConfig.from_pretrained`] to load a model's `config.json` file from the Hub or your Hugging Face cache. It checks the `architectures` field against its internal model registry to determine which vLLM model class to load. If the model isn't in the registry, vLLM calls [`AutoModel.from_config`] to load the Transformers model implementation.
 
-[`AutoTokenizer.from_pretrained`] loads tokenizer files. Some tokenizer internals are cached to reduce repeated overhead during inference. Model weights download from the Hub in safetensors format.
+Setting `model_impl="transformers"` bypasses the vLLM model registry and loads directly from Transformers. vLLM replaces most model modules (MoE, attention, linear, etc.) with its own optimized versions.
 
-Transformers handles the model and forward pass. vLLM provides the high-throughput serving stack.
+[`AutoTokenizer.from_pretrained`] loads tokenizer files. vLLM caches some tokenizer internals to reduce overhead during inference. Model weights download from the Hub in safetensors format.
 
 ## Resources
 
-- Refer to the [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) for more usage examples and tips for using a Transformers model as the backend.
-- Learn more about how vLLM integrates with Transformers in the [Integration with Hugging Face](https://docs.vllm.ai/en/latest/design/huggingface_integration/) doc.
+- [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) for more usage examples and tips.
+- [Integration with Hugging Face](https://docs.vllm.ai/en/latest/design/huggingface_integration/) explains how vLLM integrates with Transformers.
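
The doc's Python example is only partially visible in this diff's hunks. As a rough, self-contained sketch of offline inference through the Transformers backend (assuming vLLM's standard `LLM` and `SamplingParams` API; the exact snippet in vllm.md may differ):

```py
from vllm import LLM, SamplingParams

# Load the model through the Transformers modeling backend instead of a
# native vLLM model class (model name taken from the doc's serve example).
llm = LLM(model="meta-llama/Llama-3.2-1B", model_impl="transformers")

# Generate a short completion to confirm the backend works end to end.
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```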
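The new paragraph about `AutoConfig.from_pretrained` and the model registry can be illustrated with a small sketch. The registry set below is a hypothetical stand-in for vLLM's internal registry, not its real data structure:

```py
from transformers import AutoConfig, AutoModel

# Read config.json from the Hub (or the local Hugging Face cache).
config = AutoConfig.from_pretrained("meta-llama/Llama-3.2-1B")
print(config.architectures)  # e.g. ["LlamaForCausalLM"]

vllm_registry = {"LlamaForCausalLM"}  # hypothetical stand-in registry
if config.architectures[0] not in vllm_registry:
    # Fallback path: build the Transformers model implementation from the config.
    model = AutoModel.from_config(config)
```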
