docs/source/en/community_integrations/vllm.md
+6 −6 lines changed: 6 additions & 6 deletions
@@ -18,7 +18,7 @@ rendered properly in your Markdown viewer.
[vLLM](https://github.com/vllm-project/vllm) is a high-throughput inference engine for serving LLMs at scale. It continuously batches requests and keeps KV cache memory compact with PagedAttention.
-Set `model_impl="transformers"` to load a Transformers modeling backend.
+Set `model_impl="transformers"` to load a model using the Transformers modeling backend.
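A minimal offline-inference sketch of the option this line sets; the model ID and sampling values are placeholders, not part of the doc:

```python
# Illustrative sketch: run a Hub model through the Transformers backend.
# The model ID and sampling settings below are examples only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.2-1B-Instruct",  # any Hub model with a Transformers implementation
    model_impl="transformers",                 # skip vLLM's native model classes
)

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["The capital of France is"], params)
print(outputs[0].outputs[0].text)
```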
-vLLM fetches `config.json` from the Hub or your Hugging Face cache and reads `model_type` to load the config class. If the model type isn't in vLLM's internal registry, it falls back to [`AutoConfig.from_pretrained`]. With `model_impl="transformers"`, vLLM passes the config to [`~AutoModelForCausalLM.from_pretrained`] to instantiate the model instead of using vLLM's native model classes.
+vLLM uses [`AutoConfig.from_pretrained`] to load a model's `config.json` file from the Hub or your Hugging Face cache. It checks the `architectures` field against its internal model registry to determine which vLLM model class to load. If the model isn't in the registry, vLLM calls [`AutoModel.from_config`] to load the Transformers model implementation.
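The lookup order the new line describes, as a rough sketch; `resolve_model` and `vllm_registry` are hypothetical names, not vLLM's actual internals:

```python
# Hypothetical sketch of the resolution order described above.
from transformers import AutoConfig, AutoModel

def resolve_model(model_id: str, vllm_registry: dict):
    config = AutoConfig.from_pretrained(model_id)  # reads config.json from the Hub or local cache
    for arch in config.architectures or []:
        if arch in vllm_registry:                  # known architecture -> native vLLM model class
            return vllm_registry[arch](config)
    # Architecture not in the registry -> fall back to the Transformers implementation
    return AutoModel.from_config(config)
```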
-[`AutoTokenizer.from_pretrained`] loads tokenizer files. Some tokenizer internals are cached to reduce repeated overhead during inference. Model weights download from the Hub in safetensors format.
+Setting `model_impl="transformers"` bypasses the vLLM model registry and loads directly from Transformers. vLLM replaces most model modules (MoE, attention, linear, etc.) with its own optimized versions.

-Transformers handles the model and forward pass. vLLM provides the high-throughput serving stack.
+[`AutoTokenizer.from_pretrained`] loads tokenizer files. vLLM caches some tokenizer internals to reduce overhead during inference. Model weights download from the Hub in safetensors format.
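A bare sketch of that tokenizer load, with an illustrative model ID:

```python
# Example only: the tokenizer load the line above refers to.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
prompt_ids = tokenizer("The capital of France is").input_ids  # token IDs handed to the engine
```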
## Resources
-- Refer to the [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) for more usage examples and tips for using a Transformers model as the backend.
-- Learn more about how vLLM integrates with Transformers in the [Integration with Hugging Face](https://docs.vllm.ai/en/latest/design/huggingface_integration/) doc.
+- [vLLM docs](https://docs.vllm.ai/en/latest/models/supported_models.html#transformers) for more usage examples and tips.
+- [Integration with Hugging Face](https://docs.vllm.ai/en/latest/design/huggingface_integration/) explains how vLLM integrates with Transformers.