docs/source/models/vlm.rst: 7 additions & 7 deletions
@@ -25,7 +25,7 @@ The :class:`~vllm.LLM` class can be instantiated in much the same way as languag
 To pass an image to the model, note the following in :class:`vllm.inputs.PromptType`:
 
 * ``prompt``: The prompt should follow the format that is documented on HuggingFace.
-* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
+* ``multi_modal_data``: This is a dictionary that follows the schema defined in :class:`vllm.multimodal.MultiModalDataDict`.
 
 .. code-block:: python
 
@@ -34,7 +34,7 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptT
 
     # Load the image using PIL.Image
     image = PIL.Image.open(...)
-
+
     # Single prompt inference
     outputs = llm.generate({
         "prompt": prompt,
@@ -68,7 +68,7 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptT
         "prompt": prompt,
         "multi_modal_data": mm_data,
     })
-
+
     for o in outputs:
         generated_text = o.outputs[0].text
         print(generated_text)
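The hunks above come from the single-image inference example in ``vlm.rst``. A minimal end-to-end sketch of that pattern is shown below; the LLaVA-1.5 model name, the prompt string, and the image path are illustrative assumptions, not part of this diff:

.. code-block:: python

    from PIL import Image

    from vllm import LLM

    # Assumed model for illustration; any vLLM-supported vision-language model
    # follows the same pattern.
    llm = LLM(model="llava-hf/llava-1.5-7b-hf")

    # The prompt should follow the format documented on the model's HuggingFace page.
    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Hypothetical image path, used here only as a placeholder.
    image = Image.open("example.jpg")

    # Single prompt inference; multi_modal_data follows vllm.multimodal.MultiModalDataDict.
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": image},
    })

    for o in outputs:
        print(o.outputs[0].text)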
@@ -116,7 +116,7 @@ Instead of passing in a single image, you can pass in a list of images.
 .. code-block:: python
 
     # Refer to the HuggingFace repo for the correct format to use
-    prompt = "<|user|>\n<image_1>\n<image_2>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
+    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"
 
     # Load the images using PIL.Image
     image1 = PIL.Image.open(...)
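The corrected ``<|image_1|>``/``<|image_2|>`` placeholders follow the Phi-3-vision prompt style. A minimal multi-image sketch built around that prompt is shown below; the model name, the ``limit_mm_per_prompt`` and ``max_model_len`` settings, and the image paths are assumptions for illustration only:

.. code-block:: python

    from PIL import Image

    from vllm import LLM

    # Assumed model and limits for illustration; multi-image input requires
    # raising the per-prompt image limit when the engine is constructed.
    llm = LLM(
        model="microsoft/Phi-3.5-vision-instruct",
        trust_remote_code=True,
        max_model_len=4096,
        limit_mm_per_prompt={"image": 2},
    )

    # Refer to the HuggingFace repo for the correct format to use
    prompt = "<|user|>\n<|image_1|>\n<|image_2|>\nWhat is the content of each image?<|end|>\n<|assistant|>\n"

    # Hypothetical image paths, placeholders only.
    image1 = Image.open("duck.jpg")
    image2 = Image.open("lion.jpg")

    # Pass the images as a list under the "image" key.
    outputs = llm.generate({
        "prompt": prompt,
        "multi_modal_data": {"image": [image1, image2]},
    })

    for o in outputs:
        print(o.outputs[0].text)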
@@ -135,11 +135,11 @@ Instead of passing in a single image, you can pass in a list of images.
 
 A code example can be found in `examples/offline_inference_vision_language_multi_image.py <https://github.com/vllm-project/vllm/blob/main/examples/offline_inference_vision_language_multi_image.py>`_.
 
-Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
+Multi-image input can be extended to perform video captioning. We show this with `Qwen2-VL <https://huggingface.co/Qwen/Qwen2-VL-2B-Instruct>`_ as it supports videos:
 
 .. code-block:: python
 
-    # Specify the maximum number of frames per video to be 4. This can be changed.
+    # Specify the maximum number of frames per video to be 4. This can be changed.
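The hunk ends where the video-captioning code block begins. One way the pattern can be completed, assuming the frames are passed to Qwen2-VL as base64-encoded images through ``LLM.chat`` (the ``encode_image`` helper and the frame paths are hypothetical, not taken from this diff):

.. code-block:: python

    import base64
    import io

    from PIL import Image

    from vllm import LLM


    def encode_image(frame: Image.Image) -> str:
        # Helper assumed for this sketch: encode a PIL image as a base64 JPEG string.
        buffer = io.BytesIO()
        frame.save(buffer, format="JPEG")
        return base64.b64encode(buffer.getvalue()).decode("utf-8")


    # Specify the maximum number of frames per video to be 4. This can be changed.
    llm = LLM("Qwen/Qwen2-VL-2B-Instruct", limit_mm_per_prompt={"image": 4})

    # Hypothetical frame files; load your video as at most 4 still frames.
    video_frames = [Image.open(f"frame_{i}.jpg") for i in range(4)]

    # Build an OpenAI-style chat message carrying the frames as image_url parts.
    message = {
        "role": "user",
        "content": [
            {"type": "text",
             "text": "Describe this set of frames. Consider the frames to be part of the same video."},
        ],
    }
    for frame in video_frames:
        message["content"].append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{encode_image(frame)}"},
        })

    outputs = llm.chat([message])
    for o in outputs:
        print(o.outputs[0].text)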