docs/source/en/model_doc/llava.md (+37 -1)

@@ -40,7 +40,42 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/
- Note the model has not been explicitly trained to process multiple images in the same prompt. Although this is technically possible, you may experience inaccurate results.
- - For better results, we recommend users to prompt the model with the correct prompt format. Below is a list of prompt formats accepted by each llava checkpoint:
+ - For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. To do that, construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history is a dictionary with the keys "role" and "content", and the "content" should be a list of dictionaries for the "text" and "image" modalities, as follows:
…
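A minimal sketch of that flow, assuming the `llava-hf/llava-1.5-7b-hf` checkpoint (any LLaVA checkpoint that ships a chat template works the same way, and the exact prompt string depends on that checkpoint's template):

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")  # example checkpoint

# One user turn containing an image and a question; "content" is a list of
# modality dicts rather than a plain string.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    },
]

# apply_chat_template() only formats the text prompt; the image still has to go
# through the processor separately to produce pixel values.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(text_prompt)
# For llava-1.5 checkpoints this prints a prompt of the form "USER: <image>\n<prompt> ASSISTANT:"
```

The resulting string is then passed to the processor together with the image(s) to build the actual model inputs.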
Flash Attention 2 is an even faster, optimized version of the previous optimization; please refer to the [Flash Attention 2 section of performance docs](https://huggingface.co/docs/transformers/perf_infer_gpu_one).
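On recent `transformers` versions it can be enabled when loading the model. A rough sketch, assuming `flash-attn` is installed and a supported GPU is available (the checkpoint name is only an example):

```python
import torch
from transformers import LlavaForConditionalGeneration

# Load the example checkpoint in half precision with Flash Attention 2 enabled.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
```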

docs/source/en/model_doc/llava_next.md (+86 -10)

@@ -46,26 +46,61 @@ The original code can be found [here](https://github.com/haotian-liu/LLaVA/tree/
- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating.
- - Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. Below, we list the correct prompt formats to use for the text prompt "What is shown in this image?":
+ - Note that each checkpoint has been trained with a specific prompt format, depending on which large language model (LLM) was used. You can use the processor's `apply_chat_template` to format your prompts correctly. To do that, construct a conversation history; passing a plain string will not format your prompt. Each message in the conversation history is a dictionary with the keys "role" and "content", and the "content" should be a list of dictionaries for the "text" and "image" modalities. Below is an example of how to do that, along with the list of formats accepted by each checkpoint.
- [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) requires the following format:
+ We will use [llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) and a conversation history of text and image. Each content field has to be a list of dicts, as follows:
…
+ # Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
+ print(text_prompt)
+ >>> "[INST] <image>\nWhat's shown in this image? [/INST] This image shows a red stop sign. [INST] Describe the image in more details. [/INST]"
+ ```
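For orientation, a minimal, self-contained sketch that produces the `text_prompt` shown above, assuming the `llava-hf/llava-v1.6-mistral-7b-hf` processor and the conversation implied by that output:

```python
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")

# Multi-turn conversation: the user asks about an image, the assistant answers,
# and the user asks a follow-up question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's shown in this image?"},
        ],
    },
    {
        "role": "assistant",
        "content": [{"type": "text", "text": "This image shows a red stop sign."}],
    },
    {
        "role": "user",
        "content": [{"type": "text", "text": "Describe the image in more details."}],
    },
]

# The template only formats the text; images are processed separately to get pixel values.
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
print(text_prompt)
```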
+ - If you want to construct a chat prompt yourself, below is a list of possible formats.

[llava-v1.6-mistral-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) requires the following format:
```bash
"[INST] <image>\nWhat is shown in this image? [/INST]"
```

[llava-v1.6-vicuna-7b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-7b-hf) and [llava-v1.6-vicuna-13b-hf](https://huggingface.co/llava-hf/llava-v1.6-vicuna-13b-hf) require the following format:

```bash
"A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER: <image>\nWhat is shown in this image? ASSISTANT:"
```

[llava-v1.6-34b-hf](https://huggingface.co/llava-hf/llava-v1.6-34b-hf) requires the following format:

```bash
"<|im_start|>system\nAnswer the questions.<|im_end|><|im_start|>user\n<image>\nWhat is shown in this image?<|im_end|><|im_start|>assistant\n"
```

## Usage example

### Single image inference
@@ -86,8 +121,17 @@ model.to("cuda:0")
# prepare image and text prompt, using the appropriate prompt template
…
- # Prepare a batched prompt, where the first one is a multi-turn conversation and the second is not
- prompt = [
-     "[INST] <image>\nWhat is shown in this image? [/INST] There is a red stop sign in the image. [INST] <image>\nWhat about this image? How many cats do you see [/INST]",
-     "[INST] <image>\nWhat is shown in this image? [/INST]"
+ # Prepare a batch of two prompts, where the first one is a multi-turn conversation and the second is not
+ conversation_1 = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "What is shown in this image?"},
+         ],
+     },
+     {
+         "role": "assistant",
+         "content": [
+             {"type": "text", "text": "There is a red stop sign in the image."},
+         ],
+     },
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "What about this image? How many cats do you see?"},
+         ],
+     },
]

+ conversation_2 = [
+     {
+         "role": "user",
+         "content": [
+             {"type": "image"},
+             {"type": "text", "text": "What is shown in this image?"},
+         ],
+     },
+ ]

docs/source/en/model_doc/vipllava.md (+41 -7)

@@ -26,30 +26,64 @@ The abstract from the paper is the following:
*While existing large vision-language multimodal models focus on whole image understanding, there is a prominent gap in achieving region-specific comprehension. Current approaches that use textual coordinates or spatial encodings often fail to provide a user-friendly interface for visual prompting. To address this challenge, we introduce a novel multimodal model capable of decoding arbitrary visual prompts. This allows users to intuitively mark images and interact with the model using natural cues like a "red bounding box" or "pointed arrow". Our simple design directly overlays visual markers onto the RGB image, eliminating the need for complex region encodings, yet achieves state-of-the-art performance on region-understanding tasks like Visual7W, PointQA, and Visual Commonsense Reasoning benchmark. Furthermore, we present ViP-Bench, a comprehensive benchmark to assess the capability of models in understanding visual prompts across multiple dimensions, enabling future research in this domain. Code, data, and model are publicly available.*
- Tips:
+ The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).

+ This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)

+ ## Usage tips:
- The architecture is similar to the LLaVA architecture, except that the multi-modal projector takes a set of concatenated vision hidden states and has an additional layernorm layer on that module.
- We advise users to use `padding_side="left"` when computing batched generation as it leads to more accurate results. Simply make sure to call `processor.tokenizer.padding_side = "left"` before generating.
- Note the model has not been explicitly trained to process multiple images in the same prompt. Although this is technically possible, you may experience inaccurate results.
- - For better results, we recommend users to prompt the model with the correct prompt format:
+ - For better results, we recommend using the processor's `apply_chat_template()` method to format your prompt correctly. To do that, construct a conversation history; passing in a plain string will not format your prompt. Each message in the conversation history is a dictionary with the keys "role" and "content", and the "content" should be a list of dictionaries for the "text" and "image" modalities, as follows:
…
+ # Note that the template simply formats your prompt, you still have to tokenize it and obtain pixel values for your images
+ print(text_prompt)
+ >>> "###Human: <image>\nWhat’s shown in this image?###Assistant: This image shows a red stop sign.###Human: Describe the image in more details.###Assistant:"
+ ```
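A fuller sketch of the whole flow, from chat template to generation, assuming the `llava-hf/vip-llava-7b-hf` checkpoint and an example stop-sign image URL:

```python
import requests
import torch
from PIL import Image
from transformers import AutoProcessor, VipLlavaForConditionalGeneration

model_id = "llava-hf/vip-llava-7b-hf"  # example checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = VipLlavaForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16, device_map="auto")

# Single-turn conversation with one image and one question.
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "What's shown in this image?"},
        ],
    },
]
text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

# The template only formats the text; the image goes through the processor to get pixel values.
image = Image.open(requests.get("https://www.ilankelman.org/stopsigns/australia.jpg", stream=True).raw)
inputs = processor(images=image, text=text_prompt, return_tensors="pt").to(model.device, torch.float16)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0], skip_special_tokens=True))
```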
+ - If you want to construct a chat prompt yourself, below is a list of prompt formats accepted by ViP-LLaVA checkpoints:
```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt>###Assistant:
```

For a multi-turn conversation:

```bash
A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.###Human: <image>\n<prompt1>###Assistant: <answer1>###Human: <prompt2>###Assistant:
```
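If you do build the prompt by hand, it is just a plain string in the format above; a minimal sketch with an arbitrary example question:

```python
question = "What is shown in this image?"  # arbitrary example question

# Single-turn ViP-LLaVA prompt following the format listed above; the resulting string
# is what you would pass to the processor in place of a templated prompt.
prompt = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions."
    f"###Human: <image>\n{question}###Assistant:"
)
print(prompt)
```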
- The original code can be found [here](https://github.com/mu-cai/ViP-LLaVA).
- This model was contributed by [Younes Belkada](https://huggingface.co/ybelkada)
0 commit comments