Commit 1bc3b5e

[VLM] Separate text-only and vision variants of the same model architecture (#13157)
1 parent 02ed8a1 commit 1bc3b5e

File tree

15 files changed: +1729 −1643 lines


docs/source/models/supported_models.md

Lines changed: 8 additions & 9 deletions

```diff
@@ -699,10 +699,10 @@ See [this page](#generative-models) for more information on how to use generativ
   *
   * ✅︎
   * ✅︎
-- * `DeepseekVLV2ForCausalLM`
+- * `DeepseekVLV2ForCausalLM`<sup>^</sup>
   * DeepSeek-VL2
   * T + I<sup>+</sup>
-  * `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc. (see note)
+  * `deepseek-ai/deepseek-vl2-tiny`, `deepseek-ai/deepseek-vl2-small`, `deepseek-ai/deepseek-vl2` etc.
   *
   * ✅︎
   * ✅︎
@@ -713,10 +713,10 @@ See [this page](#generative-models) for more information on how to use generativ
   *
   * ✅︎
   * ✅︎
-- * `ChatGLMModel`
+- * `GLM4VForCausalLM`<sup>^</sup>
   * GLM-4V
   * T + I
-  * `THUDM/glm-4v-9b` etc.
+  * `THUDM/glm-4v-9b`, `THUDM/cogagent-9b-20241220` etc.
   * ✅︎
   * ✅︎
   * ✅︎
@@ -825,7 +825,7 @@ See [this page](#generative-models) for more information on how to use generativ
   *
   * ✅︎
   * ✅︎
-- * `QWenLMHeadModel`
+- * `QwenVLForConditionalGeneration`<sup>^</sup>
   * Qwen-VL
   * T + I<sup>E+</sup>
   * `Qwen/Qwen-VL`, `Qwen/Qwen-VL-Chat`, etc.
@@ -862,13 +862,12 @@ See [this page](#generative-models) for more information on how to use generativ
 * ✅︎
 :::
 
+<sup>^</sup> You need to set the architecture name via `--hf-overrides` to match the one in vLLM.
+&nbsp;&nbsp;&nbsp;&nbsp;• For example, to use DeepSeek-VL2 series models:
+&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;`--hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'`
 <sup>E</sup> Pre-computed embeddings can be inputted for this modality.
 <sup>+</sup> Multiple items can be inputted per text prompt for this modality.
 
-:::{note}
-To use DeepSeek-VL2 series models, you have to pass `--hf_overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'` when running vLLM.
-:::
-
 :::{note}
 H2O-VL series models will be available in V1 once we support backends other than FlashAttention.
 :::
```
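The new docs footnote gives the override as a JSON string for the CLI, while the example scripts pass a Python dict via `hf_overrides=`. A minimal standard-library sketch of how the two forms relate (the flag string is taken from the footnote; everything else is illustrative):

```python
import json

# The same override can be expressed two ways: as a Python dict passed to
# LLM(hf_overrides=...), or as a JSON string passed to --hf-overrides.
overrides = {"architectures": ["DeepseekVLV2ForCausalLM"]}

# Serializing the dict yields the string shown in the docs footnote.
flag = f"--hf-overrides '{json.dumps(overrides)}'"
print(flag)
# --hf-overrides '{"architectures": ["DeepseekVLV2ForCausalLM"]}'
```

Round-tripping through `json.loads` recovers the original dict, so the two forms are interchangeable.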

examples/offline_inference/vision_language.py

Lines changed: 3 additions & 0 deletions

```diff
@@ -105,7 +105,9 @@ def run_glm4v(question: str, modality: str):
               max_num_seqs=2,
               trust_remote_code=True,
               enforce_eager=True,
+              hf_overrides={"architectures": ["GLM4VForCausalLM"]},
               disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache)
+
     prompt = f"<|user|>\n<|begin_of_image|><|endoftext|><|end_of_image|>\
 {question}<|assistant|>"
 
@@ -495,6 +497,7 @@ def run_qwen_vl(question: str, modality: str):
         trust_remote_code=True,
         max_model_len=1024,
         max_num_seqs=2,
+        hf_overrides={"architectures": ["QwenVLForConditionalGeneration"]},
         disable_mm_preprocessor_cache=args.disable_mm_preprocessor_cache,
     )
 
```
examples/offline_inference/vision_language_multi_image.py

Lines changed: 3 additions & 2 deletions

```diff
@@ -77,7 +77,7 @@ def load_deepseek_vl2(question: str, image_urls: List[str]):
     )
 
 
-def load_h2onvl(question: str, image_urls: List[str]) -> ModelRequestData:
+def load_h2ovl(question: str, image_urls: List[str]) -> ModelRequestData:
     model_name = "h2oai/h2ovl-mississippi-2b"
 
     llm = LLM(
@@ -302,6 +302,7 @@ def load_qwen_vl_chat(question: str,
         trust_remote_code=True,
         max_model_len=1024,
         max_num_seqs=2,
+        hf_overrides={"architectures": ["QwenVLForConditionalGeneration"]},
         limit_mm_per_prompt={"image": len(image_urls)},
     )
     placeholders = "".join(f"Picture {i}: <img></img>\n"
@@ -452,7 +453,7 @@ def load_qwen2_5_vl(question, image_urls: List[str]) -> ModelRequestData:
 model_example_map = {
     "aria": load_aria,
     "deepseek_vl_v2": load_deepseek_vl2,
-    "h2ovl_chat": load_h2onvl,
+    "h2ovl_chat": load_h2ovl,
     "idefics3": load_idefics3,
     "internvl_chat": load_internvl,
     "mllama": load_mllama,
```
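The example-script diffs above all make the same change: the `LLM(...)` call gains an `hf_overrides` keyword so vLLM loads the vision variant of the architecture rather than the text-only one named in the checkpoint's HF config. A sketch of that call pattern using a hypothetical helper (`build_llm_kwargs` is not part of vLLM; the model name and keyword values mirror `run_qwen_vl` from the diff):

```python
# Hypothetical helper illustrating the keyword arguments the diffs above
# add; in a real run these would be passed to vllm.LLM(**kwargs).
def build_llm_kwargs(model: str, architecture: str) -> dict:
    return {
        "model": model,
        "trust_remote_code": True,
        "max_model_len": 1024,
        "max_num_seqs": 2,
        # Override the architecture name reported by the HF config so that
        # vLLM selects the vision variant of the model.
        "hf_overrides": {"architectures": [architecture]},
    }

kwargs = build_llm_kwargs("Qwen/Qwen-VL", "QwenVLForConditionalGeneration")
print(kwargs["hf_overrides"])
# {'architectures': ['QwenVLForConditionalGeneration']}
```

The override only replaces the `architectures` field; all other values from the checkpoint's config are left untouched.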
