
[Model][VLM] Add Qwen2-VL model support #7905

Merged (44 commits) on Sep 11, 2024

Conversation

fyabc (Contributor) commented Aug 27, 2024

This PR adds support for the Qwen2-VL model.

FIX #8139
FIX #8281

Requirements

  • This PR requires transformers with this PR merged and this bugfix PR merged (you can install it via pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830).
  • NOTE: the latest released transformers version has a bug, so for now you should install the development version above.
  • For transformers>=4.45, please install vllm>=0.6.3, or install vLLM from source.

Optional Requirements

  • When constructing LLM inputs, we recommend using our helper package qwen-vl-utils to preprocess multimodal content correctly (qwen-vl-utils is not a part of this PR).

Example Usage

from PIL import Image
from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

MODEL_PATH = 'Qwen/Qwen2-VL-7B-Instruct'
IMAGE_PATH = '/path/to/image.jpg'
VIDEO_PATH = '/path/to/video.mp4'

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={'image': 10, 'video': 10},
)

sampling_params = SamplingParams(
    temperature=0.1, top_p=0.001, repetition_penalty=1.05, max_tokens=256,
    stop_token_ids=[],
)

messages = [
    {'role': 'system', 'content': 'You are a helpful assistant.'},
    {'role': 'user', 'content': [
        {
            'type': 'image',
            'image': IMAGE_PATH,

            # min_pixels & max_pixels are optional
            'max_pixels': 12845056,
        },

        # You can also pass one or more videos:
        # {
        #     'type': 'video',
        #     'video': VIDEO_PATH,
        # }

        {
            'type': 'text',
            'text': 'What does this diagram illustrate?',
        },
    ]},
]

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True,
)
image_inputs, video_inputs = process_vision_info(messages)

mm_data = {}
if image_inputs is not None:
    mm_data['image'] = image_inputs
if video_inputs is not None:
    mm_data['video'] = video_inputs

llm_inputs = {
    'prompt': prompt,
    'multi_modal_data': mm_data,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

Notes

Here are some important notes about this PR:

  1. Qwen2-VL uses rotary embedding with multimodal sections (mrope) (see vllm/model_executor/layers/rotary_embedding.py for more details). This rotary embedding requires the input positions to be a tensor of shape (3, seq_len), instead of (seq_len,) as in the common case (see the sketch after this list).

    1. To support this feature, we add a new _mrope_position_delta attribute (of type Optional[int]) to vllm.sequence.SequenceData; this attribute is used to compute mrope_input_positions in each decoding step. (If reviewers have a better solution, please comment in this PR.)
    2. We also change model_runner.py to compute the mrope_input_positions when the model uses mrope. Other model runners should also follow this logic; I think this can be done in another PR (I will add that part here if reviewers think it needs to be implemented in this PR).
  2. Qwen2-VL uses flash-attn==2.6.1 (instead of vllm-flash-attn==2.6.1) to compute vision attention (see the commented line 36 in vllm/model_executor/models/qwen2_vl.py). The current vllm-flash-attn version outputs NaN logit values, and I am still debugging this bug.

    1. UPDATE 2024.09.06: Added an xformers backend as a fallback implementation of Qwen2VisionAttention, so there is no need to add flash-attn to the project requirements file.
  3. Qwen2-VL supports both image and video inputs. To support this feature, we add a video multimodal plugin (see vllm/multimodal/video.py for more details).

  4. OpenAI-compatible server

    1. Currently, vllm.entrypoints.openai.api_server uses a model-independent multimodal data fetcher (e.g. vllm.multimodal.utils.async_get_and_parse_image), so the vision smart-resizing logic in qwen-vl-utils cannot be applied yet. I think it's good to create another PR to fix this later.
  5. Multiple modalities support details

    Since Qwen2-VL supports two modalities (images and videos), we need to handle some special cases, as shown below:

    # 1. A batch with two samples, sample 1 contains images, sample 2 contains videos
    llm.generate([
        {
            "prompt": "XXX",
            "multi_modal_data": {
                "image": ...
            }
        },
        {
            "prompt": "XXX",
            "multi_modal_data": {
                "video": ...
            }
        }
    ])
    
    # 2. A single sample with both images and videos
    llm.generate([
        {
            "prompt": "XXX",
            "multi_modal_data": {
                "image": ...,
                "video": ...
            }
        }
    ])

    So I removed the same-key check in the vllm.multimodal.base.MultiModalInputs.batch() method, since different samples may return different modality keys.
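
To make note 1 above concrete, here is a minimal sketch (not part of the PR) of the position layouts involved:

import torch

seq_len = 8
# Standard models use a single position id per token:
standard_positions = torch.arange(seq_len)            # shape: (seq_len,)
# mrope uses three position ids per token (temporal, height, width),
# so the model runner must supply a (3, seq_len) tensor instead:
mrope_positions = torch.arange(seq_len).repeat(3, 1)  # shape: (3, seq_len)
assert mrope_positions.shape == (3, seq_len)
# For pure-text tokens the three rows are identical; for vision tokens
# they encode the token's coordinates within the image/video grid.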


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only run the fastcheck CI, which consists of a small, essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

DarkLight1337 (Member) commented Aug 29, 2024

Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly.

In the meantime, can you fix the CI failures?

fyabc (Contributor, Author) commented Aug 29, 2024

> Thanks for implementing this (and sorry for the delayed response)! Since this PR not only introduces a new modality (video) but also involves the first model to accept multiple modalities (excluding text), I would like to merge #7559 first to verify that vLLM can handle video inputs properly.
>
> In the meantime, can you fix the CI failures?

[screenshot of mypy CI errors]
Hi @DarkLight1337, these mypy errors don't seem to belong to this PR; should I also fix them?

ywang96 (Member) left a comment

@fyabc Thank you for contributing to vLLM! I took a brief look and left a first round of review. Please take a look.

As @DarkLight1337 mentioned, we might want to wait for #7559 to be merged first because, since we're going to have a model that supports a mix of modalities, we want to be careful with API changes.

(8 review comments on vllm/model_executor/models/qwen2_vl.py, all marked resolved)
Comment on lines 626 to 658
# special processing for mrope position deltas.
if self.runner.model_is_mrope:
    image_grid_thw = mm_kwargs.get("image_grid_thw", None)
    video_grid_thw = mm_kwargs.get("video_grid_thw", None)
    assert image_grid_thw is not None or video_grid_thw is not None, \
        "mrope embedding type requires multi-modal input mapper returns 'image_grid_thw' or 'video_grid_thw'."

    hf_config = self.runner.model_config.hf_config

    from vllm.model_executor.layers.rotary_embedding import MRotaryEmbedding

    inter_data.mrope_input_positions = [None] * inter_data.n_seqs
    for seq_idx in range(inter_data.n_seqs):
        seq_data = seq_group_metadata.seq_data[
            inter_data.seq_ids[seq_idx]]
        token_ids = seq_data.get_token_ids()

        mrope_input_positions, mrope_position_delta = \
            MRotaryEmbedding.get_input_positions(
                token_ids,
                image_grid_thw=image_grid_thw,
                video_grid_thw=video_grid_thw,
                image_token_id=hf_config.image_token_id,
                video_token_id=hf_config.video_token_id,
                vision_start_token_id=hf_config.vision_start_token_id,
                vision_end_token_id=hf_config.vision_end_token_id,
                spatial_merge_size=hf_config.vision_config.spatial_merge_size,
                context_len=inter_data.context_lens[seq_idx],
            )

        seq_data.mrope_position_delta = mrope_position_delta
        inter_data.mrope_input_positions[seq_idx] = mrope_input_positions
Member:

I'm okay with us doing this at the model runner level, and I'm honestly not sure if there's a better place to apply mrope. What's your thought on this? @WoosukKwon

DarkLight1337 (Member):

> Hi @DarkLight1337, these mypy errors don't seem to belong to this PR; should I also fix them?

Can you merge from main first? It fixes some of the mypy errors, which might apply here.

fyabc (Contributor, Author) commented Aug 29, 2024

Hi @DarkLight1337 @ywang96, I have updated this PR based on your review comments; please check it again.
I also added some notes about multiple modalities in the PR overview.

DragonFive

@fyabc Hi, can this patch support multiple images in one prompt, like the following:

Compute the value of the expression in the image below <image_1>\nby using the emoji equations in the following images <image_2> <image_3> <image_4> <image_5> Only answer specific numerical values.

fyabc (Contributor, Author) commented Aug 30, 2024

> @fyabc Hi, can this patch support multiple images in one prompt, like the following:
>
> Compute the value of the expression in the image below <image_1>\nby using the emoji equations in the following images <image_2> <image_3> <image_4> <image_5> Only answer specific numerical values.

Hi @DragonFive , you can pass multiple images into a single prompt like this:

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "file:///path/to/image1.jpg"},
            {"type": "image", "image": "file:///path/to/image2.jpg"},
            {"type": "text", "text": "Identify the similarities between these images."},
        ],
    }
]

See "Multi image inference" section of our README for more details.

DarkLight1337 (Member) commented Sep 17, 2024

Qwen2VLConfig {
  "_name_or_path": "./docker/EraX-VL-7B/EraX-VL-7B",
  "architectures": [
    "Qwen2VLForConditionalGeneration"
  ],
  "attention_dropout": 0.0,
  "bos_token_id": 151643,
  "eos_token_id": 151645,
  "hidden_act": "silu",
  "hidden_size": 3584,
  "image_token_id": 151655,
  "initializer_range": 0.02,
  "intermediate_size": 18944,
  "max_position_embeddings": 32768,
  "max_window_layers": 28,
  "model_type": "qwen2_vl",
  "num_attention_heads": 28,
  "num_hidden_layers": 28,
  "num_key_value_heads": 4,
  "rms_norm_eps": 1e-06,
  "rope_scaling": {
    "mrope_section": [
      16,
      24,
      24
    ],
    "rope_type": "default",
    "type": "default"
  },
  "rope_theta": 1000000.0,
  "sliding_window": 32768,
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.45.0.dev0",
  "use_cache": true,
  "use_sliding_window": false,
  "video_token_id": 151656,
  "vision_config": {
    "in_chans": 3,
    "model_type": "qwen2_vl",
    "spatial_patch_size": 14
  },
  "vision_end_token_id": 151653,
  "vision_start_token_id": 151652,
  "vision_token_id": 151654,
  "vocab_size": 152064
}

This config is outdated. Compare it to the official one here. Notice the difference in rope_scaling.
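
At the time of writing, the official config uses the mrope rope type instead. The following fragment is reproduced from memory of the Hugging Face repo, so treat it as an assumption and check the linked config:

"rope_scaling": {
  "type": "mrope",
  "mrope_section": [16, 24, 24]
}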

thusinh1969

I got it. I applied the new config manually and copied over preprocessor.json, and it is working now.

Thanks,
Steve

thusinh1969 commented Sep 17, 2024

Ahhh, I was able to launch vLLM with Qwen2-VL-7B as:

CUDA_VISIBLE_DEVICES=3 python -m vllm.entrypoints.openai.api_server --limit-mm-per-prompt image=30 --host 0.0.0.0 --port 9999 --served-model-name EraX-VL-V1 --model ./EraX-VL-7B

I used the code from the Qwen2-VL repo (https://github.com/QwenLM/Qwen2-VL):

import cv2
import matplotlib.pyplot as plt
from PIL import Image

import uuid, base64

# Prepare base64 image
test_image1 = './samples/bill-1.png'

with open(test_image1, "rb") as f:
    encoded_image = base64.b64encode(f.read())

encoded_image_text = encoded_image.decode('utf-8')
base64_qwen = f"data:image;base64,{encoded_image_text}"

# Run
from openai import OpenAI

# Set OpenAI's API key and API base to use vLLM's API server.
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:9999/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

prompt = "What is the content of the image?"

chat_response = client.chat.completions.create(
    model="EraX-VL-V1",
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": base64_qwen,
                },
                {
                    "type": "text", 
                    "text": prompt
                },
            ],
        },
    ],
)

ERROR:

   1038         err.response.read()
   1040     log.debug("Re-raising status error")
-> 1041     raise self._make_status_error_from_response(err.response) from None
   1043 return self._process_response(
   1044     cast_to=cast_to,
   1045     options=options,
   (...)
   1049     retries_taken=options.get_max_retries(self.max_retries) - retries,
   1050 )

BadRequestError: Error code: 400 - {'object': 'error', 'message': 'Unknown part type: image', 'type': 'BadRequestError', 'param': None, 'code': 400}

Any hint please.

Thanks,
Steve


5Elza5 commented Sep 17, 2024

> (quoting Steve's full code and BadRequestError above)

try this:

"type": "image_url",
"image_url": {
    "url": f"data:image/jpeg;base64,{base64_image}"
}

from here:
https://platform.openai.com/docs/guides/vision/uploading-base-64-encoded-images
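
Putting the two together, a full corrected request might look like this (a sketch reusing encoded_image_text, client, and prompt from Steve's snippet above; the served model name comes from his server command):

chat_response = client.chat.completions.create(
    model="EraX-VL-V1",
    temperature=0.2,
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {
                    # OpenAI-style content part: "image_url" with a data URL
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{encoded_image_text}"},
                },
                {"type": "text", "text": prompt},
            ],
        },
    ],
)
print(chat_response.choices[0].message.content)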

yuanjietu

Hi, I am running this and got the same error as #8281. Could someone help me with this? Thank you!

Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}

AssertionError Traceback (most recent call last)
/tmp/ipykernel_32014/2600037648.py in
9 del config.rope_scaling['mrope_section']
10
---> 11 llm = LLM(
12 model=MODEL_PATH,
13 limit_mm_per_prompt={'image': 10, 'video': 10},

~/.local/lib/python3.10/site-packages/vllm/entrypoints/llm.py in init(self, model, tokenizer, tokenizer_mode, skip_tokenizer_init, trust_remote_code, tensor_parallel_size, dtype, quantization, revision, tokenizer_revision, seed, gpu_memory_utilization, swap_space, cpu_offload_gb, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, disable_custom_all_reduce, disable_async_output_proc, **kwargs)
176 **kwargs,
177 )
--> 178 self.llm_engine = LLMEngine.from_engine_args(
179 engine_args, usage_context=UsageContext.LLM_CLASS)
180 self.request_counter = Counter()

~/.local/lib/python3.10/site-packages/vllm/engine/llm_engine.py in from_engine_args(cls, engine_args, usage_context, stat_loggers)
545 """Creates an LLM engine from the engine arguments."""
546 # Create the engine configs.
--> 547 engine_config = engine_args.create_engine_config()
548 executor_class = cls._get_executor_cls(engine_config)
549 # Create the LLM engine.

~/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py in create_engine_config(self)
842
843 device_config = DeviceConfig(device=self.device)
--> 844 model_config = self.create_model_config()
845
846 if model_config.is_multimodal_model:

~/.local/lib/python3.10/site-packages/vllm/engine/arg_utils.py in create_model_config(self)
780
781 def create_model_config(self) -> ModelConfig:
--> 782 return ModelConfig(
783 model=self.model,
784 tokenizer=self.tokenizer,

~/.local/lib/python3.10/site-packages/vllm/config.py in init(self, model, tokenizer, tokenizer_mode, trust_remote_code, dtype, seed, revision, code_revision, rope_scaling, rope_theta, tokenizer_revision, max_model_len, spec_target_max_model_len, quantization, quantization_param_path, enforce_eager, max_context_len_to_capture, max_seq_len_to_capture, max_logprobs, disable_sliding_window, skip_tokenizer_init, served_model_name, limit_mm_per_prompt, use_async_output_proc, override_neuron_config, config_format)
225 self.disable_sliding_window = True
226
--> 227 self.max_model_len = _get_and_verify_max_len(
228 hf_config=self.hf_text_config,
229 max_model_len=max_model_len,

~/.local/lib/python3.10/site-packages/vllm/config.py in _get_and_verify_max_len(hf_config, max_model_len, disable_sliding_window, sliding_window_len, spec_target_max_model_len)
1745 scaling_factor = 1
1746 else:
-> 1747 assert "factor" in rope_scaling
1748 scaling_factor = rope_scaling["factor"]
1749 if rope_type == "yarn":

AssertionError:

DarkLight1337 (Member)

See my comment above: #7905 (comment)

exceedzhang

> (quoting yuanjietu's rope_scaling AssertionError above)

huggingface/transformers#33401


YuanLiuuuuuu commented Sep 26, 2024

> Unrecognized keys in rope_scaling for 'rope_type'='default': {'mrope_section'}
> Traceback (most recent call last):
>   File "/workspace/lite/test1.py", line 10, in <module>
>     llm = LLM(
>   File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 178, in __init__
>     self.llm_engine = LLMEngine.from_engine_args(
>   File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 547, in from_engine_args
>     engine_config = engine_args.create_engine_config()
>   File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 844, in create_engine_config
>     model_config = self.create_model_config()
>   File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/engine/arg_utils.py", line 782, in create_model_config
>     return ModelConfig(
>   File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/config.py", line 227, in __init__
>     self.max_model_len = _get_and_verify_max_len(
>   File "/workspace/lite/venv/lib/python3.10/site-packages/vllm/config.py", line 1747, in _get_and_verify_max_len
>     assert "factor" in rope_scaling
> AssertionError

> @AlexanderChen1989 please make sure you have installed this particular version of transformers: pip install git+https://github.com/huggingface/transformers@21fac7abba2a37fae86106f87fcf9974fd1e3830

This version of transformers raises the following error:

ModuleNotFoundError: No module named 'transformers.models.mllama'

DarkLight1337 (Member) commented Sep 26, 2024

> (quoting YuanLiuuuuuu's comment above)

The current version of vLLM requires transformers>=4.45. Qwen2-VL has only just been made compatible with transformers>=4.45 in vLLM, so you'll have to install vLLM from source.
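
For reference, a from-source install might look like this (a sketch; see the vLLM installation docs for the authoritative steps):

git clone https://github.com/vllm-project/vllm.git
cd vllm
pip install -e .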

chenzhengda

@fyabc Hi, I've noticed that in the Qwen2 VL chat template, there is no '\n' after <|vision_end|>, but there is one when launched through the vllm API server. This seems to be a bug.


SepehrV commented Oct 3, 2024

> (quoting the rope_scaling AssertionError and the pinned-transformers suggestion above)

this transformers version is not compatible with the latest vLLM anymore (mllama missing).

I tried this using transformers after the fix in huggingface/transformers#33753, but vLLM still throws assert "factor" in rope_scaling.

DarkLight1337 (Member)

Yeah, you need to install vLLM from source to fix the problem now. Please refer to the top post in this thread.

fyabc (Contributor, Author) commented Oct 8, 2024

> @fyabc Hi, I've noticed that in the Qwen2-VL chat template, there is no '\n' after <|vision_end|>, but there is one when launched through the vllm API server. This seems to be a bug.

@chenzhengda Hi, by default all mm placeholders are joined with a "\n" separator (see vllm.entrypoints.chat_utils._parse_chat_message_content_parts for the detailed implementation). It seems we need to refactor chat_utils.py to fix this bug.
@ywang96 @DarkLight1337 Please also take a look at this problem and check my comments.

DarkLight1337 (Member) commented Oct 8, 2024

> (quoting fyabc's reply above)

Thanks for pointing that out. We have this on our multimodality plan but haven't gotten around to implementing it yet. Since many HF chat templates do not specify how to combine multimodal placeholder tokens (like <image>), we hardcode this to inserting newlines for now. The semantics of HF chat templates and preprocessing differ between models, so we need to give this more thought. An RFC to discuss this in detail would be nice; WDYT @ywang96?
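
As a rough illustration of the hardcoded behavior described above (my own sketch with hypothetical values; the real logic lives in vllm.entrypoints.chat_utils._parse_chat_message_content_parts):

# Multimodal placeholder tokens and the text parts are joined with "\n",
# which is where the extra newline after <|vision_end|> comes from.
placeholders = ["<|vision_start|><|image_pad|><|vision_end|>"]
text_parts = ["What does this diagram illustrate?"]
full_text = "\n".join(placeholders + text_parts)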

seanll-ke

Does Qwen2-VL deployed using vLLM support function calling?

baisong666

> Does Qwen2-VL deployed using vLLM support function calling?

+1

whyiug (Contributor) commented Oct 16, 2024

> (quoting the discussion above about the extra "\n" after <|vision_end|>)

For those of you who want a temporary fix for this, here's how I do it (then reinstall vLLM). Also expect an official fix soon:
whyiug@3495e80#diff-31e6bd0df09a47b5587701203d558701ac46e4f85bf7db83632da9990eaef198R382

mgoin (Collaborator) commented Oct 29, 2024

EDIT: Nevermind, I just had a silly issue where weight_scale was being read as the weight parameter, PR here #9817

Hey @fyabc, I am working on expanding quantization for multimodal models, and currently this special case in the Qwen2-VL weight loading is causing issues:

if "visual" in name and "qkv.weight" in name:
visual_num_heads = self.config.vision_config.num_heads
visual_embed_dim = self.config.vision_config.embed_dim
head_size = visual_embed_dim // visual_num_heads
loaded_weight = loaded_weight.view(3, visual_num_heads,
head_size,
visual_embed_dim)
loaded_weight = loaded_weight.transpose(0, 1)
loaded_weight = loaded_weight.reshape(-1, visual_embed_dim)
elif "visual" in name and "qkv.bias" in name:
visual_num_heads = self.config.vision_config.num_heads
visual_embed_dim = self.config.vision_config.embed_dim
head_size = visual_embed_dim // visual_num_heads
loaded_weight = loaded_weight.view(3, visual_num_heads,
head_size)
loaded_weight = loaded_weight.transpose(0, 1)
loaded_weight = loaded_weight.reshape(-1)

Could you offer insight into why this is required and if we could apply a transformation on the inputs rather than the weights?
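
A toy check of what that reshaping does (my own sketch, not from the PR): it regroups the fused qkv rows from [all q heads; all k heads; all v heads] order into per-head [q, k, v] blocks:

import torch

num_heads, head_size = 2, 3
embed_dim = num_heads * head_size
qkv = torch.arange(3 * embed_dim * embed_dim).view(3 * embed_dim, embed_dim)
regrouped = (qkv.view(3, num_heads, head_size, embed_dim)
                .transpose(0, 1)
                .reshape(-1, embed_dim))
# Rows are now ordered head0-q, head0-k, head0-v, head1-q, ..., which is
# presumably the layout vLLM's fused vision QKV projection expects.
print(regrouped.shape)  # torch.Size([18, 6])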

Labels: ready

Successfully merging this pull request may close these issues: [Bug]: Qwen2-VL AssertionError: assert "factor" in rope_scaling · [New Model]: Qwen2-VL