[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. #20105
base: main
Conversation
…strings. Summary: Test Plan: Reviewers: Subscribers: Tasks: Tags: Signed-off-by: Chenheli Hua <huachenheli@outlook.com>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a limited set of checks runs. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @huachenheli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request significantly enhances the flexibility and extensibility of multimodal model support by introducing configurable placeholder strings and dynamic video sampling policies. These changes empower users to easily integrate new multimodal models and customize video input processing without modifying core framework code, making the system more adaptable to future model additions and diverse use cases.
Highlights
- **Configurable Multimodal Placeholder Strings**: Introduced a new CLI flag, `--mm-placeholder-str-override`, that allows users to specify custom placeholder strings for different modalities (e.g., video, image). This removes the need to hard-code these strings within the framework, simplifying the addition of new multimodal models.
- **Flexible Video Sampling Policies**: Added a new CLI flag, `--video-media-io-kwargs`, which enables passing arbitrary keyword arguments to the video processing pipeline. This allows for flexible and extensible video sampling policies, moving beyond the previously hard-coded 32-frame configuration.
- **Core Configuration Updates**: Updated `ModelConfig` and `MultiModalConfig` to include the new `video_media_io_kwargs` and `mm_placeholder_str_override` fields, ensuring these configurations are propagated throughout the system.
- **Media Connector Enhancements**: Modified the `MediaConnector` to accept and utilize the new `video_media_io_kwargs`, ensuring that video fetching and processing can leverage the newly introduced flexible policies.
- **Video Loader API Extension**: Extended the `VideoLoader` and `VideoMediaIO` interfaces to accept and pass through arbitrary keyword arguments, allowing custom video loaders to define and use their own specific parameters for video processing.
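To make the placeholder-override flow concrete, here is a minimal sketch of how a per-modality override could take precedence over built-in defaults. This is illustrative only: `resolve_placeholder`, `DEFAULT_PLACEHOLDERS`, and the default strings are assumptions, not vLLM's actual code.

```python
# Hypothetical sketch: resolve a modality's placeholder string, preferring a
# user-supplied override (e.g. the parsed --mm-placeholder-str-override JSON)
# over built-in defaults. Names and defaults are illustrative, not vLLM's API.
from typing import Optional

DEFAULT_PLACEHOLDERS = {
    "image": "<image>",
    "video": "<video>",
}


def resolve_placeholder(
    modality: str,
    override: Optional[dict[str, str]] = None,
) -> str:
    """Return the placeholder string for a modality.

    `override` mirrors the parsed JSON of the CLI flag, e.g.
    {"video": "<|video_placeholder|>"}.
    """
    if override and modality in override:
        return override[modality]
    try:
        return DEFAULT_PLACEHOLDERS[modality]
    except KeyError:
        raise ValueError(f"Unknown modality: {modality}") from None
```

With this shape, adding a new model's placeholder no longer requires touching the framework table; the override dict carries it in from the CLI.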
Code Review
This pull request effectively addresses the stated pain points by introducing configurable multimodal placeholder strings and flexible video sampling policies via CLI flags. The changes are well integrated across the configuration, argument parsing, and multimodal utility layers. New test cases validate the parsing of the new CLI arguments and the propagation of video processing parameters. The overall code quality is good, with clear intent and appropriate use of dataclasses and type hints. One minor type mismatch was identified in the `EngineArgs` definition for `mm_placeholder_str_override`.
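The type-mismatch point can be illustrated with a minimal sketch: a JSON-valued flag like `--mm-placeholder-str-override` should end up annotated as an optional mapping on the args dataclass, not as a plain `str`. The class and helper below are hypothetical, not vLLM's actual `EngineArgs`.

```python
# Illustrative sketch of annotating JSON-valued engine arguments; this is NOT
# the real vLLM EngineArgs class, just a minimal model of the pattern.
import json
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class EngineArgsSketch:
    # Annotate as the parsed mapping, not as `str`: the CLI layer parses the
    # raw JSON string before it reaches the dataclass.
    mm_placeholder_str_override: Optional[dict[str, str]] = None
    media_io_kwargs: dict[str, dict[str, Any]] = field(default_factory=dict)


def parse_json_flag(raw: str) -> dict:
    """Parse a JSON CLI value such as '{"video": "<|video_placeholder|>"}'."""
    value = json.loads(raw)
    if not isinstance(value, dict):
        raise ValueError(f"Expected a JSON object, got: {raw!r}")
    return value
```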
LGTM, thanks for working on this! @ywang96 any further comments?
Perhaps we can later work on a solution similar to #20179 to define the placeholder token inside the model class.
Purpose
New model additions currently suffer from two pain points:
1. Placeholder strings are hard-coded in the `_placeholder_str` function in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py#L507. This is now configurable via `--mm-placeholder-str-override`, so future additions can be contained within the `model_executor/models/` directory without modifying the vLLM framework code.
2. Video sampling policies were previously hard-coded (a fixed 32-frame configuration). `--media-io-kwargs` now allows passing arbitrary keyword arguments through to the video processing pipeline.

Example flag formats:
- `--mm-placeholder-str-override '{"video":"<|video_placeholder|>", "image": "<|image_placeholder|>"}'`
- `--media-io-kwargs '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }'`
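As a rough sketch of how the parsed per-modality kwargs could be threaded through to a loader (class and function names below are hypothetical, not vLLM's actual `MediaConnector`/`VideoLoader` API):

```python
# Illustrative only: forward per-modality IO kwargs (e.g. parsed from
# --media-io-kwargs) to a pluggable video loader. Names are hypothetical.
from typing import Any


class SimpleVideoLoader:
    def load(self, uri: str, *, num_frames: int = 32, fps: float = 1.0,
             **extra: Any) -> dict:
        # A real loader would decode frames here; we just echo the effective
        # sampling policy so the kwargs pass-through is visible.
        return {"uri": uri, "num_frames": num_frames, "fps": fps,
                "extra": extra}


def fetch_video(uri: str, media_io_kwargs: dict[str, dict[str, Any]]) -> dict:
    # Pull the "video" sub-dict; an empty dict falls back to the loader's
    # defaults (mirroring the old fixed 32-frame policy).
    video_kwargs = media_io_kwargs.get("video", {})
    return SimpleVideoLoader().load(uri, **video_kwargs)
```

Unknown keys (like `"foo": "bar"` in the example flag) land in `**extra`, which is what lets custom loaders define their own parameters without framework changes.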
Test Plan
Unit tests:
- `pytest tests/engine/test_arg_utils.py`
- `pytest tests/multimodal/test_video.py`
- `pytest tests/multimodal/test_utils.py`

`vllm serve` test: printed and verified that the args are passed through as expected.

`vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000 --host 0.0.0.0 --dtype bfloat16 --limit-mm-per-prompt image=5,video=5 --mm-placeholder-str-override '{"video":"<|video_placeholder|>"}' --media-io-kwargs '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }'`
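Once the server above is running, a request exercising the video path might look like the sketch below. The endpoint shape follows vLLM's OpenAI-compatible API; the video URL and helper names are placeholders, and `post_chat_completion` is only defined, not called, since it requires a live server.

```python
# Sketch of exercising the served model from the test plan (assumes the
# `vllm serve` command above is running on localhost:8000; the video URL is
# a placeholder). Helper names here are hypothetical.
import json
import urllib.request


def build_video_request(model: str, video_url: str, prompt: str) -> dict:
    """Build an OpenAI-compatible chat payload containing a video URL."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "video_url", "video_url": {"url": video_url}},
            ],
        }],
    }


def post_chat_completion(payload: dict,
                         host: str = "http://localhost:8000") -> dict:
    """POST to /v1/chat/completions (requires a running server)."""
    req = urllib.request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)
```

For example, `post_chat_completion(build_video_request("Qwen/Qwen2.5-VL-7B-Instruct", "https://example.com/sample.mp4", "Describe this video."))` would hit the video sampling path configured by `--media-io-kwargs`.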