[Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. #20105

Open: huachenheli wants to merge 14 commits into main from huachenheli/mm_configs

Conversation

huachenheli
Contributor

@huachenheli huachenheli commented Jun 26, 2025

Purpose

New model additions currently suffer from two pain points:

  1. Adding new multimodal models or modalities requires changing the hard-coded _placeholder_str function in https://github.com/vllm-project/vllm/blob/main/vllm/entrypoints/chat_utils.py#L507. The placeholder is now configurable via --mm-placeholder-str-override, so future additions can be contained within the model_executor/models/ directory without modifying vLLM framework code.
  2. Video input processing currently uses a hard-coded 32-frame configuration. The new --media-io-kwargs flag passes per-modality keyword arguments to media loading, which allows flexible and extensible media IO policies in the future.
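
A minimal sketch of the first point, assuming the override is a plain modality-to-string mapping that is consulted before any built-in default. The names DEFAULT_PLACEHOLDERS and resolve_placeholder, and the default strings, are hypothetical and are not vLLM's actual _placeholder_str implementation:

```python
from typing import Mapping, Optional

# Hypothetical built-in defaults standing in for hard-coded framework logic.
DEFAULT_PLACEHOLDERS = {"image": "<image>", "video": "<video>"}


def resolve_placeholder(
    modality: str,
    override: Optional[Mapping[str, str]] = None,
) -> str:
    """Return the placeholder for a modality, preferring a user override."""
    if override and modality in override:
        return override[modality]
    if modality in DEFAULT_PLACEHOLDERS:
        return DEFAULT_PLACEHOLDERS[modality]
    raise ValueError(f"No placeholder known for modality {modality!r}")


# Only "video" is overridden; "image" falls back to the built-in default.
override = {"video": "<|video_placeholder|>"}
print(resolve_placeholder("video", override))
print(resolve_placeholder("image", override))
```

With this shape, a new model only needs an entry in the override mapping; the fallback table is untouched.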

Example flag formats:
--mm-placeholder-str-override '{"video":"<|video_placeholder|>", "image": "<|image_placeholder|>"}'
--media-io-kwargs '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }'
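
Both flags take a JSON object as their value. The following is a rough sketch of how such values could be parsed with the standard library; it uses argparse with json.loads as the type converter, which is an assumption for illustration and not necessarily how vLLM's argument parser handles them. The helper name parse_mm_flags is hypothetical:

```python
import argparse
import json
from typing import List


def parse_mm_flags(argv: List[str]) -> argparse.Namespace:
    """Parse the two JSON-valued flags into Python dictionaries."""
    parser = argparse.ArgumentParser()
    # json.loads raises ValueError on malformed input, which argparse
    # reports as a normal usage error.
    parser.add_argument("--mm-placeholder-str-override", type=json.loads, default={})
    parser.add_argument("--media-io-kwargs", type=json.loads, default={})
    return parser.parse_args(argv)


args = parse_mm_flags([
    "--mm-placeholder-str-override",
    '{"video":"<|video_placeholder|>", "image": "<|image_placeholder|>"}',
    "--media-io-kwargs",
    '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }',
])
print(args.mm_placeholder_str_override["video"])
print(args.media_io_kwargs["video"]["num_frames"])
```

Note that the outer keys of --media-io-kwargs are modalities and the inner dictionaries are free-form, so unknown keys such as "foo" parse without error.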

Test Plan

  • Unit tests:
    pytest tests/engine/test_arg_utils.py
    pytest tests/multimodal/test_video.py
    pytest tests/multimodal/test_utils.py

  • vllm serve test: printed the parsed args and verified that they are passed through as expected.

vllm serve Qwen/Qwen2.5-VL-7B-Instruct --port 8000 --host 0.0.0.0 --dtype bfloat16 --limit-mm-per-prompt image=5,video=5 --mm-placeholder-str-override '{"video":"<|video_placeholder|>"}' --media-io-kwargs '{"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}, "image": {"foo":"bar"} }'
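
Hand-quoting nested JSON on a shell command line is error-prone. As a sketch, the flag values can be generated from Python dictionaries with json.dumps and quoted with shlex.join; the helper name build_serve_command is hypothetical, and only the two flags from this PR description are included:

```python
import json
import shlex
from typing import Any, Dict


def build_serve_command(
    model: str,
    placeholder_override: Dict[str, str],
    media_io_kwargs: Dict[str, Dict[str, Any]],
) -> str:
    """Build a shell-safe `vllm serve` command with JSON-valued flags."""
    argv = [
        "vllm", "serve", model,
        "--mm-placeholder-str-override", json.dumps(placeholder_override),
        "--media-io-kwargs", json.dumps(media_io_kwargs),
    ]
    # shlex.join quotes each argument so the shell passes the JSON through intact.
    return shlex.join(argv)


command = build_serve_command(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    {"video": "<|video_placeholder|>"},
    {"video": {"num_frames": 40, "fps": 2.0}},
)
print(command)
```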

…strings.

Summary:

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:
Signed-off-by: Chenheli Hua <huachenheli@outlook.com>

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which covers a small, essential subset of CI tests to catch errors quickly. You can run other CI tests on top of those by going to your fastcheck build in the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@huachenheli huachenheli marked this pull request as ready for review June 26, 2025 00:18
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Summary of Changes

Hello @huachenheli, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the flexibility and extensibility of multimodal model support by introducing configurable placeholder strings and dynamic video sampling policies. These changes empower users to easily integrate new multimodal models and customize video input processing without modifying core framework code, making the system more adaptable to future model additions and diverse use cases.

Highlights

  • Configurable Multimodal Placeholder Strings: Introduced a new CLI flag --mm-placeholder-str-override that allows users to specify custom placeholder strings for different modalities (e.g., video, image). This removes the need to hardcode these strings within the framework, simplifying the addition of new multimodal models.
  • Flexible Video Sampling Policies: Added a new CLI flag --video-media-io-kwargs which enables passing arbitrary keyword arguments to the video processing pipeline. This allows for flexible and extensible video sampling policies, moving beyond the previously hard-coded 32-frame configuration.
  • Core Configuration Updates: Updated ModelConfig and MultiModalConfig to include the new video_media_io_kwargs and mm_placeholder_str_override fields, ensuring these configurations are propagated throughout the system.
  • Media Connector Enhancements: Modified the MediaConnector to accept and utilize the new video_media_io_kwargs, ensuring that video fetching and processing can leverage the newly introduced flexible policies.
  • Video Loader API Extension: Extended the VideoLoader and VideoMediaIO interfaces to accept and pass through arbitrary keyword arguments, allowing custom video loaders to define and use their own specific parameters for video processing.
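
The kwargs pass-through described in the last highlight can be sketched as follows. This is an illustration only: the function name sample_frame_indices is hypothetical, evenly spaced sampling is an assumed policy, and the default of 32 frames mirrors the previously hard-coded value mentioned in the PR description. Extra keys such as fps or foo are accepted and ignored here, whereas a custom loader could act on them:

```python
from typing import Any, List


def sample_frame_indices(
    total_frames: int,
    num_frames: int = 32,
    **kwargs: Any,
) -> List[int]:
    """Pick evenly spaced frame indices; unknown kwargs are ignored."""
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    if total_frames <= 0:
        return []
    if num_frames >= total_frames:
        # Fewer frames available than requested: keep them all.
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(i * step) for i in range(num_frames)]


# Per-modality kwargs as they might arrive from the CLI flag.
media_io_kwargs = {"video": {"num_frames": 40, "fps": 2.0, "foo": "bar"}}
indices = sample_frame_indices(120, **media_io_kwargs["video"])
print(len(indices), indices[:4])
```

Accepting **kwargs keeps the loader signature stable while letting individual loaders define their own parameters.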

@mergify mergify bot added frontend multi-modality Related to multi-modality (#4194) labels Jun 26, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request effectively addresses the stated pain points by introducing configurable multimodal placeholder strings and flexible video sampling policies via CLI flags. The changes are well-integrated across the configuration, argument parsing, and multimodal utility layers. New test cases have been added to validate the parsing of the new CLI arguments and the propagation of video processing parameters. The overall code quality is good, with clear intent and appropriate use of dataclasses and type hints. One minor type mismatch was identified in the EngineArgs definition for mm_placeholder_str_override.

@ywang96 ywang96 self-assigned this Jun 26, 2025
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch 2 times, most recently from 1d1f103 to aa4277e Compare June 26, 2025 04:37
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch 2 times, most recently from 70368b9 to da55875 Compare June 26, 2025 04:45
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch from da55875 to 3f75b64 Compare June 26, 2025 04:54
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch from fe04d60 to 5d0b05b Compare June 26, 2025 16:35
@huachenheli huachenheli changed the title Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. [Frontend] Support configurable mm placeholder strings & flexible video sampling policies via CLI flags. Jun 26, 2025
@huachenheli huachenheli force-pushed the huachenheli/mm_configs branch from 2e41ab8 to 1f7417d Compare June 27, 2025 04:00
@huachenheli huachenheli requested a review from njhill as a code owner June 27, 2025 04:00
Member

@DarkLight1337 DarkLight1337 left a comment

LGTM, thanks for working on this! @ywang96 any further comments?

@DarkLight1337
Member

Perhaps we can later work on a solution similar to #20179 to define the placeholder token inside the model class....
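
One possible shape for that idea, sketched under heavy assumptions: every class and method name below is hypothetical, and nothing here reflects what #20179 or vLLM actually implements. The point is only that the placeholder lives next to the model code rather than in a CLI flag or in framework code:

```python
from typing import Optional


class BaseMultiModalModel:
    """Hypothetical base class; models override the hook below."""

    @classmethod
    def get_placeholder_str(cls, modality: str) -> Optional[str]:
        # None means "this model defines no placeholder for the modality".
        return None


class ExampleVideoModel(BaseMultiModalModel):
    """Hypothetical model that declares its own video placeholder."""

    @classmethod
    def get_placeholder_str(cls, modality: str) -> Optional[str]:
        if modality == "video":
            return "<|video_placeholder|>"
        return None


print(ExampleVideoModel.get_placeholder_str("video"))
print(ExampleVideoModel.get_placeholder_str("audio"))
```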

3 participants