[Feature]: Inquiry about Multi-modal Support in VLLM for MiniCPM-V2.6

### 🚀 The feature, motivation and pitch

I am currently exploring the capabilities of the VLLM library and am interested in understanding its support for multi-modal inputs, particularly for models like MiniCPM-V2.6. I would like to know if VLLM is designed to handle multi-image and video inputs for such models.

### Alternatives

1. **Model of Interest**: MiniCPM-V2.6
2. **Types of Input**: Multi-image and video
3. **Current Understanding**:
   - I have reviewed the documentation and initial examples provided with VLLM.
  - It seems that both `multiple 'image_url' input` and `list value in image_url` is currently not supported.
  - However, I am not sure if it supports the processing of multiple images or videos as input to a model like MiniCPM-V2.6.
## Questions
 1. Does VLLM support the integration of MiniCPM-V2.6 for processing multi-image and video inputs?
 2. If yes, could you provide an example or a guide on how to set up and use this feature?
 3. If not, are there any plans to extend VLLM's capabilities to support such inputs in the future?

### Additional context

![image](https://github.com/user-attachments/assets/627dd626-dee1-41cb-b231-9c13163e9174)


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

Uh oh!

[Feature]: Inquiry about Multi-modal Support in VLLM for MiniCPM-V2.6 #7546

🚀 The feature, motivation and pitch

Alternatives

Questions

Additional context

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Uh oh!

Uh oh!

[Feature]: Inquiry about Multi-modal Support in VLLM for MiniCPM-V2.6 #7546

Description

🚀 The feature, motivation and pitch

Alternatives

Questions

Additional context

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions