

@christian-pinto
Contributor

@christian-pinto christian-pinto commented Aug 13, 2025

This PR introduces the ability to pre/post-process pooling model inputs/outputs via an IO Processor plugin. This allows users to feed complex data structures to vLLM that are then parsed into prompts by the plugin, and to generate any type of data from the model output (e.g., images).

Main features:

  • A new offline-inference method encode_with_io_processor added to LLM and AsyncLLM, and a new endpoint /io_processor_pooling for online serving.
  • A new plugin group vllm.io_processor_plugins (a registration sketch follows this list).
  • IO Processor plugins are instantiated on a per-instance basis via the EngineArgs argument io_processor_plugin or via a custom io_processor_plugin field in the model configuration (config.json).
  • An example plugin for the Prithvi model is included in the testing pipeline, along with examples for both offline inference and online serving.
  • Documentation updated.
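For illustration, here is a minimal sketch of what a plugin package registering under the vllm.io_processor_plugins group could look like. The MyIOProcessor class, its pre_process/post_process method names, the get_io_processor entry-point target, and the package name are hypothetical placeholders, not the interface defined by this PR:

```python
# pyproject.toml of the plugin package (illustrative):
#
#   [project.entry-points."vllm.io_processor_plugins"]
#   prithvi_io_processor = "my_plugin_pkg.plugin:get_io_processor"

# my_plugin_pkg/plugin.py -- hypothetical plugin skeleton
from typing import Any


class MyIOProcessor:
    """Turns raw request data into prompts and turns pooling outputs
    back into user-facing data (e.g. images)."""

    def pre_process(self, request_data: dict[str, Any]) -> list[str]:
        # One request may expand into one or more model prompts.
        return [f"<prompt built from {request_data['data']}>"]

    def post_process(self, pooling_outputs: list[Any]) -> dict[str, Any]:
        # Fold the per-prompt outputs back into a single result.
        return {"type": "image", "data": pooling_outputs}


def get_io_processor():
    # Entry-point target: hand the plugin class back to vLLM.
    return MyIOProcessor
```

The plugin to use would then be selected per engine instance via the io_processor_plugin engine argument or the io_processor_plugin field in config.json, as listed above.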

This is an alternative implementation to #21621.

@DarkLight1337 @maxdebayser @mgazz

Closes #12249

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, covering a small but essential subset of tests to quickly catch errors. You can run other CI tests on top of it by going to your fastcheck build on the Buildkite UI (linked in the PR checks section) and unblocking them. If you do not have permission to unblock, ping simon-mo or khluu to add you to our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run full CI, PR reviewers can either add the ready label to the PR or enable auto-merge.

🚀

@mergify mergify bot added the documentation (Improvements or additions to documentation) and frontend labels Aug 13, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces an alternative implementation for multimodal output generation using plugins, defining a new entry point LLMForImageTiling and a plugin mechanism for data pre- and post-processing. The changes are a good step forward, but I've identified a critical import error due to a typo in a module path that will break the code. Additionally, there are several high-severity issues concerning API consistency in type definitions and a potential for silent failures in the plugin loading mechanism, which could make debugging difficult. I've provided suggestions to address these points.

vllm/__init__.py Outdated
Contributor


critical

There is a typo in the import path. It should be vllm.entrypoints.llm_for_image_tiling instead of vllm.entrypoint.llm_for_image_tiling (missing 's' in entrypoints). This will cause an ImportError at runtime.

Suggested change:
- from vllm.entrypoint.llm_for_image_tiling import LLMForImageTiling
+ from vllm.entrypoints.llm_for_image_tiling import LLMForImageTiling

Comment on lines 23 to 25
Contributor


high

The variable name input shadows the built-in function input(). This is a discouraged practice as it can lead to unexpected behavior and bugs if the built-in function is needed later in the code. It's recommended to use a different name, for example, model_input, especially in an example file that users might copy.

Suggested change:
- input = {"type": "url", "data": image_url}
- output = llm.predict(input)
+ model_input = {"type": "url", "data": image_url}
+ output = llm.predict(model_input)

Contributor


high

The format field in ImagePrompt is defined as a required field. However, the example usage in examples/offline_inference/prithvi_geospatial_mae_multimodal_processor.py does not provide it. This inconsistency can lead to runtime errors or confusion for users of the API. To align with the example and make the API more flexible, consider making the format field optional using NotRequired.

Suggested change:
- format: str
+ format: NotRequired[str]
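As a side note, here is a small sketch of how NotRequired behaves on a TypedDict. The ImagePrompt body below is assumed, with only the field names mirroring the example script and this comment:

```python
from typing import Any

# Both available from typing on Python 3.11+; typing_extensions covers older versions.
from typing_extensions import NotRequired, TypedDict


class ImagePrompt(TypedDict):
    type: str
    data: Any
    format: NotRequired[str]  # callers may omit this key


# Both dictionaries type-check: "format" is now optional.
with_format: ImagePrompt = {"type": "url", "data": "https://example.com/img.tif", "format": "tiff"}
without_format: ImagePrompt = {"type": "url", "data": "https://example.com/img.tif"}
```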

vllm/outputs.py Outdated
Contributor


high

The type hint for the type parameter in ImageRequestOutput.__init__ is str, but it is assigned to self.type which is annotated as Literal["path", "object"]. This is a type inconsistency. The parameter's type hint should be Literal["path", "object"] to match the attribute's type for correctness and clarity.

Suggested change:
- def __init__(self, type: str, format: str, data: Any):
+ def __init__(self, type: Literal["path", "object"], format: str, data: Any):

Comment on lines 44 to 45
Contributor


high

The broad except Exception: pass will silently ignore any errors that occur during plugin loading, including assertion errors or import errors within the plugin's loading function. This can make debugging issues with plugins very difficult. It's better to at least log the exception to provide visibility into why a plugin might have failed to load.

Suggested change:
  except Exception:
-     pass
+     logger.warning("Failed to load plugin %s.", name, exc_info=True)

@christian-pinto christian-pinto changed the title [Miscc] multimodal output generation with plugins [Misc] multimodal output generation with plugins Aug 13, 2025
@maxdebayser
Contributor

I think the plugin approach would make more sense in an online setting, because if you're using a vLLM image or an existing hosted vLLM server that is difficult to customize, then being able to load plugins makes a lot of sense. For the offline use case, since you're writing Python code and are responsible for deploying it anyway, it doesn't make much of a difference: you can just do the pre- and post-processing in the code that calls LLM.encode().
Another caveat is that the vLLM code changes quite quickly. Keeping up with upstream changes is a constant challenge in the vllm-spyre plugin. Since you can't control which version of vLLM users are running, you might have to add code to handle a range of vLLM versions.

@christian-pinto
Contributor Author

> I think the plugin approach would make more sense in an online setting, because if you're using a vLLM image or an existing hosted vLLM server that is difficult to customize, then being able to load plugins makes a lot of sense. For the offline use case, since you're writing Python code and are responsible for deploying it anyway, it doesn't make much of a difference: you can just do the pre- and post-processing in the code that calls LLM.encode(). Another caveat is that the vLLM code changes quite quickly. Keeping up with upstream changes is a constant challenge in the vllm-spyre plugin. Since you can't control which version of vLLM users are running, you might have to add code to handle a range of vLLM versions.

Hey @maxdebayser ,

thanks a lot for your comment. I totally agree that the offline case is not as useful as the online one: if you're writing the code anyway, why not implement everything yourself? I have used it only to establish this plugin mechanism. I have the online version almost ready and will push it soon.

Regarding the plugins, yes, they might be tricky to support. However, I believe the entrypoint part of vLLM is pretty stable so far, at least not as fluid as the core. I like the idea of having this part as plugins because you can come up with new processing methods that can be applied to models without having to hack vLLM. We could perhaps even control it dynamically with an env var, which would be quite nice in an online setting.

Member

@DarkLight1337 DarkLight1337 Aug 14, 2025


I would move this to the example script as well, since this can't be used in online inference and only really makes sense in the context of the example script.

Member


To avoid adding bloat to our library, I much prefer these types of wrappers to be the user's responsibility

Contributor Author


Yes, I see your point. I can move this to the examples.

Member


Same here

Member

@DarkLight1337 DarkLight1337 Aug 14, 2025


Based on the discussion from #14526, I think we can apply the processor inside the encode method specifically, while leaving the other methods unchanged. This would allow it to work in online inference as well.

Member


This is basically why I suggested applying the plugin inside encode. That way you can use the encode endpoint with any plugin instead of having to create a new endpoint.

Contributor Author


TBH I structured it like this after discussing with you in the other PR (#21621 (comment)). You were suggesting avoiding touching the async engine and instead putting everything in the entrypoint.

I started liking the entrypoint idea because, as models show up, we can start offering image-specific endpoints instead of relying on the basic pooling one.

Also, if I do this in encode, I will have to initialize the plugin in the AsyncLLM init and therefore add an extra argument for starting an engine with/without such a data pre/post-processor loaded.

Just making sure we are fully aligned before I try implementing it inside the encode.

Member


Oh, I was mainly concerned about the aggregation part, where multiple inputs correspond to one output. After looking at #14526, I feel that we still need a plugin system to postprocess the outputs (in a one-to-one manner). Sorry for the confusion!

Contributor Author


I see. In my case the plugin is what splits the input into multiple prompts:

pre-process the input with the plugin -> generate 1+ prompts -> inference -> postprocess 1+ outputs with plugin -> 1 final output

Could we make one plugin system serve both purposes? In the end, a plugin could either aggregate multiple inference results or simply do a 1-to-1 mapping. Especially if we overhaul the PoolerOutput data structure, we should be able to cater to multiple use cases.
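A tiny sketch of the driver loop this flow implies; run_with_plugin, plugin, and the method names are hypothetical placeholders, with llm.encode standing in for the pooling call discussed above (a 1-to-1 plugin would simply pass the single output through):

```python
def run_with_plugin(plugin, llm, request):
    prompts = plugin.pre_process(request)        # 1 request -> 1+ prompts
    outputs = [llm.encode(p) for p in prompts]   # pooling inference per prompt
    return plugin.post_process(outputs)          # 1+ outputs -> 1 final result
```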

Member


Since the existing methods assume a 1-to-1 correspondence between input and output, I think it would be better to define a new method (both in the LLM class and in online inference) for this flexibility. Maybe call it LLM.plugin_forward.

Contributor Author


Are you suggesting adding something like plugin_encode or similar to (Async-)LLM? That can work for me. Would you still prefer the entrypoint to remain the same for the online mode or would you be in favor of having an endpoint dedicated to images as in this version?

Member


A separate endpoint should also be used for online inference. Ideally, we can pass in a parameter to determine which plugin to use for output processing (in case more than one plugin is loaded)

@mergify

mergify bot commented Aug 19, 2025

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @christian-pinto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Aug 19, 2025
@christian-pinto
Contributor Author

@DarkLight1337 new version here

I have added a method to LLM and AsyncLLM and added an entrypoint to the server. For now I like the idea that the entrypoint is specific to images, as people might want/need to support other types in the future. It also keeps the data exchanged with the plugin a bit more structured. A rough usage sketch is below.
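The sketch assumes the method and endpoint names from the PR description (encode_with_io_processor and /io_processor_pooling); the model id, plugin name, constructor argument, request payload, and exact signatures are illustrative placeholders, not the final API:

```python
import requests
from vllm import LLM

# Offline: the loaded IO Processor plugin pre/post-processes the request.
llm = LLM(
    model="<prithvi-geospatial-model>",          # placeholder model id
    io_processor_plugin="prithvi_io_processor",  # placeholder plugin name
)
result = llm.encode_with_io_processor({"type": "url", "data": "https://example.com/tile.tif"})

# Online: POST a similarly structured payload to the new endpoint.
resp = requests.post(
    "http://localhost:8000/io_processor_pooling",
    json={"type": "url", "data": "https://example.com/tile.tif"},
)
print(resp.json())
```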

@christian-pinto christian-pinto force-pushed the multimodal_processing_plugin_entrypoint branch from c608cfd to 7dc699f Compare August 19, 2025 15:21
@mergify mergify bot removed the needs-rebase label Aug 19, 2025
@christian-pinto christian-pinto force-pushed the multimodal_processing_plugin_entrypoint branch from 1e57eb3 to a5fcab1 Compare August 20, 2025 08:24
@christian-pinto
Contributor Author

Hey @DarkLight1337, any comments on this new version? I have rebased onto the current master. Thanks.

Member

@DarkLight1337 DarkLight1337 left a comment


Sorry for the delay, was caught up with other things. Left some comments.

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) August 30, 2025 15:29
@github-actions github-actions bot added the ready (ONLY add when PR is ready to merge/full CI is needed) label Aug 30, 2025
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
…oint

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
auto-merge was automatically disabled August 30, 2025 18:04

Head branch was pushed to by a user without write access

@maxdebayser
Contributor

The failing Blackwell test is not related.

…oint

Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
@vllm-bot vllm-bot merged commit 1cb39db into vllm-project:main Sep 1, 2025
69 of 71 checks passed
didier-durand pushed a commit to didier-durand/vllm that referenced this pull request Sep 1, 2025
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
@christian-pinto
Contributor Author

Many thanks @maxdebayser for your contributions. The whole code is indeed more streamlined now.
Thanks @DarkLight1337 for your patience in dealing with this PR.
I have spotted a few issues with the links in the docs. I will open a new PR to fix them.

On to the next one now!

eicherseiji pushed a commit to eicherseiji/vllm that referenced this pull request Sep 9, 2025
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>
FeiDaLI pushed a commit to FeiDaLI/vllm that referenced this pull request Sep 25, 2025
Signed-off-by: Christian Pinto <christian.pinto@ibm.com>
Signed-off-by: Max de Bayser <mbayser@br.ibm.com>
Co-authored-by: Max de Bayser <mbayser@br.ibm.com>

Labels

ci/build · documentation (Improvements or additions to documentation) · frontend · ready (ONLY add when PR is ready to merge/full CI is needed) · v1

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[RFC]: Hidden states processor

4 participants