[Misc] IO Processor plugins for pooling models #22820
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a reduced set of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Code Review
This pull request introduces an alternative implementation for multimodal output generation using plugins, defining a new entry point LLMForImageTiling and a plugin mechanism for data pre- and post-processing. The changes are a good step forward, but I've identified a critical import error due to a typo in a module path that will break the code. Additionally, there are several high-severity issues concerning API consistency in type definitions and a potential for silent failures in the plugin loading mechanism, which could make debugging difficult. I've provided suggestions to address these points.
vllm/__init__.py
Outdated
There is a typo in the import path. It should be vllm.entrypoints.llm_for_image_tiling instead of vllm.entrypoint.llm_for_image_tiling (missing 's' in entrypoints). This will cause an ImportError at runtime.
    Current:   from vllm.entrypoint.llm_for_image_tiling import LLMForImageTiling
    Suggested: from vllm.entrypoints.llm_for_image_tiling import LLMForImageTiling
The variable name input shadows the built-in function input(). This is a discouraged practice as it can lead to unexpected behavior and bugs if the built-in function is needed later in the code. It's recommended to use a different name, for example, model_input, especially in an example file that users might copy.
    Current:
        input = {"type": "url", "data": image_url}
        output = llm.predict(input)
    Suggested:
        model_input = {"type": "url", "data": image_url}
        output = llm.predict(model_input)
vllm/inputs/data.py
Outdated
The format field in ImagePrompt is defined as a required field. However, the example usage in examples/offline_inference/prithvi_geospatial_mae_multimodal_processor.py does not provide it. This inconsistency can lead to runtime errors or confusion for users of the API. To align with the example and make the API more flexible, consider making the format field optional using NotRequired.
    Current:   format: str
    Suggested: format: NotRequired[str]
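For illustration, a minimal sketch of the suggested change (only the format field comes from the diff; the other fields and values are made up for context):

    from typing import Any
    from typing_extensions import NotRequired, TypedDict

    class ImagePrompt(TypedDict):
        # Placeholder fields standing in for whatever ImagePrompt actually declares.
        type: str
        data: Any
        # NotRequired lets callers omit the key, matching the example script.
        format: NotRequired[str]

    # Both of these now type-check:
    with_format: ImagePrompt = {"type": "url", "data": "https://example.com/img", "format": "tiff"}
    without_format: ImagePrompt = {"type": "url", "data": "https://example.com/img"}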
vllm/outputs.py
Outdated
The type hint for the type parameter in ImageRequestOutput.__init__ is str, but it is assigned to self.type which is annotated as Literal["path", "object"]. This is a type inconsistency. The parameter's type hint should be Literal["path", "object"] to match the attribute's type for correctness and clarity.
    Current:   def __init__(self, type: str, format: str, data: Any):
    Suggested: def __init__(self, type: Literal["path", "object"], format: str, data: Any):
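For context, this is roughly what the consistent version looks like (only the __init__ signature comes from the suggestion; the rest of the class body is illustrative):

    from typing import Any, Literal

    class ImageRequestOutput:
        def __init__(self, type: Literal["path", "object"], format: str, data: Any):
            # The parameter annotation now matches the attribute annotation below.
            self.type: Literal["path", "object"] = type
            self.format = format
            self.data = data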
The broad except Exception: pass will silently ignore any errors that occur during plugin loading, including assertion errors or import errors within the plugin's loading function. This can make debugging issues with plugins very difficult. It's better to at least log the exception to provide visibility into why a plugin might have failed to load.
    Current:
        except Exception:
            pass
    Suggested:
        except Exception:
            logger.warning("Failed to load plugin %s.", name, exc_info=True)
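As a sketch of the pattern being suggested (the group name is the one from this PR's description; the loader below is illustrative rather than vLLM's actual code):

    import logging
    from importlib.metadata import entry_points

    logger = logging.getLogger(__name__)

    def load_io_processor_plugins(group: str = "vllm.io_processor_plugins") -> dict:
        plugins = {}
        for ep in entry_points(group=group):
            try:
                plugins[ep.name] = ep.load()
            except Exception:
                # Keep going, but record why the plugin failed to load
                # instead of silently swallowing the error.
                logger.warning("Failed to load plugin %s.", ep.name, exc_info=True)
        return plugins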
I think the plugin approach would make more sense in an online setting, because if you're using a vLLM image or an existing hosted vLLM server that is difficult to customize, then being able to load plugins makes a lot of sense. For the offline use case, since you're writing Python code and are responsible for deploying that code anyway, it doesn't make much of a difference: you can just do the pre- and post-processing in the code that calls LLM.encode().
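A rough sketch of that offline pattern, where the caller wraps LLM.encode() directly (the model name and the pre_process/post_process helpers are placeholders, not anything from this PR):

    from vllm import LLM

    # Hypothetical stand-ins for the pre/post processing a plugin would otherwise do.
    def pre_process(raw_input):
        # e.g. turn raw data (an image URL, a tile, ...) into one or more prompts
        return [str(raw_input)]

    def post_process(outputs):
        # e.g. aggregate the pooled outputs into the final result
        return [o.outputs.data for o in outputs]

    llm = LLM(model="<your-pooling-model>")  # placeholder model name

    raw = {"type": "url", "data": "https://example.com/tile.tiff"}
    result = post_process(llm.encode(pre_process(raw)))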
Hey @maxdebayser, thanks a lot for your comment. Totally agree that the offline case is not as useful as the online one: if you're writing the code, why not implement everything yourself. I have used it only to establish this plugin mechanism. I have the online version almost ready and I will push it soon. Regarding the plugins, yes, they might be tricky to support. However, I believe the entrypoint part of vLLM is pretty stable so far, at least not as fluid as the core. I like the idea of having this part as plugins because you can come up with new processing methods that can be applied to models without having to hack vLLM. Perhaps we could even control it dynamically with an env var, which would be quite nice in an online setting.
examples/offline_inference/prithvi_geospatial_mae_multimodal_processor.py
Outdated
I would move this to the example script as well, since it can't be used in online inference and only really makes sense in that context.
To avoid adding bloat to our library, I much prefer these types of wrappers to be the user's responsibility.
Yes, I see your point. I can move this to the examples.
vllm/inputs/data.py
Outdated
Same here
Based on the discussion from #14526, I think we can apply the processor inside the encode method specifically, while leaving the other methods unchanged. This would allow it to work in online inference as well.
This is basically why I suggested applying the plugin inside encode. That way you can use the encode endpoint with any plugin instead of having to create a new endpoint.
TBH I structured it like this after discussing with you in the other PR (#21621 (comment)). You were suggesting avoiding touching the async engine and instead putting everything in the entrypoint.
I started liking the entrypoint idea because, as models show up, we can start offering image endpoints instead of relying on the basic pooling one.
Also, if I do this in encode, I will have to initialize the plugin in the AsyncLLM init and therefore add an extra argument for starting an engine with or without such a data pre/post processor loaded.
Just making sure we are fully aligned before I try implementing it inside encode.
Oh, I was mainly concerned about the aggregation part, where multiple inputs correspond to one output. After looking at #14526, I feel that we still need a plugin system to postprocess the outputs (in a one-to-one manner). Sorry for the confusion!
I see. In my case the plugin is what splits the input into multiple prompts:
pre-process the input with the plugin -> generate 1+ prompts -> inference -> post-process 1+ outputs with the plugin -> 1 final output
Could we make one plugin system serve both purposes? In the end, a plugin could either aggregate multiple inference results or simply do a 1-to-1 mapping. Especially if we overhaul the PoolerOutput data structure, we should be able to cater for multiple use cases.
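A minimal sketch of a plugin shape that covers both cases discussed above (all names here are hypothetical, not the PR's actual API):

    from abc import ABC, abstractmethod
    from typing import Any, Sequence

    class IOProcessor(ABC):
        """Hypothetical interface: handles both 1-to-1 and many-to-one flows."""

        @abstractmethod
        def pre_process(self, request: Any) -> Sequence[Any]:
            """Turn one user request into one or more model prompts."""

        @abstractmethod
        def post_process(self, outputs: Sequence[Any]) -> Any:
            """Aggregate the per-prompt model outputs into a single response."""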
Since the existing methods assume a 1-to-1 correspondence between input and output, I think it would be better to define a new method (both in the LLM class and in online inference) for this flexibility. Maybe call it LLM.plugin_forward.
Are you suggesting adding something like plugin_encode or similar to (Async-)LLM? That can work for me. Would you still prefer the entrypoint to remain the same for the online mode or would you be in favor of having an endpoint dedicated to images as in this version?
A separate endpoint should also be used for online inference. Ideally, we can pass in a parameter to determine which plugin to use for output processing (in case more than one plugin is loaded).
This pull request has merge conflicts that must be resolved before it can be merged.
@DarkLight1337 new version here. I have added a method to LLM and AsyncLLM and added an entrypoint to the server. For now I like the idea that the entrypoint is specific to images, as people might want/need to support others in the future. Also, this allows keeping the data exchanged with the plugin a bit more structured.
Force-pushed from c608cfd to 7dc699f
Force-pushed from 1e57eb3 to a5fcab1
Hey @DarkLight1337, any comments on this new version? I have rebased onto the current master. Thanks.
Sorry for the delay, was caught up with other things. Left some comments.
tests/plugins/prithvi_io_processor_plugin/prithvi_io_processor/prithvi_processor.py
Outdated
Head branch was pushed to by a user without write access
The failing Blackwell test is not related.
Many thanks @maxdebayser for your contributions. The whole code is indeed more streamlined now. On to the next one!
This PR introduces the ability to pre/post-process the input/output of pooling models via an IO Processor plugin. This allows users to feed complex data structures to vLLM that are then parsed into prompts by the plugin, and to generate any type of data (e.g., images) from the model output.
Main features:
- encode_with_io_processor added to LLM and AsyncLLM for offline inference, and a new endpoint /io_processor_pooling for online serving.
- Plugins are registered via the vllm.io_processor_plugins entrypoint.
- The plugin to use is selected via the EngineArgs argument io_processor_plugin or via a custom field (io_processor_plugin) in the model configuration (config.json).

This is an alternative implementation to #21621. A rough usage sketch follows below.
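As a rough usage sketch based on the list above (the exact signature of encode_with_io_processor, the LLM keyword argument pass-through, and the input schema are assumptions rather than the final API):

    from vllm import LLM

    # Select the plugin via the engine argument named above (kwarg pass-through is assumed).
    llm = LLM(model="<pooling-model>", io_processor_plugin="<registered-plugin-name>")

    # The plugin defines what the input looks like; this dict shape is illustrative only.
    request = {"type": "url", "data": "https://example.com/tile.tiff"}
    result = llm.encode_with_io_processor(request)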
@DarkLight1337 @maxdebayser @mgazz
CLOSE #12249