[Core] Cast multimodal input in hf processor #18862
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small and essential subset of CI tests runs automatically to quickly catch errors. Once the PR is approved and ready to go, your PR reviewer(s) can run full CI to test the changes comprehensively before merging. To do so, PR reviewers can add the ready label to the PR or enable auto-merge. 🚀
Overall this approach does make more sense than what I originally did in #18756, thanks!
Force-pushed from b6d42a2 to a25576c.
Can you fix the failing tests? Looks like you need to import it.
Head branch was pushed to by a user without write access
Force-pushed from e61670b to 0292449.
Ah sorry, should be fixed in 0292449. Looks like the type annotation needs to be a string to support older versions of Python.
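For context, a minimal sketch of the string-annotation workaround (the function and names here are illustrative, not the actual code from this PR):

```python
import torch


# On Python < 3.10, evaluating `torch.Tensor | None` at function-definition
# time raises a TypeError, so the annotation is written as a string instead.
# It is then only inspected by type checkers, not evaluated at runtime.
def maybe_cast(x: "torch.Tensor | None") -> "torch.Tensor | None":
    return x.half() if x is not None else None
```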
Head branch was pushed to by a user without write access
Looks like not all items in the dict are tensors, which breaks CI. 9d0d47c should fix that.
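A small sketch of the guarded cast this implies (hypothetical helper, not the literal diff): only floating-point tensor values get converted, and everything else in the processor output is passed through untouched.

```python
import torch


def cast_floating_tensors(data: dict, dtype: torch.dtype) -> dict:
    """Cast only floating-point tensor values (e.g. pixel_values).

    Integer tensors such as input_ids and non-tensor items (lists, strings)
    would break or be corrupted by a blanket cast, so they are left as-is.
    """
    return {
        key: (value.to(dtype)
              if isinstance(value, torch.Tensor) and value.is_floating_point()
              else value)
        for key, value in data.items()
    }
```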
Looks like the V1 test is hanging in this PR, can you investigate it?
Force-pushed from 9d0d47c to b48017f.
I rebased onto main, let's see if this fixes it or gives a more actionable error message.
Looks like the v1-tests are still failing. I'm waiting for Hugging Face to approve my model access request, then I can try to reproduce locally.
Force-pushed from 95deab1 to b1344b4.
Looks like the deadlocking tests are due to the fact that vLLM uses fork as the default multiprocessing method, and the Hugging Face tokenizers already complain about not being compatible with fork.
On main this doesn't cause any deadlocks, but it looks like this change breaks it. I can reproduce this locally as well: running the tests individually works, whereas running them in one go deadlocks.
Setting a different multiprocessing method works around it, but I'm not sure how best to deal with this. I guess switching the default vLLM multiprocessing method might be a larger change.
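To illustrate the fork-vs-spawn difference independently of vLLM (a stdlib-only sketch, not the test setup itself):

```python
import multiprocessing as mp


def child() -> None:
    print("child is running")


if __name__ == "__main__":
    # "fork" copies the parent process, including any locks currently held by
    # background threads (e.g. from the tokenizers library), which is how a
    # child can end up deadlocked. "spawn" starts a fresh interpreter instead.
    ctx = mp.get_context("spawn")
    proc = ctx.Process(target=child)
    proc.start()
    proc.join()
```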
Why would this PR cause the tests to fail when they weren't failing before, though?
You're right, it's surprising, especially since we don't actually change where the Hugging Face preprocessor is called.
cc @njhill any idea?
The tokenizers warning is a red herring and shouldn't be an issue. I don't think we should change the mp method in the test as a workaround. If you can repro locally, you could check where things may be stuck by running with the env var.
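As a generic fallback for getting stack dumps out of a hung process (just a stdlib illustration, not the vLLM-specific env var mentioned above):

```python
import faulthandler
import signal

# Register once near process start-up; sending `kill -USR1 <pid>` to the hung
# process then prints every thread's current stack to stderr without killing it.
faulthandler.register(signal.SIGUSR1, all_threads=True)
```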
Here are the dumps from both processes:
Signed-off-by: Lukas Geiger <lukas.geiger94@gmail.com>
Force-pushed from b1344b4 to 5d5c4b4.
Looks like it's stuck in transformers here: huggingface/transformers@a31fa21#diff-d3478155ac25ae1107d16a4464001bfd54770bf12cd9bab881233a1b4d216e3fR240 🤔
This is a follow-up to #18756 that instead does the multimodal input casting directly as part of the Hugging Face preprocessing.
I think this is a bit cleaner and has two advantages: it moves the blocking cast/copy off the main thread, and it performs the conversion before serialisation, which also reduces the amount of data that needs to be serialised and deserialised. @DarkLight1337 Let me know if you see any disadvantages of doing this.
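Roughly, the idea looks like this (a sketch with hypothetical names, not the actual hook in vLLM's multimodal processing code): floating-point tensors are cast to the model dtype immediately after the Hugging Face processor runs, so the smaller tensors are what get serialised and sent across the process boundary.

```python
import torch


def run_hf_processor_and_cast(hf_processor, text, images, dtype=torch.float16):
    """Run the Hugging Face processor, then cast floating-point outputs (e.g.
    pixel_values) to the model dtype before the result is serialised; integer
    tensors such as input_ids are left untouched."""
    outputs = hf_processor(text=text, images=images, return_tensors="pt")
    return {
        key: (value.to(dtype)
              if isinstance(value, torch.Tensor) and value.is_floating_point()
              else value)
        for key, value in outputs.items()
    }
```

Casting pixel_values from float32 to float16 roughly halves their size, which is where the reduced serialisation cost comes from.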
Before:

![](https://github.com/user-attachments/assets/63a26c0f-1d92-44a7-a13b-57a4bfbd10e6)

After:

![](https://github.com/user-attachments/assets/0c3eea08-36b2-4a6e-82e7-2a1a4a3cd9fc)
I've done some quick benchmarks, and for some models I'm seeing a very small improvement in throughput (0.5%–0.7%).