[Performance]: InternVL multi-image speed is not improved compared to the original #9483
Could you elaborate more? What do you mean by the original speed?
Compared with torch (i.e. HF with flash-attn by default), on the same device and dtype (float16, V100): a single image is about 20% faster, while multiple images are slower. An A100 gives the same result.
Can you show the scripts you used to measure the performance of HF vs vLLM?
Hi, the test is based on the InternVL2-8B model. Have you tested vLLM's speed improvement on multiple images? I am not lying: multiple images are actually slower than torch. Due to an in-house issue, I didn't get a chance to paste the code here, but I think you can easily replicate the result.
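In the absence of the original script, a minimal harness along the following lines would make the comparison reproducible. `hf_generate` and `vllm_generate` are hypothetical zero-argument wrappers around one multi-image request per backend, using identical images, prompt, and max_new_tokens; they are not part of either library.

```python
import time
import torch

def timed(fn, warmup: int = 1, iters: int = 5) -> float:
    """Average wall-clock time per call, with CUDA synchronized so GPU work is counted."""
    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters

# Hypothetical wrappers around one identical multi-image request per backend:
# print(f"HF:   {timed(hf_generate):.3f} s/request")
# print(f"vLLM: {timed(vllm_generate):.3f} s/request")
```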
No, we have not tested the speed for multiple images (benchmarking work for multi-modal models is still in the early stages). Since vLLM was originally designed around language generation, most of vLLM's optimizations don't currently apply to the vision encoder part of the model, which may explain the decrease in speed when more images are passed. There may also be CPU bottlenecks associated with image preprocessing. We are still busy making multi-modal support feature-complete, so it may take a while before we can focus on optimization - any help is welcome!
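One way to check the CPU-preprocessing hypothesis is to time the preprocessing and generation stages separately. This is only a sketch: `preprocess_images` and `generate` are hypothetical stand-ins for the model's own preprocessing helper and its generate/chat call.

```python
import time
import torch

def split_timing(preprocess_images, generate, image_paths):
    # preprocess_images(paths) -> model inputs (CPU work: decode, tile, normalize)
    # generate(inputs)         -> text (GPU work: ViT encoder + LLM decoding)
    t0 = time.perf_counter()
    inputs = preprocess_images(image_paths)
    t1 = time.perf_counter()
    out = generate(inputs)
    torch.cuda.synchronize()  # make sure GPU work is finished before timing
    t2 = time.perf_counter()
    print(f"preprocess: {t1 - t0:.3f}s  generate: {t2 - t1:.3f}s")
    return out
```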
Hi, I am not an expert in acceleration, but as far as I understand, why can't the encoder-decoder use flash-attn?
How about using the flash-attn 2 package, or using torch's SDPA?
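As a point of reference, torch's `scaled_dot_product_attention` can be called directly and dispatches to a flash or memory-efficient kernel when the hardware and dtype allow. A minimal sketch follows; the shapes are arbitrary stand-ins, not InternVL's real dimensions.

```python
import torch
import torch.nn.functional as F

# (batch, heads, seq_len, head_dim) convention; sizes are arbitrary stand-ins.
q = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)
k = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)
v = torch.randn(1, 16, 1024, 64, device="cuda", dtype=torch.float16)

# SDPA picks a flash, memory-efficient, or math kernel automatically depending
# on the GPU and dtype (a V100 lacks FlashAttention support, so it falls back
# to the memory-efficient or math kernel there).
out = F.scaled_dot_product_attention(q, k, v)
```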
Same problem
When diving into the code, I found that InternVL2 uses xFormers' attention, not the naive one, so the slow speed may come from some other part.
vLLM installs xformers by default, so InternVL2 uses xFormers' attention. Although xFormers is slower than flash attention, the difference is not significant. I agree with @torinchen that "the slow speed may come from some other part."
Please submit test code to reproduce this issue. I can help locate the problem using a profiler.
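A minimal profiling sketch along these lines could narrow down where the time goes. `run_once` here is a hypothetical stand-in for one multi-image generation call (e.g. HF `model.chat(...)` or vLLM `llm.generate(...)`), not part of either library.

```python
import torch
from torch.profiler import profile, ProfilerActivity

def run_once():
    # Hypothetical stand-in: replace with one multi-image generation call,
    # e.g. HF model.chat(...) or vLLM llm.generate(...).
    x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    (x @ x).sum().item()

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             record_shapes=True) as prof:
    run_once()

# Sorting by CUDA time shows whether the ViT encoder, the language decoder,
# or CPU-side image preprocessing dominates the request.
print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=30))
```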
test.zip
Can this issue be reproduced using InternVL2-8B?
Yes, my model is just an SFT version of InternVL2-8B.
I can't reproduce this issue. Code: https://github.com/noooop/light-vllm/tree/main/benchmarks/InternVL2 (image preprocessing time is not included).

- transformers 4.37.2 + flash_attn 2.6.3, use_flash_attn=True
- transformers 4.45.2 + flash_attn 2.6.3: single-image single-round conversation 1.4907942284
- transformers 4.45.2 + vllm==v0.6.3.post1: single-image single-round conversation 1.367961298399996

I'm not sure if it's related to the slow speed of image preprocessing (#9238).
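For reference, a minimal offline-inference timing sketch for the multi-image case might look like the following. The image paths are placeholders, and the exact `<image>` placeholder layout in the prompt should follow vLLM's InternVL2 multi-image example for the installed version.

```python
import time
from PIL import Image
from vllm import LLM, SamplingParams

# Placeholder image paths; prompt layout follows vLLM's InternVL2 multi-image example.
images = [Image.open(p).convert("RGB") for p in ["a.jpg", "b.jpg"]]
prompt = ("<|im_start|>user\nImage-1: <image>\nImage-2: <image>\n"
          "Describe the two images.<|im_end|>\n<|im_start|>assistant\n")

llm = LLM(model="OpenGVLab/InternVL2-8B",
          trust_remote_code=True,
          limit_mm_per_prompt={"image": 2})
params = SamplingParams(temperature=0, max_tokens=256)

start = time.perf_counter()
outputs = llm.generate({"prompt": prompt,
                        "multi_modal_data": {"image": images}}, params)
print(f"{time.perf_counter() - start:.2f}s")
print(outputs[0].outputs[0].text)
```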
Even so, I think the speed-up is very limited and can be treated as not as fast as expected. I forgot my precise numbers, but from my impression, without streaming, the response from vLLM is not fast. My images are not big, just a normal 800-pixel maximum input.
For a single request, flash attention is already very fast; vLLM can only batch multiple requests to increase throughput.
Do you use openai.api_server or offline inference?
Try online mode; I saw the speed gap there.
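For the online path, a minimal latency check against vLLM's OpenAI-compatible server might look like this, assuming the server is already running (e.g. started with `vllm serve OpenGVLab/InternVL2-8B --trust-remote-code` and a multi-image limit configured); the image paths are placeholders.

```python
import base64
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    # Encode a local image as a base64 data URL for the chat completions API.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# Placeholder image paths; two images per request, matching the scenario above.
content = [{"type": "image_url", "image_url": {"url": to_data_url(p)}}
           for p in ["a.jpg", "b.jpg"]]
content.append({"type": "text", "text": "Describe both images."})

start = time.perf_counter()
resp = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[{"role": "user", "content": content}],
    max_tokens=256,
    temperature=0,
)
print(f"end-to-end latency: {time.perf_counter() - start:.2f}s")
print(resp.choices[0].message.content)
```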
Sorry, I'm not very familiar with the webserver part. Many issues have mentioned that image preprocessing is slow; I think it is more likely caused by that problem.
Your current environment
The output of `python collect_env.py`
Model Input Dumps
tt
🐛 Describe the bug
InternVL multi-image speed is slower than the original.