
[Performance]: InternVL multi image speed is not improved compare to original #9483

luohao123 opened this issue Oct 18, 2024 · 20 comments
Labels
help wanted (Extra attention is needed), performance (Performance-related issues)

Comments

@luohao123

Your current environment

The output of `python collect_env.py`
latest vllm 0.6.1

Model Input Dumps

tt

🐛 Describe the bug

InternVL multi-image inference is slower than the original (HF) implementation.

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
@luohao123 added the bug label on Oct 18, 2024
@DarkLight1337
Member

Could you elaborate more? What do you mean by the original speed?

@luohao123
Author

luohao123 commented Oct 18, 2024

Compared with torch on the same device and dtype (float16, V100); here "torch" means HF with flash-attn as the default backend.

A single image is about 20% faster, while multiple images are slower; an A100 gives the same result.

@DarkLight1337 added the performance label and removed the bug label on Oct 18, 2024
@DarkLight1337 changed the title from "[Bug]: InternVL multi image speed is not improved compare to original" to "[Performance]: InternVL multi image speed is not improved compare to original" on Oct 18, 2024
@DarkLight1337
Member

Can you show the scripts you used to measure the performance of HF vs vLLM?

@luohao123
Author

Hi, the test is based on the InternVL 8B model. Have you tested vLLM's speed improvement with multiple images? I am not lying; multiple images are actually slower than torch. Due to an in-house issue I haven't had a chance to paste the code here, but I think you can easily replicate the result.

@DarkLight1337
Member

DarkLight1337 commented Oct 18, 2024

Hi, the test is based on the InternVL 8B model. Have you tested vLLM's speed improvement with multiple images? I am not lying; multiple images are actually slower than torch. Due to an in-house issue I haven't had a chance to paste the code here, but I think you can easily replicate the result.

No, we have not tested the speed for multiple images (benchmarking work for multi-modal models is still in its early stages). Since vLLM was originally designed around language generation, most of vLLM's optimizations don't currently apply to the vision encoder part of the model, which may explain the slowdown as more images are passed. There may also be CPU bottlenecks in image preprocessing.

We are still busy making multi-modal support feature-complete, so it may take a while before we can focus on optimization - any help is welcome!

@DarkLight1337 added the help wanted label on Oct 18, 2024
@luohao123
Author

Hi, I am not an expert in acceleration, but as far as I can tell, why can't the encoder-decoder use flash-attn?

@luohao123
Author

How about using the flash-attn 2 package, or torch's built-in SDPA (scaled_dot_product_attention)?
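
For reference, a minimal sketch of steering torch's built-in SDPA onto its flash-attention backend (assumes PyTorch >= 2.3, which provides torch.nn.attention.sdpa_kernel; the flash kernels generally need Ampere-or-newer GPUs, so on a V100 that backend is usually unavailable and unconstrained SDPA falls back to its other kernels):

import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Dummy ViT-shaped tensors: (batch, heads, tokens, head_dim).
q = torch.randn(8, 16, 1025, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Restrict SDPA to the flash-attention kernel; this raises an error if the
# hardware/dtype/shape combination is not supported by that backend.
with sdpa_kernel(SDPBackend.FLASH_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v)

print(out.shape)  # torch.Size([8, 16, 1025, 64])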

@Jeremy-J-J

Same problem

@torinchen

noooop

When diving into the code, I found that InternVL2 uses xformers' attention, not the naive one, so the slow speed may come from some other part.

@torinchen

When diving into the code, I found that InternVL2 uses xformers' attention, not the naive one, so the slow speed may come from some other part.

@torinchen

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/intern_vit.py#L271C1-L285C17

The vLLM code is being optimized rapidly, but at least as of today (2024-10-29) it is still using scaled_dot_product_attention:

        # Excerpt from the SDPA attention path in
        # vllm/model_executor/models/intern_vit.py (linked above).
        if self.qk_normalization:
            B_, N_, H_, D_ = q.shape
            q = self.q_norm.forward_native(q.flatten(-2, -1)).view(B_, N_, H_, D_)
            k = self.k_norm.forward_native(k.flatten(-2, -1)).view(B_, N_, H_, D_)
        q = q.transpose(1, 2)
        k = k.transpose(1, 2)
        v = v.transpose(1, 2)

        # Attention is computed with torch SDPA rather than flash-attn.
        x = F.scaled_dot_product_attention(q, k, v, scale=self.scale)
        x = x.transpose(1, 2).view(B, N, -1)

        x = self.proj(x)
        return x

https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/intern_vit.py#L353C4-L371C76
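
For what it's worth, a rough micro-benchmark sketch comparing torch SDPA against xformers' memory_efficient_attention on ViT-like shapes (the shapes below are illustrative, not taken from the InternVL config; note that xformers expects (batch, tokens, heads, head_dim) while SDPA expects (batch, heads, tokens, head_dim)):

import time

import torch
import torch.nn.functional as F
import xformers.ops as xops

B, N, H, D = 16, 1025, 16, 64  # illustrative ViT-like shapes
q = torch.randn(B, N, H, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

def bench(fn, iters=50):
    # Warm up, then time with CUDA synchronization around the loop.
    for _ in range(5):
        fn()
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1e3  # ms per call

sdpa_ms = bench(lambda: F.scaled_dot_product_attention(
    q.transpose(1, 2), k.transpose(1, 2), v.transpose(1, 2)))
xf_ms = bench(lambda: xops.memory_efficient_attention(q, k, v))
print(f"SDPA: {sdpa_ms:.3f} ms  xformers: {xf_ms:.3f} ms")

On most GPUs the per-call gap between these kernels is small compared to end-to-end request latency, which is consistent with the slowdown coming from elsewhere.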

@noooop
Contributor

noooop commented Oct 29, 2024

When diving into the code, I found that InternVL2 uses xformers' attention, not the naive one, so the slow speed may come from some other part.

vLLM installs xformers by default, so InternVL2 uses xformers' attention.

Although xformers is slower than flash attention, the difference is not significant.

I agree with @torinchen that "the slow speed may come from some other part."

@noooop
Contributor

noooop commented Oct 29, 2024

Please submit test code to reproduce this issue.

I can help locate the problem using a profiler.
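
For example, a minimal multi-image timing sketch along these lines (assumes vLLM 0.6.x offline inference; the image paths are placeholders and the InternVL2 prompt template shown here is a guess that may need adjusting):

import time

from PIL import Image
from vllm import LLM, SamplingParams

# Placeholder image files; replace with real ones.
images = [Image.open("img1.jpg").convert("RGB"),
          Image.open("img2.jpg").convert("RGB")]

llm = LLM(
    model="OpenGVLab/InternVL2-8B",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": len(images)},
)
sampling = SamplingParams(temperature=0.0, max_tokens=128)

# One <image> placeholder per image; the exact template and stop tokens
# may differ between InternVL2 versions, so treat this as a sketch.
prompt = ("<|im_start|>user\n"
          "Image-1: <image>\nImage-2: <image>\n"
          "Describe the two images.<|im_end|>\n"
          "<|im_start|>assistant\n")

start = time.perf_counter()
outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    sampling_params=sampling,
)
print(f"latency: {time.perf_counter() - start:.2f} s")
print(outputs[0].outputs[0].text)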

@torinchen

torinchen commented Oct 29, 2024

test.zip
I tested both the online and offline modes; the gap is significant (under LMDeploy, the gap is zero, by the way).
Mail: chen.xin.mail@foxmail.com

@noooop
Contributor

noooop commented Oct 29, 2024

Can this issue be reproduced using InternVL2-8B?

https://huggingface.co/OpenGVLab/InternVL2-8B

@torinchen

Yes, my model is just an SFT version of InternVL2-8B.

@noooop
Contributor

noooop commented Oct 30, 2024

@torinchen @luohao123

I can't reproduce this issue.

Code: https://github.com/noooop/light-vllm/tree/main/benchmarks/InternVL2

Image preprocessing time is not included.

transformers 4.37.2 + flash_attn 2.6.3 (use_flash_attn=True)
single-image single-round conversation: 1.37133834199999
multi-image single-round conversation: 3.133497854799998

transformers 4.45.2 + flash_attn 2.6.3 (use_flash_attn=True)
single-image single-round conversation: 1.4907942284
multi-image single-round conversation: 3.1399439033000136

transformers 4.45.2 + vllm==v0.6.3.post1
single-image single-round conversation: 1.367961298399996
multi-image single-round conversation: 2.787156264600026

I'm not sure if it's related to the slow image preprocessing in #9238.

@luohao123
Author

luohao123 commented Oct 30, 2024

Even so, I think the speedup is very limited; it can be regarded as not as fast as expected.

I forget my precise numbers from before, but subjectively, without streaming, the response from vLLM is not fast.

My images are not big, just a normal 800-pixel maximum input.

@noooop
Contributor

noooop commented Oct 30, 2024

Even so, I think the speedup is very limited; it can be regarded as not as fast as expected.

For a single request, flash attention is already very fast.

vLLM can only batch multiple requests to increase throughput.
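
As a rough illustration of where that batching pays off, timing many requests submitted in one generate call rather than a single request (text-only prompts to keep the sketch short; the same pattern applies to multi-modal prompt dicts):

import time

from vllm import LLM, SamplingParams

llm = LLM(model="OpenGVLab/InternVL2-8B", trust_remote_code=True)
sampling = SamplingParams(temperature=0.0, max_tokens=64)

# With a single request, continuous batching has nothing to batch and
# latency looks similar to HF; throughput gains appear with many requests.
prompts = [f"Question {i}: describe a cat in one sentence." for i in range(64)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling_params=sampling)
elapsed = time.perf_counter() - start
print(f"{len(outputs)} requests in {elapsed:.2f} s "
      f"({len(outputs) / elapsed:.2f} req/s)")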

I forget my precise numbers from before, but subjectively, without streaming, the response from vLLM is not fast.

Do you use openai.api_server or offline inference?

@torinchen

I can't reproduce this issue.

Try online mode; I saw the speed gap there.

@noooop
Contributor

noooop commented Oct 31, 2024

Try online mode; I saw the speed gap there.

Sorry, I'm not very familiar with the webserver part.

Many issues have mentioned that image preprocessing is slow; I think the gap is more likely caused by that.
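
One way to narrow down the online-mode gap is to time the OpenAI-compatible endpoint directly and compare it with the offline numbers above. A rough sketch, assuming a server started with something like `vllm serve OpenGVLab/InternVL2-8B --trust-remote-code --limit-mm-per-prompt image=2` on the default port, and placeholder image files:

import base64
import time

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

def to_data_url(path: str) -> str:
    # Inline the image as a base64 data URL so no external hosting is needed.
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

content = [{"type": "text", "text": "Describe the two images."}]
for path in ("img1.jpg", "img2.jpg"):  # placeholder paths
    content.append({"type": "image_url",
                    "image_url": {"url": to_data_url(path)}})

start = time.perf_counter()
resp = client.chat.completions.create(
    model="OpenGVLab/InternVL2-8B",
    messages=[{"role": "user", "content": content}],
    temperature=0.0,
    max_tokens=128,
)
print(f"online latency: {time.perf_counter() - start:.2f} s")
print(resp.choices[0].message.content)

If the online latency is much higher than the offline measurement for the same inputs, the extra time is going into the serving path (request handling, image decoding, preprocessing) rather than the model forward pass.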
