[WIP][1/N] Chunked Prefill #3106

Closed · wants to merge 38 commits

Conversation

rkooo567 (Collaborator) commented Feb 29, 2024

WIP. RFC link: #3130

tdene commented Mar 1, 2024

Isn't the goal of this PR the same as that of #3121?

robertgshaw2-redhat (Collaborator) commented Mar 1, 2024

@rkooo567 this is an awesome initiative, thanks for kicking it off! Quick note on this PR.

There is already an attention kernel, context_attention_fwd, in vLLM that covers the case of prefill with KV cache. It was originally created for Manual Prefix Caching (and will be used for Automatic Prefix Caching as well). I have not looked too closely into the implementation of context_attention_fwd (so I'm not sure how its performance compares to the flash attention implementation), but I just wanted to make sure you were aware of it.

Perhaps we can leverage these kernels in the Automatic Prefix Caching diff.
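
For reference, here is a minimal PyTorch sketch of what "prefill with KV cache" attention computes conceptually: each query token of the new chunk attends over the cached prefix keys/values plus the chunk's own keys/values, with a causal mask inside the chunk. This is only an illustration, not the actual context_attention_fwd kernel or its signature; all names below are made up.

```python
import torch

def chunk_prefill_attention(q, k_new, v_new, k_cache, v_cache):
    """Attention for one prefill chunk over cached prefix + the chunk itself.

    Shapes (illustrative):
      q, k_new, v_new:   [chunk_len, num_heads, head_dim]
      k_cache, v_cache:  [ctx_len,   num_heads, head_dim]
    Returns [chunk_len, num_heads, head_dim].
    """
    ctx_len, chunk_len = k_cache.shape[0], q.shape[0]
    k = torch.cat([k_cache, k_new], dim=0)   # [ctx_len + chunk_len, H, D]
    v = torch.cat([v_cache, v_new], dim=0)
    scale = q.shape[-1] ** -0.5

    # Scores: [H, chunk_len, ctx_len + chunk_len]
    scores = torch.einsum("qhd,khd->hqk", q, k) * scale

    # Causal mask: the query at absolute position ctx_len + i may attend to
    # every cached token and to chunk tokens at positions <= ctx_len + i.
    q_pos = torch.arange(ctx_len, ctx_len + chunk_len, device=q.device)
    k_pos = torch.arange(ctx_len + chunk_len, device=q.device)
    not_allowed = k_pos[None, :] > q_pos[:, None]   # [chunk_len, total_len]
    scores = scores.masked_fill(not_allowed, float("-inf"))

    probs = torch.softmax(scores, dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v)


if __name__ == "__main__":
    H, D, ctx, chunk = 4, 64, 16, 8
    out = chunk_prefill_attention(
        torch.randn(chunk, H, D), torch.randn(chunk, H, D), torch.randn(chunk, H, D),
        torch.randn(ctx, H, D), torch.randn(ctx, H, D),
    )
    print(out.shape)  # torch.Size([8, 4, 64])
```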

rkooo567 (Collaborator, Author) commented Mar 1, 2024

> There is already an attention kernel, context_attention_fwd, in vLLM that covers the case of prefill with KV cache. It was originally created for Manual Prefix Caching (and will be used for Automatic Prefix Caching as well). I have not looked too closely into the implementation of context_attention_fwd (so I'm not sure how its performance compares to the flash attention implementation), but I just wanted to make sure you were aware of it.

Yes, I am aware of this. It should be possible to use this kernel as well, but our internal benchmarks show it is actually pretty slow (I don't know the details here, though).

robertgshaw2-redhat (Collaborator) commented:

Thanks @rkooo567

zhr2001 commented Apr 7, 2024

After trying the provided code, I think the implementation is not completely finished and still has some bugs.

rkooo567 (Collaborator, Author) commented Apr 7, 2024

@zhr2001 This PR is actually stale. Did you try #3884 (comment)?

rkooo567 closed this Apr 7, 2024
zhr2001 commented Apr 8, 2024

I have tried it. It's excellent, thanks for your contribution.
However, I only got 7.09 it/s for benchmark_serving, while without chunked prefill the result was 3.44 it/s (Llama-13B, A40, tp=2, with max-num-batched-tokens set, on the sonnet dataset). I wonder how to get the result shown in the picture below.
[screenshot: chunked prefill benchmark results]

rkooo567 (Collaborator, Author) commented:

@zhr2001 I am starting the benchmark tomorrow. By the way, what does 7.09 it/s for benchmark_serving mean (is that better or worse)? I am not familiar with this benchmark.

As for the result above, it was obtained with our internal repo (the implementation is almost the same, but our repo has CUDA graph for prefill, whereas the OSS repo currently doesn't). The environment basically sends requests of similar length at a certain QPS.
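
To illustrate that setup, here is a minimal sketch of sending similar-length requests at a fixed QPS with Poisson arrivals. send_request is a hypothetical placeholder, not part of vLLM's benchmark_serving script.

```python
import asyncio
import random

async def send_request(prompt: str) -> None:
    # Placeholder: a real benchmark would POST the prompt to the serving
    # endpoint here and record latency; this sketch only models the arrivals.
    await asyncio.sleep(0)

async def run_load(prompts, qps: float) -> None:
    """Fire off requests at roughly `qps` requests/second (Poisson arrivals)."""
    tasks = []
    for prompt in prompts:
        tasks.append(asyncio.create_task(send_request(prompt)))
        # Exponentially distributed inter-arrival times -> Poisson process at rate `qps`.
        await asyncio.sleep(random.expovariate(qps))
    await asyncio.gather(*tasks)

if __name__ == "__main__":
    # Requests of similar length, as described above (contents are dummy text).
    prompts = ["lorem ipsum " * 256] * 100
    asyncio.run(run_load(prompts, qps=4.0))
```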
