[WIP][1/N] Chunked Prefill #3106
Conversation
Isn't the goal of this PR the same as #3121?
@rkooo567 this is an awesome initiative, thanks for kicking it off! Quick note on this PR: there is already an attention kernel. Perhaps we can leverage these kernels in the Automatic Prefix Caching diff.
Yes! I am aware of this. I think it should be possible to use this kernel as well, but our internal benchmark shows it is actually pretty slow (I don't know the details, though).
Thanks @rkooo567
After trying the provided code, I think the implementation is not complete and still has some bugs.
@zhr2001 actually this is stale. Did you try #3884 (comment)?
@zhr2001 I am starting the benchmark tomorrow. Btw, what does … mean?

For the result above: it is the result with our internal repo (the implementation is almost the same, but our repo has a CUDA graph for prefill, whereas the OSS version currently doesn't). The environment is basically sending requests of similar length at a certain QPS.
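For anyone wanting to reproduce that kind of setup, here is a minimal sketch of a load driver that sends fixed-length prompts at a target QPS. The endpoint URL, payload shape, and parameter values are assumptions for illustration, not the actual benchmark harness used here.

```python
# Minimal sketch of the benchmark setup described above: send requests
# of similar length, paced at a fixed QPS, and report tail latency.
# SERVER_URL and the JSON payload are assumptions, not vLLM's real API.
import asyncio
import time

import aiohttp

SERVER_URL = "http://localhost:8000/generate"  # assumed endpoint
QPS = 4.0            # target request arrival rate
NUM_REQUESTS = 100
PROMPT_LEN = 512     # "similar length": every request uses the same prompt size


async def send_request(session: aiohttp.ClientSession, prompt: str) -> float:
    """Send one request and return its end-to-end latency in seconds."""
    start = time.perf_counter()
    async with session.post(
        SERVER_URL, json={"prompt": prompt, "max_tokens": 128}
    ) as resp:
        await resp.read()
    return time.perf_counter() - start


async def main() -> None:
    prompt = "hello " * PROMPT_LEN  # crude fixed-length prompt
    async with aiohttp.ClientSession() as session:
        tasks = []
        for _ in range(NUM_REQUESTS):
            tasks.append(asyncio.create_task(send_request(session, prompt)))
            await asyncio.sleep(1.0 / QPS)  # pace arrivals at the target QPS
        latencies = sorted(await asyncio.gather(*tasks))
    p50 = latencies[len(latencies) // 2]
    p99 = latencies[int(len(latencies) * 0.99)]
    print(f"p50={p50:.3f}s  p99={p99:.3f}s")


if __name__ == "__main__":
    asyncio.run(main())
```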
WIP. RFC link: #3130
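For anyone skimming before reading the RFC, here is a toy sketch of the core idea: split a long prompt's prefill into fixed-size chunks so each scheduler step stays under a token budget and decode steps of other requests can be interleaved instead of stalling behind one monolithic prefill. All names here (`Request`, `schedule_step`, `CHUNK_SIZE`) are hypothetical, not this PR's actual scheduler.

```python
# Toy illustration of chunked prefill scheduling (hypothetical names,
# for intuition only; not the implementation in this PR).
from dataclasses import dataclass

CHUNK_SIZE = 512  # max prefill tokens processed per scheduler step


@dataclass
class Request:
    prompt_tokens: list[int]
    processed: int = 0  # how many prompt tokens have been prefilled so far

    def next_chunk(self) -> list[int]:
        """Take the next CHUNK_SIZE-token slice of the prompt."""
        chunk = self.prompt_tokens[self.processed:self.processed + CHUNK_SIZE]
        self.processed += len(chunk)
        return chunk

    @property
    def prefill_done(self) -> bool:
        return self.processed >= len(self.prompt_tokens)


def schedule_step(prefilling: list[Request], decoding: list[Request]) -> None:
    """One step: at most one prefill chunk, plus one decode token per request."""
    if prefilling:
        req = prefilling[0]
        chunk = req.next_chunk()
        print(f"prefill chunk of {len(chunk)} tokens "
              f"({req.processed}/{len(req.prompt_tokens)})")
        if req.prefill_done:
            decoding.append(prefilling.pop(0))
    # Decode requests make progress every step instead of waiting for the
    # entire long prefill to finish.
    for _ in decoding:
        print("decode 1 token")


if __name__ == "__main__":
    long_req = Request(prompt_tokens=list(range(1200)))  # 1200 tokens -> 3 chunks
    running = Request(prompt_tokens=[0], processed=1)    # already decoding
    prefilling, decoding = [long_req], [running]
    for _ in range(4):
        schedule_step(prefilling, decoding)
```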