[Kernel][Core][WIP] Tree attention and parallel decoding #4325

yukavio · 2024-04-24T07:45:15Z

This PR implements tree attention, which will be used to accelerate speculative decoding and parallel decoding (sampling params with n larger than 1 and beam_search = false). Theoretically, the case where beam_search=True could also be supported, but it is not implemented in this PR.
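
For readers unfamiliar with the idea, here is a minimal sketch of the attention mask that tree attention generally relies on: each candidate token attends only to itself and its ancestors in the token tree, so several branches can share one attention call. This is not code from this PR. The function name `tree_attention_mask` and the parent-array representation are assumptions made purely for illustration, and the actual kernel in the PR may organize this differently.

```python
from typing import List


def tree_attention_mask(parents: List[int]) -> List[List[bool]]:
    """Return mask[i][j] == True if token i may attend to token j.

    Hypothetical helper for illustration only (not part of this PR).
    parents[i] is the index of token i's parent, or -1 for a root.
    Assumes parents always appear before their children (parents[i] < i).
    """
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i, p in enumerate(parents):
        if p >= i:
            raise ValueError("each parent must precede its child")
        # Every token attends to itself.
        mask[i][i] = True
        if p >= 0:
            # Inherit everything the parent can see (its ancestors and itself).
            for j in range(n):
                if mask[p][j]:
                    mask[i][j] = True
    return mask


# Token 0 is the root; tokens 1 and 2 are siblings; 3 extends 1; 4 extends 2.
mask = tree_attention_mask([-1, 0, 0, 1, 2])
for row in mask:
    print("".join("1" if allowed else "." for allowed in row))
```

In this example the two branches (0, 1, 3) and (0, 2, 4) never attend to each other, which is what lets sibling sequences be evaluated together in a single pass.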


mpjlu · 2024-05-30T07:00:52Z

@yukavio great work! Will you continue working on this PR?

yukavio · 2024-06-03T10:26:34Z

> @yukavio great work, will you continue work on this PR?

The kernel implemented in Triton has some performance issues. If a high speedup is required, the kernel may need to be reimplemented in CUDA. Unfortunately, I don't currently have the time to complete this task.

github-actions · 2024-10-29T02:00:39Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

mergify · 2024-10-29T02:01:18Z

This pull request has merge conflicts that must be resolved before it can be
merged. @yukavio please rebase it. https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

kavioyu added 7 commits April 16, 2024 19:13

merge different seqs in seqs group in to once attention inference without implement tree attention kernel

bdc863d

tree width = 2 have been tested, but there is error when width >2

dadbed1

temp

7596ad8

tested

1534c5c

fix early stop

a3849d1

fix code style

af98a27

fix bug

92ebde5

cadedaniel self-requested a review April 24, 2024 17:29

add duration check

ce76cc7

github-actions bot added the stale label Oct 29, 2024

mergify bot added the needs-rebase label Oct 29, 2024
