Commit
[Refactor] Formalize NHD/HND layout annotation (#85)
We support two layout annotations for the QKV matrices, NHD and HND (N: sequence length, H: number of heads, D: head dimension). The HND layout is beneficial when $D$ is small or the data type bit-width is small, in which case the consecutive length-$D$ vector in the NHD layout cannot fill a cache line. However, HND is not useful for the query matrix, because we access the query only once and pin its values in registers/smem. The natural layout of the query matrix is NHD, which is the direct output of $x \cdot W_q$, while the KV-Cache (either paged or ragged tensor) could have different layouts.

In this PR we formalize the use of NHD/HND layout annotations:
1. The query matrix always uses the NHD layout, so it needs no annotation.
2. The KV-Cache can have either NHD or HND layout; the user should specify which.
3. Layout annotations could be hidden from users in the Python APIs because they can be inferred from shape.

This PR also adds support for NHD paged-kv cache (before this PR, only HND paged-kv cache was supported).
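To make the cache-line argument concrete, here is a minimal pure-Python sketch of how the two layouts map an element (position `n`, head `h`, feature `d`) to a flat offset in a contiguous buffer. The names `KVLayout`, `flat_offset`, and `contiguous_run_bytes` are hypothetical and exist only for this illustration; they are not the identifiers used in this PR. The 128-byte cache line mentioned in the comments is likewise an assumption.

```python
from enum import Enum


class KVLayout(Enum):
    """Hypothetical layout tag for a KV tensor (illustrative only)."""
    NHD = "NHD"  # shape (seq_len, num_heads, head_dim)
    HND = "HND"  # shape (num_heads, seq_len, head_dim)


def flat_offset(layout, n, h, d, seq_len, num_heads, head_dim):
    """Element offset of (position n, head h, feature d) in a contiguous buffer."""
    if layout is KVLayout.NHD:
        return (n * num_heads + h) * head_dim + d
    return (h * seq_len + n) * head_dim + d


def contiguous_run_bytes(layout, seq_len, head_dim, itemsize):
    """Bytes a single head can read contiguously before its addresses jump."""
    if layout is KVLayout.NHD:
        # Only one length-D vector is contiguous per head.
        return head_dim * itemsize
    # All positions of one head are stored back to back.
    return seq_len * head_dim * itemsize


# Example: D = 64 with a 1-byte data type, 8 heads, 16 positions.
seq_len, num_heads, head_dim, itemsize = 16, 8, 64, 1
nhd_run = contiguous_run_bytes(KVLayout.NHD, seq_len, head_dim, itemsize)
hnd_run = contiguous_run_bytes(KVLayout.HND, seq_len, head_dim, itemsize)
# Under the assumed 128-byte cache line, the NHD run (64 bytes) fills
# only half a line, while the HND run (1024 bytes) spans several full lines.
print(nhd_run, hnd_run)
```

In NHD, moving to the next position for the same head skips `num_heads * head_dim` elements, whereas in HND it skips only `head_dim`, which is why HND helps when $D$ or the element size is small.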
Showing 28 changed files with 720 additions and 527 deletions.