Description
Following the optimization in #13493, I realized that defragmentation can be improved considerably, which would in turn further improve Flash Attention masking.
Currently, we defragment the cache like this:
# before defrag
00000000...11111.......2222222....2010212012012....
# after defrag
000000001111122222222010212012012..................
I.e. we only "fill" the holes, but the sequences remain scattered. We can do better like this:
# new defrag
000000000000111111111222222222222..................
By doing so, the FA-vec masking logic will remain effective even after many generations.
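To make the difference concrete, here is a minimal sketch of both strategies on the string notation used above. It is a toy model, not the actual llama.cpp implementation: the function names (`defrag_fill_holes`, `defrag_group_by_seq`, `count_segments`) are hypothetical, each cell is assumed to belong to exactly one sequence identified by a single character, and the current defrag is modeled as an order-preserving compaction, which is what the illustration shows.

```python
FREE = "."  # marker for an unused cell, as in the diagrams above


def defrag_fill_holes(cells: str) -> str:
    """Model of the current behavior: move used cells to the front,
    keeping their relative order, so only the holes disappear."""
    used = [c for c in cells if c != FREE]
    return "".join(used) + FREE * (len(cells) - len(used))


def defrag_group_by_seq(cells: str) -> str:
    """Model of the proposed behavior: move used cells to the front
    AND group them by sequence id (assumed: one id per cell)."""
    used = sorted(c for c in cells if c != FREE)
    return "".join(used) + FREE * (len(cells) - len(used))


def count_segments(cells: str) -> int:
    """Number of maximal runs of identical sequence ids among used cells.
    A fully grouped cache has exactly one run per sequence."""
    segments = 0
    prev = None
    for c in cells:
        if c != FREE and c != prev:
            segments += 1
        prev = c
    return segments


if __name__ == "__main__":
    before = "00000000...11111.......2222222....2010212012012...."
    for name, fn in [("fill holes", defrag_fill_holes),
                     ("group by seq", defrag_group_by_seq)]:
        after = fn(before)
        print(f"{name:>12}: {after}  segments={count_segments(after)}")
```

In this model the grouped layout always ends up with one contiguous segment per sequence, while the hole-filling layout keeps every interleaved run from the original cache.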