I think recomputation just gives the same KV states that would have been saved when using window attention. So what is the difference between the recompute and cached versions of the sliding window?
Or is it because, no matter what position embedding we use, the LLM just learns to assign a large attention value to the first index?
Let's say the total input is 1024 tokens and the KV cache size is 512. When generating the next token, recomputation drops the initial 512 tokens entirely and computes representations over only the most recent 512 tokens, as if those were the whole input. With a cached sliding window, on the other hand, there is still a dependency on the first 512 tokens. Like this: the first 512 representations are trivial. For the 513th token, the 1st token's representations are dropped, but the remaining 511 cached representations were computed with the 1st token still in context. The same applies to every subsequent token. Hope that's clear.
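To make the dependency difference concrete, here is a toy sketch (no real attention math, just tracking which original token positions each KV entry transitively depends on; the `WINDOW` of 4 is chosen purely for readability, standing in for the 512 above):

```python
WINDOW = 4  # toy cache size, standing in for 512 in the example above

def cached_sliding_window_deps(n_tokens, window=WINDOW):
    """Each new entry depends on itself plus everything currently cached.
    Old entries are evicted, but their influence survives inside the
    entries that were computed while they were still in the cache."""
    cache = []  # one dependency set per cached KV entry
    for t in range(n_tokens):
        deps = {t}
        for entry in cache:
            deps |= entry
        cache.append(deps)
        if len(cache) > window:
            cache.pop(0)  # evict the oldest KV, but not its influence
    return cache

def recompute_deps(n_tokens, window=WINDOW):
    """Recompute from scratch over only the last `window` tokens:
    nothing outside the window can influence the result."""
    start = max(0, n_tokens - window)
    cache = []
    for t in range(start, n_tokens):
        deps = {t}
        for entry in cache:
            deps |= entry
        cache.append(deps)
    return cache

print(sorted(cached_sliding_window_deps(6)[-1]))  # [0, 1, 2, 3, 4, 5]: token 0 still matters
print(sorted(recompute_deps(6)[-1]))              # [2, 3, 4, 5]: prefix fully dropped
```

With the cached window, the latest representation still (indirectly) depends on every token ever seen; with recomputation, it depends only on the tokens inside the window.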
I also have a question about this. Assume we use the recomputation-based sliding window: shouldn't its perplexity be similar to that of the regular sliding window, since either way the initial tokens are eventually evicted?
You can treat recomputation as a new round of "prefill": for each newly generated token, you prefill the context inside the window again, so the starting token in the window does not rely on any earlier tokens. With a cached sliding window, it does.
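A minimal loop illustrating this "prefill every step" framing (`fake_forward` is a placeholder of my own for a real model call, not anything from the paper's code; only the shape of the loop matters):

```python
def fake_forward(ctx):
    # Stands in for model(ctx) -> next token id; returns a dummy value.
    return sum(ctx) % 100

def generate_with_recompute(tokens, window, n_new):
    for _ in range(n_new):
        ctx = tokens[-window:]        # everything outside the window is gone
        next_tok = fake_forward(ctx)  # fresh prefill over the window: no KV
                                      # is reused, and ctx[0] sees no history
        tokens.append(next_tok)
    return tokens

print(generate_with_recompute(list(range(1024)), window=512, n_new=3)[-3:])
```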