For LLMs already trained with window attention and BOS token #1
Comments
👍 for Mistral
I think that this should still work. I'm running some experiments using these Attention Sinks here: https://github.com/tomaarsen/attention_sinks

```python
from attention_sinks import AutoModel

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
```

I hope to add support for Mistral in the coming days. I'm quite excited about this line of work - my personal experiments match the findings from the paper! There's some information in #5.
I've found that @tomaarsen conducted an evaluation of StreamingLLM against the window-attention baseline using Mistral-7B. It appears that a model trained with sliding-window attention still requires attention sinks for streaming. For more details, please see this reference: Attention Sinks in Transformers for Endless Fluent Generation.

As for why StreamingLLM surpasses dense attention when the input length is within both the cache size and the pre-training length, we're still investigating. One hypothesis is that LLMs may not fully leverage the extensive context provided to them; in some instances, a shorter context can actually enhance their performance. For further insights, please refer to the "Lost-in-the-Middle" paper and Table 6 in our paper.

Thank you,
Hi, thanks for sharing this impressive work!
Same question after carefully reading the paper. Any explanations or references that elaborate on this would be appreciated!
I provided a detailed explanation in #33 (comment). Please let me know if you have further questions!
Truly helpful, thanks!
Nice work!
I am wondering whether this attention sink magic is still needed for LLMs that have already been trained with window attention (e.g. Mistral). While I am curious about this, I still think the attention sink is the better approach, since it can be applied to almost any LLM, whether or not it was trained with window attention.

In particular, for Llama, or more generally for LLMs with a BOS token, the attention sink can be viewed as a soft version of hard truncation of the farthest tokens: the sink token behaves very much like the BOS token, and the position ids are also properly reorganized (a rough sketch of how I picture this is below). This makes me further wonder whether the attention sink would work well in long-context scenarios (e.g. LongEval). Although StreamEval seems to test long-context modeling ability, I did not get why StreamingLLM can outperform dense attention when the context length lies between the cache size and the pretraining length.
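To make concrete what I mean by the reorganized position ids, here is a minimal sketch of my understanding of the cache policy (the function name and the `n_sink`/`window` defaults are my own illustrative choices, not the repository's actual code):

```python
# Hypothetical sketch of the attention-sink eviction policy as I understand it:
# keep the first few "sink" tokens (playing a role much like the BOS token)
# plus the most recent window, and assign position ids within the cache
# rather than keeping the original absolute positions.

def evict_with_sinks(cache_token_ids, n_sink=4, window=1020):
    """Return the token ids to keep and the position ids assigned to them."""
    if len(cache_token_ids) <= n_sink + window:
        kept = list(cache_token_ids)
    else:
        kept = list(cache_token_ids[:n_sink]) + list(cache_token_ids[-window:])
    # Positions are re-assigned within the cache, so the model never sees a
    # position index larger than n_sink + window, however long the stream runs.
    position_ids = list(range(len(kept)))
    return kept, position_ids


if __name__ == "__main__":
    stream = list(range(2000))            # pretend these are 2000 token ids
    kept, pos = evict_with_sinks(stream)
    print(len(kept), pos[:5], pos[-3:])   # 1024 [0, 1, 2, 3, 4] [1021, 1022, 1023]
```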
BTW, I am not very certain what window attention with re-computation does, or why it works.
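For reference, this is my rough mental model of that baseline (a hypothetical greedy-decoding sketch using ordinary Hugging Face model/tokenizer objects, not the paper's implementation): for every new token, the most recent window of tokens is re-encoded from scratch with an empty cache, which keeps the positions small and the output fluent, but makes each step roughly quadratic in the window size.

```python
import torch

@torch.no_grad()
def generate_with_recompute(model, tokenizer, prompt, max_new_tokens=50, window=1024):
    # Sliding window with re-computation: no KV states are reused across steps.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        recent = input_ids[:, -window:]          # only the latest `window` tokens
        logits = model(recent).logits            # fresh forward pass, empty cache
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```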