For LLMs already trained with window attention and BOS token #1
Comments
👍 for Mistral
I think that this should still work. I'm running some experiments using these Attention Sinks here: https://github.com/tomaarsen/attention_sinks

```python
from attention_sinks import AutoModel

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")
```

I hope to add support for Mistral in the coming days. I'm quite excited about this line of work - my personal experiments match the findings from the paper! There's some information in #5.
I've found that @tomaarsen conducted an evaluation of StreamingLLM against the window-attention baseline using Mistral-7B. It appears that a model trained with sliding-window attention still requires attention sinks for streaming. For more details, please see this reference: Attention Sinks in Transformers for Endless Fluent Generation.

As for why StreamingLLM surpasses dense attention when the input length is within both the cache size and the pre-training length, we're still investigating. One hypothesis is that LLMs may not fully leverage the extensive context provided to them; in some instances, a shorter context can actually enhance their performance. For further insights, please refer to the "Lost-in-the-Middle" paper and Table 6 in our paper.

Thank you,
Hi, thanks for sharing this impressive work!
Same question after carefully reading the paper. Any explanations or references that elaborate on this would be appreciated!
I provided a detailed explanation in #33 (comment). Please let me know if you have further questions!
Truly helpful, thanks!
Nice work!
I am wondering whether this attention sink magic is still needed for LLMs that have already been trained with window attention (e.g. Mistral). While I am curious about this, I still think the attention sink is the better approach, since it can be applied to almost any LLM, whether or not it was trained with window attention.

In particular, for Llama, or more generally for LLMs with a BOS token, the attention sink can be viewed as a soft version of hard truncation of the farthest tokens: the sink token behaves very much like the BOS token, and the position ids are also properly reorganized (a rough sketch of how I picture this is below). This makes me further wonder whether the attention sink would work well in long-context scenarios (e.g. LongEval). Although StreamEval seems to test long-context modeling ability, I did not get why StreamingLLM can outperform dense attention when the context length lies between the cache size and the pretraining length.
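To make concrete what I mean by the reorganized position ids, here is a minimal sketch of my understanding of the cache policy (the function name and the `n_sink`/`window` defaults are my own illustrative choices, not the repository's actual code):

```python
# Hypothetical sketch of the attention-sink eviction policy as I understand it:
# keep the first few "sink" tokens (playing a role much like the BOS token)
# plus the most recent window, and assign position ids within the cache
# rather than keeping the original absolute positions.

def evict_with_sinks(cache_token_ids, n_sink=4, window=1020):
    """Return the token ids to keep and the position ids assigned to them."""
    if len(cache_token_ids) <= n_sink + window:
        kept = list(cache_token_ids)
    else:
        kept = list(cache_token_ids[:n_sink]) + list(cache_token_ids[-window:])
    # Positions are re-assigned within the cache, so the model never sees a
    # position index larger than n_sink + window, however long the stream runs.
    position_ids = list(range(len(kept)))
    return kept, position_ids


if __name__ == "__main__":
    stream = list(range(2000))            # pretend these are 2000 token ids
    kept, pos = evict_with_sinks(stream)
    print(len(kept), pos[:5], pos[-3:])   # 1024 [0, 1, 2, 3, 4] [1021, 1022, 1023]
```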
BTW, I am not very certain what window attention with re-computation does, or why it works.
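For reference, this is my rough mental model of that baseline (a hypothetical greedy-decoding sketch using ordinary Hugging Face model/tokenizer objects, not the paper's implementation): for every new token, the most recent window of tokens is re-encoded from scratch with an empty cache, which keeps the positions small and the output fluent, but makes each step roughly quadratic in the window size.

```python
import torch

@torch.no_grad()
def generate_with_recompute(model, tokenizer, prompt, max_new_tokens=50, window=1024):
    # Sliding window with re-computation: no KV states are reused across steps.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        recent = input_ids[:, -window:]          # only the latest `window` tokens
        logits = model(recent).logits            # fresh forward pass, empty cache
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return tokenizer.decode(input_ids[0], skip_special_tokens=True)
```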