Hello!
This work is looking extremely promising! Great job. I wanted to notify you that I've created a project for a drop-in replacement of `transformers` using the attention sinks approach. For example, loading Llama 2 with attention sinks is now as simple as the snippet below. If you're interested, you can check out the repository here: https://github.com/tomaarsen/attention_sinks
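A minimal sketch of what that drop-in usage could look like, assuming the package mirrors the `transformers` `AutoModelForCausalLM.from_pretrained` API; the model ID and keyword arguments here are illustrative assumptions rather than taken from this issue:

```python
# Hypothetical drop-in usage: swap the `transformers` import for `attention_sinks`.
# The model name and kwargs below are illustrative assumptions.
from attention_sinks import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
)
```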
I ran some experiments over there using the `attention_sinks` Python module, and was able to get some extremely promising results, much like your paper:

- `transformers`: Linear VRAM usage, as it doesn't do any windowing. Performance fails after ~4096 tokens.
- `window_attention`: Constant VRAM usage due to the windowing at 1024 tokens. Fails after ~1024 tokens.
- `attention_sinks`: Constant VRAM usage due to windowing at 4 attention sink tokens + 1020 most recent tokens. Never fails despite constant VRAM usage (see the cache-policy sketch below).
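To make the "4 attention sink tokens + 1020 most recent tokens" policy concrete, here is a generic sketch of that eviction rule, not the `attention_sinks` package's internal API; the function name and defaults are assumptions chosen to match the numbers above:

```python
# Generic sketch of the windowing policy described above: always keep the first
# `sink_size` token positions ("attention sinks") plus the `window_size` most
# recent positions in the KV cache, evicting everything in between.
from typing import List


def trim_kv_cache(cache_positions: List[int], sink_size: int = 4, window_size: int = 1020) -> List[int]:
    """Return the token positions that stay in the KV cache."""
    if len(cache_positions) <= sink_size + window_size:
        return cache_positions  # nothing to evict yet
    return cache_positions[:sink_size] + cache_positions[-window_size:]


# Example: after 4096 generated tokens, only 4 + 1020 = 1024 positions remain
# cached, which is why VRAM usage stays constant.
print(len(trim_kv_cache(list(range(4096)))))  # -> 1024
```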