
Python Module as a drop-in replacement for transformers using Attention Sinks #5

Open · tomaarsen opened this issue Oct 2, 2023 · 1 comment


@tomaarsen (Contributor)

Hello!

This work is looking extremely promising! Great job. I wanted to let you know that I've created a project that acts as a drop-in replacement for transformers using the attention sinks approach. For example, loading Llama 2 with attention sinks is now as simple as:

from attention_sinks import AutoModel

model = AutoModel.from_pretrained("meta-llama/Llama-2-7b-hf", device_map="auto")

If you're interested, you can check out the repository here: https://github.com/tomaarsen/attention_sinks
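
Since the returned object behaves like a normal transformers model, the rest of the pipeline stays the same. Here's a minimal sketch of what generation looks like, assuming the package also mirrors transformers' AutoModelForCausalLM class (the prompt and generation settings are just placeholders):

from transformers import AutoTokenizer
from attention_sinks import AutoModelForCausalLM  # assumed to mirror transformers' Auto classes

# Load the model with attention sinks enabled; same signature as transformers.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

# Standard transformers generation; attention sinks only change how the
# KV cache is pruned, not the generation API.
inputs = tokenizer("Attention sinks let language models", return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))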

I ran some experiments over there using the attention_sinks Python module and got some extremely promising results, much like those reported in your paper:

  1. transformers: Linear VRAM usage, as it doesn't do any windowing. Performance fails after ~4096 tokens.
  2. window_attention: Constant VRAM usage due to windowing at 1024 tokens. Fails after ~1024 tokens.
  3. attention_sinks: Constant VRAM usage due to windowing with 4 attention sink tokens + the 1020 most recent tokens. Never fails despite the constant VRAM usage (a configuration sketch follows just below).
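
For reference, the sink size and window size in setup 3 are just load-time settings. A rough sketch of that configuration, with the caveat that the keyword names shown here are illustrative and may differ from the exact API in the repository README:

from attention_sinks import AutoModel

# Setup 3 above: keep 4 "sink" tokens from the start of the sequence plus a
# sliding window of the 1020 most recent tokens, so the KV cache (and VRAM)
# stays constant while the model keeps working at long sequence lengths.
# Keyword names below are assumptions; check the repository README for the exact API.
model = AutoModel.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    attention_sink_size=4,            # initial tokens that are never evicted
    attention_sink_window_size=1020,  # most recent tokens kept in the cache
)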

[Figure llama_2_7b_ppl_vram_old: perplexity and VRAM usage of Llama 2 7B under the three setups]

  • Tom Aarsen
@Guangxuan-Xiao (Collaborator) commented Oct 3, 2023

Great implementation! I'm looking forward to integrating our work into Huggingface transformers!
