Integrate FlashAttention into HF OPT #18439
query_states_fast = torch.nested_tensor(torch.unbind(query_states, dim=0))
key_states_fast = torch.nested_tensor(torch.unbind(key_states_fast, dim=0))
value_states_fast = torch.nested_tensor(torch.unbind(value_states_fast, dim=0))
won't this result in padding within the nested tensors?
i.e. if query_states is a padded rectangular tensor, calling unbind on it will produce sequences padded to the same length, so we won't be taking advantage of nested tensors to reduce padding. Or am I missing something?
I'm assuming inputs with no padding for now. This is a hack just to turn those tensors into NestedTensors, because FlashAttention SDP currently requires NestedTensor inputs. If FlashAttn SDP supported regular tensors, I would just remove these lines entirely.
gotcha, thanks for the clarification :)
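The padding concern raised above can be illustrated with a minimal sketch. Plain Python lists stand in for tensors here, and the `PAD` value, the `lengths` list, and the helper names `unbind` and `strip_padding` are hypothetical, chosen only for illustration. Splitting a padded rectangular batch along the batch dimension keeps every row at the padded length, so a ragged layout only saves work if the padding is trimmed first.

```python
# Plain lists stand in for tensors (an assumption for illustration only).
PAD = 0  # hypothetical padding token id


def unbind(batch):
    """Split a rectangular batch into its rows, similar in spirit to unbinding along dim 0."""
    return list(batch)


def strip_padding(batch, lengths):
    """Trim each row to its true length so the result is ragged."""
    return [row[:n] for row, n in zip(batch, lengths)]


padded = [
    [5, 6, 7, 8],
    [9, 1, PAD, PAD],
    [4, PAD, PAD, PAD],
]
lengths = [4, 2, 1]  # true (unpadded) sequence lengths

rows = unbind(padded)                    # every row still has the padded length
ragged = strip_padding(padded, lengths)  # rows now differ in length

tokens_unbind = sum(len(r) for r in rows)
tokens_ragged = sum(len(r) for r in ragged)
```

With inputs that carry no padding, as assumed in the reply above, the two results coincide and the conversion is only a format change.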
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread. Please note that issues that do not follow the contributing guidelines are likely to be ignored.
Hi, are there any updates? Coming from https://github.com/HazyResearch/flash-attention/blob/main/usage.md
Looking forward to the update!
Thanks @erichan1! I will check it out.
@erichan1 Could you explain why work on this feature stopped? I think it would be a great addition to the transformers library. Edit: Is it the case that flash attention is now enabled by default in recent versions of torch? If so, I would recommend a Hugging Face blog article to advertise this feature and explain how it works. Currently, documentation on flash-attention support is rather lacking.
Within the Hugging Face ecosystem, it's possible to use BetterTransformer and the optimum library to improve model performance: [1], [2]. @younesbelkada Is flash attention available yet through this?
@amyeroberts @vincentmin I'm from the PyTorch team. We decided that the best way to provide FlashAttention was to create a new module that covers just the component FlashAttention implements, Scaled Dot Product Attention (SDP). This is the part that computes softmax(Q @ K^T / sqrt(d)) @ V, and it doesn't include the input and output projections. Since we built this abstraction, we also decided that we could use it to offer some other implementations of SDP, including a memory-efficient one that we've built in-house, which uses less memory than FlashAttn but is slower. You can use SDP directly by replacing the necessary chunk of code in your transformer definition. But I'm unsure about a way to use it with a flag you flip in HuggingFace. I'll let @younesbelkada speak to that. I believe BetterTransformer and SDP (which is part of BetterTransformer) support is already part of Optimum.
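For reference, the scaled dot product attention computation described above is usually written out in full as softmax(Q K^T / sqrt(d)) V. The sketch below is a plain-Python reference for that formula only, assuming a single head, no mask, and no dropout. It is not the fused FlashAttention kernel and not PyTorch's implementation; in PyTorch 2.0 and later the corresponding function is `torch.nn.functional.scaled_dot_product_attention`.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Reference SDP: q is [L][d], k is [S][d], v is [S][dv]; returns [L][dv].

    Single head, no mask, no dropout (simplifying assumptions).
    """
    d = len(q[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q_row in q:
        # Scaled similarity of this query against every key.
        scores = [scale * sum(a * b for a, b in zip(q_row, k_row)) for k_row in k]
        weights = softmax(scores)
        # Weighted average of the value rows.
        out.append([
            sum(w * v_row[j] for w, v_row in zip(weights, v))
            for j in range(len(v[0]))
        ])
    return out


# A zero query scores every key equally, so the output is the mean of the values.
uniform = scaled_dot_product_attention(
    [[0.0, 0.0]],
    [[1.0, 0.0], [0.0, 1.0]],
    [[1.0, 2.0], [3.0, 4.0]],
)
```

The input and output projections mentioned above sit outside this function: they are applied before Q, K, and V are formed and after the attention output is produced.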
@erichan1 @amyeroberts Thank you for the clarifications. I now understand that BetterTransformer should offer the features I am looking for. I encourage you to write a blog post on Hugging Face to advertise this to the world!
Hi @erichan1 @amyeroberts @vincentmin
Hi, any recent updates on this blogpost for
Hi @KatarinaYuan
Thank you!
On Jun 14, 2023, at 3:33 AM, Younes Belkada wrote:
Hi @KatarinaYuan
Yes the blogpost is out and is here: https://pytorch.org/blog/out-of-the-box-acceleration/
I am using the transformers Trainer with FSDP options for LLaMA training. The model cannot be saved, and I am unable to use bettertransformer.reverse() to convert it back to the original model. I don't know how to deal with this problem.
Are there any updates on the integration of FlashAttention into HuggingFace Transformers?
@EwoutH model = model.to_bettertransformer(). Check the blog post: https://pytorch.org/blog/out-of-the-box-acceleration/ for reference. cc @fxmarty as well
Is BetterTransformer up to date with FlashAttention v2?
Hi, BetterTransformer integrates with PyTorch SDPA (for now), and PyTorch has not integrated FlashAttention v2 yet: pytorch/pytorch#105602. Hopefully it will be there in PyTorch 2.1.
Integrate FlashAttention.