4D `attention_mask` support #27539

poedator · 2023-11-16T14:08:45Z

This is implementation for feature request from #27493 custom 4d attention_mask as transformers .forward() argument.

Allowing 4d attention masks to pass thru _prepare_4d_causal_attention_mask() intact
support in OPT (need to build custom positions tensor)
support in Llama (while Llama can accept custom position_ids, I added code to generate them internally)

The benefits of the code are to enable more memory-efficient text generation with tree-based parallel decoding as described in SpecInfer paper

Tagging:
@gante (generate)
@patrickvonplaten (masks)
@younesbelkada @ArthurZucker (text models)

This PR is WiP:

Will add tests
Need advice on how to handle models beyond covered Llama and OPT
May add example for memory-efficient generation

IMPORTANT: this PR makes changes that can only used by few classes of models
requirements to use:

have position_ids argument in .forward() method
use modeling_attn_mask_utils.py::_prepare_4d_attention_mask() function for 4d mask generation

as of 20.12.2023, only a handful (under 20) of transformers model classes meet these criteria. Most of these classes are multimodal, which may require their own use cases for 4D masks. The pure language modelling classes fit to use the 4D mask changes from this PR are only LlamaModel, FalconModel and XGLMModel.

patrickvonplaten · 2023-11-20T10:57:35Z

Generally, I don't have a problem with allowing to pass 4D attention masks! @poedator can you explain your use case a little bit for why you want to pass 4d attention masks?

poedator · 2023-11-20T21:00:59Z

@patrickvonplaten
here is a use example:
Suppose one does beam search and has a starting prefix with tokens 11 22 33 in 4 beams. Now he needs to check candidates with tokens 44, 55, 66, and 77. Present code would pack the beams into a batch of shape (4, 4):

11 22 33  44
11 22 33  55
11 22 33  66
11 22 33  77

and run it with mask of all ones, passing such mask in 2D which gets expanded internally to 4D.

The proposed way would be to have a batch shaped (1, 7):
11 22 33 44 55 66 77
and the 4d mask would have a shape (1, 1, 4, 7) and look like this:

1  1  1  1  0  0  0 
1  1  1  0  1  0  0 
1  1  1  0  0  1  0 
1  1  1  0  0  0  1

with a positions tensor of [0, 1, 2, 3, 3, 3, 3]

At subsequent beam search iterations the mask will reflect which past tokens should the new tokens attend to.
Such mask needs to pass intact.
This saves memory for past_key_values cache and thus allows beam search and other similar inference (like SpecInfer) of longer sequences with limited VRAM.

Another use case is kindly proposed by @UniverseFly below.

UniverseFly · 2023-11-22T18:00:06Z

Very interesting PR! Would this feature also enable SFT packing as mentioned in huggingface/trl#805?

poedator · 2023-11-22T18:12:37Z

Very interesting PR! Would this feature also enable SFT packing as mentioned in huggingface/trl#805?
Sure it would. Just have a separate packing function somewhere - it is beyond the scope of this PR.
Besides, one should be able to pack multiple series of sequences into a batch this way.

UniverseFly · 2023-11-22T18:46:21Z

I tried this branch and the model.forward seems to work fairly well, but model.generate raises errors with the 4D attention mask (with Llama). After some checking, it might be due to the missing logic here:

https://github.com/huggingface/transformers/blob/53a7e7750ff088ffbd7d96c5aeed122cc96b6866/src/transformers/models/llama/modeling_llama.py#L1087-L1124

poedator · 2023-11-22T22:04:16Z

Generate looks like a harder challenge for your methods - each individual sequence will be expanding, thus you'd need to reorder past_kv and mask at each step. I believe that to implement it, you'd need to write custom prepare_inputs_for_generation(), and possibly some more logic.
I'll be happy to test drive it.
On my side I intend to write a PR for more efficient beam search after this PR merges.

patrickvonplaten

Generally the PR looks good to me! (We'd need some tests here).

@ArthurZucker wdyt?

ArthurZucker

Looks alright! but there should not be changes to the forward of the models (IMO)

ArthurZucker · 2023-11-27T14:41:43Z

src/transformers/models/llama/modeling_llama.py

+            if attention_mask is not None and len(attention_mask.shape) == 4:
+                # assumes 4D mask for efficient beam search
+                token_positions = torch.cumsum(attention_mask, dim=-1).amax(dim=(1, 2))
+                used_tokens_mask = attention_mask.amax(dim=(1, 2))
+                position_ids = (token_positions * used_tokens_mask).long() - 1
+                position_ids = position_ids[:, past_key_values_length:]


this logic should not go here, it's should go in the prepare inputs for generation, as it's purely specific to 4d beam search.

I agree, that this should be limited to just the mask code. Makes this PR more manageable. Llama can work without that, since it can accept position_ids argument. Hopefully the newer models will support this argument. (could HF make it a part of some model guidelines?)

poedator · 2023-11-30T20:33:03Z

Hi, @ArthurZucker
I limited this PR only to the mask code, proceeding with the tests.

So far I have demo in Colab with monkey patch based on this PR. It shows a negligible difference in logits obtained the old and new ways. I dent to believe that this is a rounding error somewhere. Would you support it as the basis for the tests?
BTW, where to put this new test?

Hi, @UniverseFly ,
Try the monkey patch from the Colab notebook - see if it works to implement your idea.

KexinFeng · 2023-12-04T12:25:43Z

Thanks for this PR and the demo. It is very helpful in trying the SpecInfer paper. Also in another recent progress on speculative decoding look ahead decoding Fig 5, this PR will also be useful.

ArthurZucker · 2023-12-05T14:28:17Z

Reviewing now 😉

ArthurZucker

LGTM, the test should go in the

transformers/tests/test_modeling_utils.py

Line 1481 in 8eae5ea

class AttentionMaskTester(unittest.TestCase):

poedator · 2023-12-10T08:42:14Z

squashed all earlier commits into one
added tests. Made a separate class to test with full model loading.
added support for sdpa (following F.scaled_dot_product_attention support #26572)
test_modeling_utils.py::AttentionMaskTester and ::TestAttentionImplementation tests pass
new tests pass

@ArthurZucker, please review. Hopefully it is ready to merge.

ArthurZucker

Thanks, just a few testing nits and good to go

tests/test_modeling_utils.py

ArthurZucker · 2023-12-10T16:37:07Z

tests/test_modeling_utils.py

+        self.device = torch.device("cuda:0")
+        model_name = "JackFram/llama-160m"  # small Llama-like model from FlexFlow
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
+        self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32).to(self.device)


Suggested change

self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32).to(self.device)

self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(self.device)

the smaller the better for our CI

I observed that fp16 tests are more noisy, so what I did is:

retained fp32 testsm but used even smaller model

added fp16 test with relaxed tolerances

added fp16 testing option for the top tokens order.

tests/test_modeling_utils.py

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

poedator · 2023-12-11T17:37:41Z

@ArthurZucker, pls give me a hint about NameError: name 'torch' is not defined error. Apparently a decorator or import is missing, but can't figure it out. The import and decorators seem in place...

poedator · 2023-12-18T07:38:29Z

@ArthurZucker , would you want to publish a blog post in HF blog with 4d attention use cases?
I propose to include:

memory efficient beam search (my example, from tests)
SFT packing as mentioned in Packing in SFT trl#805, suggested by @UniverseFly
look ahead decoding, suggested by @KexinFeng

ArthurZucker · 2023-12-18T08:07:40Z

If you want feel free to do so! 🤗

PhilJd · 2023-12-19T15:44:51Z

Note that not all paths of this can be torch.compiled:

The following fails due to torch.all(attention_mask == 1).

import torch
import torch.nn as nn
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask_for_sdpa

class Model(nn.Module):
    def forward(self, inputs_embeds):
        batch_size, seq_length, _ = inputs_embeds.shape
        past_key_values_length = 10
        attention_mask = torch.tensor([1.])
        attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
            attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
        )
        return attention_mask

model = Model()
model = torch.compile(model, fullgraph=True)
model(torch.ones([1,5, 32]))

poedator · 2023-12-19T16:16:28Z

@PhilJd,
torch.all(attention_mask == 1) was present even before this PR.
see this line
it comes form #26572

have you tested the preceding commit?

PhilJd · 2023-12-19T17:03:58Z

Ah sorry, just looked at the blame - yeah, the previous commit fails as well @fxmarty .

shentianxiao · 2023-12-19T18:20:38Z

_prepare_4d_causal_attention_mask is applied only if self._use_flash_attention_2 is False (https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1039). Is it because 4D attention mask and flash attention 2 are not compatible?

shentianxiao · 2023-12-19T18:32:06Z

The function description should be updated to avoid confusion as attention_mask is not necessarily 2D now (https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_attn_mask_utils.py#L290)

poedator · 2023-12-19T21:36:35Z

@shentianxiao , thank you for your attention to the 4D attention!

_prepare_4d_causal_attention_mask is applied only if self._use_flash_attention_2 is False (https://github.com/huggingface/transformers/blob/main/src/transformers/models/llama/modeling_llama.py#L1039). Is it because 4D attention mask and flash attention 2 are not compatible?

it is not about compatibility, rather the flash_attention_2 code contrasted original mask vs modified mask coming from _prepare_4d_causal_attention_mask()

The function description should be updated to avoid confusion as attention_mask is not necessarily 2D now (https://github.com/huggingface/transformers/blob/main/src/transformers/modeling_attn_mask_utils.py#L290)

I agree, that the original mask may also be 4d-shaped now. I just started PR #28151 with documentation updates - will make edits there. Hopefully the maintainers responsible for flash_attention_2 will verify it.

poedator · 2023-12-20T12:50:53Z

IMPORTANT: this PR makes changes that can only used by few classes of models
requirements to use:

have position_ids argument in .forward() method
use modeling_attn_mask_utils.py::_prepare_4d_attention_mask() function for 4d mask generation

as of 20.12.2023, only a handful (under 20) of transformers model classes meet these criteria. Most of these classes are multimodal, which may require their own use cases for 4D masks. The pure language modelling classes fit to use the 4D mask changes from this PR are only LlamaModel, FalconModel and XGLMModel.

poedator · 2024-01-10T19:20:02Z

I made a small blog post based on this PR.
https://huggingface.co/blog/poedator/4d-masks
Big thanks to everyone who contributed and commented!

jpgard · 2024-02-18T18:25:48Z

Thanks for the amazing addition!! This is a great new feature.

Just wanted to ask a question to make sure I am using it properly. In the code here, it looks like the 4D masks are expected to have shape [batch_size, 1, seq_len, seq_len]. (I am inferring that the 1 in the expected_shape is the heads dimension so that the same mask is broadcast to all heads.) In the blog post, it describes the attention masks as having shape [heads, batch_size, input_ids_length, total_sequence_length].

My question is: are the heads and batch_size dimensions transposed in the blog post? It seems like we are actually expected to provide 4D masks where the first axis is batch size, the second is heads. The blog post implies the reverse. Since I am sometimes using a batch size of 1 in testing, this works either way, but I want to use it correctly and don't see the "proper" shape documented anywhere (perhaps it is documented somewhere and I missed it!).

Thanks!

poedator · 2024-02-19T15:32:47Z

@jpgard ,
you are correct, there was an error in my blog post.
Changed it to [batch_size, heads, input_ids_length, total_sequence_length]
thank you for raising this!

jpgard · 2024-02-19T20:03:36Z

Great, thanks for the quick reply and for your hard work on this @poedator !!

jpgard · 2024-02-19T20:39:17Z

Has this been tested with flash attention 2? Works great for me without flash attention 2, but when using flash attention I get lots of messages of the form ../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [202,0,0], thread: [105,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

Lower chunk of the stack trace posted below.

 File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 798, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 549, in forward
    attn_output = self._flash_attention_forward(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 592, in _flash_attention_forward
    query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/transformers/models/llama/modeling_llama.py", line 631, in _upad_input
    query_layer = index_first_axis(
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/torch/autograd/function.py", line 553, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/admin/home-jpgard/miniconda3/envs/rtfm/lib/python3.8/site-packages/flash_attn/bert_padding.py", line 17, in forward
    return torch.gather(
RuntimeError: CUDA error: device-side assert triggered

Would be great to be able to use FA2 with this PR as the speedups are much larger as sequence length grows -- so FA2 seems like the perfect accompaniment to e.g. "packed" training sequences enabled by this PR.

poedator · 2024-02-19T21:48:32Z

@jpgard , please share some simple testing code. I will look into this issue.

DylanZSZ · 2024-12-10T06:59:05Z

Hello! I have a few questions about using custom 4D attention masks for sequence packing:

Does the custom 4D attention mask need to explicitly encode causality (e.g., enforcing a lower-triangular structure for autoregressive tasks)?
For the custom 4D attention mask, is the correct format (batch_size, 1, seq_len, seq_len), where valid attention positions are marked with 1, and invalid ones with 0?
Additionally, I noticed that enabling sequence packing with a custom 4D attention tensor results in a lower average loss compared to the non-packed scenario, even when I explicitly enforce lower-diagonality in the mask. Could you clarify why this might be the case?

Thanks in advance for your help!

poedator · 2024-12-10T14:16:41Z

Hi, @DylanZSZ ,

Does the custom 4D attention mask need to explicitly encode causality
yes, it should have the lower triangular (or more complex for special cases) structure for causality.

where valid attention positions are marked with 1, and invalid ones with 0?
actually use 0 for valid and -inf for invalid. This is how the masks are applied in transformers after some internal transformations.

enabling sequence packing with a custom 4D attention tensor results in a lower average loss compared to the non-packed scenario, even when I explicitly enforce lower-diagonality in the mask. Could you clarify why this might be the case?
Hard to tell, it should be same in theory. suggestions: (1) make sure you use 0 and -inf, as above, (2) try with some simpler examples and follow the attention in regular case and with custom mask. (3) check if padding is done properly.

poedator marked this pull request as ready for review November 16, 2023 19:12

ArthurZucker mentioned this pull request Nov 22, 2023

Allow passing 2D attention mask #27640

Open

patrickvonplaten reviewed Nov 27, 2023

View reviewed changes

ArthurZucker reviewed Nov 27, 2023

View reviewed changes

poedator force-pushed the mask4 branch from 53a7e77 to 204b4b8 Compare November 30, 2023 20:22

ArthurZucker reviewed Dec 5, 2023

View reviewed changes

poedator force-pushed the mask4 branch from 020e46b to 2d75ee3 Compare December 9, 2023 23:44

poedator added 3 commits December 10, 2023 10:48

edits to _prepare_4d_causal_attention_mask()

75dff5b

initial tests for 4d mask

d9bb95c

attention_mask_for_sdpa support

2fb7de9

poedator force-pushed the mask4 branch from 2d75ee3 to 2fb7de9 Compare December 10, 2023 08:04

poedator added 3 commits December 10, 2023 11:19

added test for inner model hidden

ba809c4

added autotest decorators

4001b65

test mask dtype to torch.int64

8aa580b

ArthurZucker reviewed Dec 10, 2023

View reviewed changes

poedator and others added 4 commits December 11, 2023 10:26

torch.testing.assert_close

6dd3104

Co-authored-by: Arthur <48595927+ArthurZucker@users.noreply.github.com>

torch_device and @torch_gpu in tests

2b51f1c

upd tests

51f22ab

+torch decorators

3bf35b7

poedator mentioned this pull request Dec 19, 2023

4D mask documentation updates #28151

Closed

This was referenced Dec 23, 2023

feature request: support 4D attention masks flexflow/flexflow-train#1250

Open

feature request: support 4d attention masks OpenNMT/CTranslate2#1594

Open

ArthurZucker mentioned this pull request Jan 11, 2024

feat: Sequential beam search(a.k.a Low-memory beam search) #26304

Merged

5 tasks

poedator mentioned this pull request Mar 7, 2024

custom 4d attention masks broken by #28937 #29525

Closed

KexinFeng mentioned this pull request Apr 21, 2024

Any plans to support tree attention mask? Dao-AILab/flash-attention#924

Open

KexinFeng mentioned this pull request May 24, 2024

[Speculative Decoding] Medusa Implementation with Top-1 proposer vllm-project/vllm#4978

Merged

JackCai1206 mentioned this pull request Jul 19, 2024

Using Trainer + a pretrained tokenizer + 4D attention mask is extremely slow #32101

Open

4 tasks

littletomatodonkey mentioned this pull request Jul 21, 2024

[Feature]: 4D Attention Mask vllm-project/vllm#6615

Closed

jpgard mentioned this pull request Jul 24, 2024

"inverted" form required for 4D masking not defined / 4D attention masks breaks with transformers >=4.40 #32195

Closed

4 tasks

XiangTodayEatsWhat mentioned this pull request Oct 12, 2024

Does FA2 support 4D attention mask? Dao-AILab/flash-attention#1274

Open

poedator deleted the mask4 branch January 14, 2025 13:46

	self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32).to(self.device)
	self.model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).to(self.device)

4D attention_mask support #27539

4D attention_mask support #27539

Uh oh!

Conversation

poedator commented Nov 16, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

patrickvonplaten commented Nov 20, 2023

Uh oh!

poedator commented Nov 20, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

UniverseFly commented Nov 22, 2023

Uh oh!

poedator commented Nov 22, 2023

Uh oh!

UniverseFly commented Nov 22, 2023

Uh oh!

poedator commented Nov 22, 2023

Uh oh!

patrickvonplaten left a comment

Choose a reason for hiding this comment

Uh oh!

ArthurZucker left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ArthurZucker Nov 27, 2023

Choose a reason for hiding this comment

Uh oh!

poedator Nov 30, 2023

Choose a reason for hiding this comment

Uh oh!

poedator commented Nov 30, 2023

Uh oh!

KexinFeng commented Dec 4, 2023

Uh oh!

ArthurZucker commented Dec 5, 2023

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

poedator commented Dec 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ArthurZucker left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

ArthurZucker Dec 10, 2023

Choose a reason for hiding this comment

Uh oh!

poedator Dec 11, 2023

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

poedator commented Dec 11, 2023

Uh oh!

poedator commented Dec 18, 2023

Uh oh!

ArthurZucker commented Dec 18, 2023

Uh oh!

PhilJd commented Dec 19, 2023

Uh oh!

poedator commented Dec 19, 2023

Uh oh!

PhilJd commented Dec 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

shentianxiao commented Dec 19, 2023

Uh oh!

shentianxiao commented Dec 19, 2023

Uh oh!

poedator commented Dec 19, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

poedator commented Dec 20, 2023

Uh oh!

4D `attention_mask` support #27539

4D `attention_mask` support #27539

poedator commented Nov 16, 2023 •

edited

Loading

poedator commented Nov 20, 2023 •

edited

Loading

ArthurZucker left a comment •

edited

Loading

poedator commented Dec 10, 2023 •

edited

Loading

PhilJd commented Dec 19, 2023 •

edited

Loading

poedator commented Dec 19, 2023 •

edited

Loading

jpgard commented Feb 18, 2024 •

edited

Loading

jpgard commented Feb 19, 2024 •

edited

Loading