🚨🚨[core] Completely rewrite the masking logic for all attentions #37866
Conversation
Hi 👋, thank you for opening this pull request! The pull request is converted to draft by default. The CI will be paused while the PR is in draft mode. When it is ready for review, please click the "Ready for review" button.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Force-pushed 53ca556 to ce42aa7
ArthurZucker left a comment:
Looks very, very nice!
One thing I want to consider is to rather call the sliding, causal and chunked variants directly in the modeling.
For example:
- llama only needs `causal_mask`; under the hood, the causal mask should do an AND with sdpa, flash or flex
- gemma needs `sliding_causal`: same
- llama4 needs chunked causal
I want the modeling to call an explicit function, rather than the mega-general one!
This would keep our philosophy, as we don't want too-general stuff happening when not needed (e.g. llama should never care about sliding in its code paths).
Also missing: docs about how to add a new func!
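A rough sketch of what these explicit per-pattern entry points could look like at the modeling level; the helper names follow the ones discussed above, but the import path and exact signatures are assumptions on my part, not the final API:

```python
# Sketch only: each model calls the one explicit mask helper it needs; the helper is expected
# to handle the backend-specific format (sdpa / flash / flex) internally. Signatures assumed.
from transformers.masking_utils import create_causal_mask, create_sliding_window_causal_mask


def llama_style_mask(config, input_embeds, attention_mask, cache_position, past_key_values):
    # A llama-style model only ever needs the plain causal mask.
    return create_causal_mask(
        config=config,
        input_embeds=input_embeds,
        attention_mask=attention_mask,
        cache_position=cache_position,
        past_key_values=past_key_values,
    )


def gemma_style_mask(config, input_embeds, attention_mask, cache_position, past_key_values):
    # A gemma-style model needs the sliding-window variant instead; same call shape.
    return create_sliding_window_causal_mask(
        config=config,
        input_embeds=input_embeds,
        attention_mask=attention_mask,
        cache_position=cache_position,
        past_key_values=past_key_values,
    )
```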
Wow!!!!!!!! 🚀 This PR seems worth a manual full CI run. Ping me when you think this PR is ready to trigger CI.
ArthurZucker left a comment:
Damn nice
ArthurZucker left a comment:
Review for the core logic: IMO it can be simplified! But the modeling part is absolutely perfect!
For the visualization, I'll see how we could just overwrite the repr without affecting other operations!
vasqu left a comment:
Just a quick question on this refactor: If I understand the code correctly, then the focus is currently on causal masks only, correct?
It would be nice to add a non-causal alternative which only uses a padding mask and expands to the q_len and kv_len respectively. That's more food for thought :D I don't want to make this PR even harder than it is.
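For illustration, a padding-only (non-causal) expansion of that kind is easy to sketch in plain PyTorch; this is just the idea, not anything from the PR:

```python
import torch


def padding_only_mask(padding_mask: torch.Tensor, q_len: int) -> torch.Tensor:
    """Expand a 2D padding mask (batch, kv_len) into a 4D boolean mask (batch, 1, q_len, kv_len):
    every query position may attend to every non-padded key position, with no causal constraint."""
    batch_size, kv_len = padding_mask.shape
    mask = padding_mask[:, None, None, :].to(torch.bool)  # (batch, 1, 1, kv_len)
    return mask.expand(batch_size, 1, q_len, kv_len)


# Example: batch of 2, kv_len of 4, second sequence has one padded position.
padding = torch.tensor([[1, 1, 1, 1], [1, 1, 1, 0]])
print(padding_only_mask(padding, q_len=3).shape)  # torch.Size([2, 1, 3, 4])
```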
Force-pushed 0b6bbe5 to 7fc4f91
For now it's mostly on causal masks because they are the ones we need, but the idea is that it can be extended super easily from a set of mask primitives!
ArthurZucker left a comment:
Mega nice!
TODO before merging:
- move the `causal_mask_mapping` to a class attribute!
- show an example of how to register a new function, but minimal (without sdpa correction for example)
- make sure full graph training is not broken maybe? or at least fa2 training
That should be it!
Force-pushed 28e232c to 5170e9d
Force-pushed dee568c to 4a2e906
ArthurZucker left a comment:
Let's go!
```python
def my_new_sdpa_mask(*args, **kwargs):
    print("I just entered the attention mask computation")
    return sdpa_mask(*args, **kwargs)
```
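For context, a function like this would presumably then be registered so models can select it by name, along these lines (the registry class name is an assumption on my part, mirroring the existing attention interface, and `sdpa_mask` would come from the masking utilities):

```python
# Assumption: registration goes through an attention-mask registry analogous to AttentionInterface.
from transformers import AttentionMaskInterface

AttentionMaskInterface.register("my_new_sdpa_mask", my_new_sdpa_mask)
```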
Let's rather show how to do something like the PaliGemma or document masking here, something relevant!
Those are a bit different: that's modifying the mask pattern, vs adding a new mask format for the attention itself (both are complementary).
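As a rough sketch of the "modifying the mask pattern" side (document masking expressed as a mask callable in the (batch_idx, head_idx, q_idx, kv_idx) style; purely illustrative, not this PR's API):

```python
import torch


def make_document_mask_function(document_ids: torch.Tensor):
    # Mask callable in the (batch_idx, head_idx, q_idx, kv_idx) style: a token may only
    # attend to tokens belonging to the same packed document.
    def document_mask(batch_idx, head_idx, q_idx, kv_idx):
        return document_ids[batch_idx, q_idx] == document_ids[batch_idx, kv_idx]
    return document_mask


# Example: one sequence packing two documents of 3 and 2 tokens.
doc_ids = torch.tensor([[0, 0, 0, 1, 1]])
mask_fn = make_document_mask_function(doc_ids)
print(mask_fn(0, 0, torch.tensor(1), torch.tensor(4)))  # tensor(False): different documents
print(mask_fn(0, 0, torch.tensor(3), torch.tensor(4)))  # tensor(True): same document
```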
Hi @Cyrilvallez, I noticed that after this PR, calling `prepare_inputs_for_generation` no longer returns a plain tensor for the attention mask. Reproducer:

```python
from transformers import AutoModelForCausalLM
from transformers.cache_utils import HybridCache
import torch

model_id = 'hf-internal-testing/tiny-random-Gemma3ForCausalLM'
model = AutoModelForCausalLM.from_pretrained(model_id)
inputs = torch.arange(6).view(2, 3)
attention_mask = torch.ones_like(inputs)
# cache is required, w/o cache a tensor is returned as expected
cache = HybridCache(model.config, max_batch_size=2, max_cache_len=3)
model_kwargs = model.prepare_inputs_for_generation(
    inputs, attention_mask=attention_mask, past_key_values=cache, cache_position=torch.arange(3)
)
mask = model_kwargs['attention_mask']
assert isinstance(mask, torch.Tensor), f"expected attention mask to be tensor, got {mask}"
```

Before the PR, this assertion passed.
```diff
 if not hasattr(model.config, "layer_types"):
     # If `layer_types` is not specified explicitly in the config, there is only 1 type of layers, so
     # export will use `StaticCache` by default.
     logging.info("Using `StaticCache` for export as `layer_types` is not specified in the config.")
     self.model = TorchExportableModuleWithStaticCache(model)
 else:
-    if model.config.cache_implementation == "hybrid":
-        self.model = TorchExportableModuleWithHybridCache(model, max_batch_size, max_cache_len)
-    else:
-        raise ValueError(
-            f"Unsupported cache implementation: {model.config.cache_implementation}. "
-            "Please use `hybrid` or `static`."
-        )
+    self.model = TorchExportableModuleWithHybridCache(model, max_batch_size, max_cache_len)
```
@Cyrilvallez What is `layer_types`? I'm concerned about whether the changes here are backwards compatible. For existing models on the Hub like google/gemma-3-1b, the config doesn't seem to come with `layer_types`, so it will fall back to the static cache, which doesn't look correct.
One comment I have is that the way mask calculation is incorporated in most models is that it happens at the model level, e.g. here: https://github.com/huggingface/transformers/blob/main/src/transformers/models/gemma3/modeling_gemma3.py#L565-L566. However, different cache implementations may imply different attention masks. Different layers may have different cache implementations; for example, some layers can have sliding windows of different sizes, while others may use an attention sink to keep, say, the first few tokens. I feel the best place for the custom mask is the attention layer, so that said layer can pass all the information, including the KV cache, to the custom mask function (e.g. the layer index).
Hey, sorry all, I was on vacation!

@BenjaminBossan indeed, this is expected. Models using different types of attention in different layers (i.e. gemma3) will now have a dict returned by `prepare_inputs_for_generation` (one dict entry per attention type).

@guangy10 if you look at the configs, e.g. here, you'll see that the attribute was added in a BC manner for all models that were refactored! Let me know if you notice any issue though!

@kimishpatel In transformers, Caches are not at the layer level, so as of now only some configurations are acceptable (though I've had in mind to change that for some time, to make it more modular). And computing the mask at the attention-layer level is not only redundant (most layers will create the same mask, wasting precious time), but it breaks compile completely, as we cannot pre-compute the masks anymore. For now, there are no known models with sliding windows of different sizes for different layers, so we decided to make it as simple as possible. This was taken into account when doing this refactor though, no worries; we definitely thought about making it scale easily in the future should this scenario happen.
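A toy illustration of the pre-compute-once argument: one mask per attention type is built up front, and the only per-layer work is a dict lookup (shapes and window size here are arbitrary, not library code):

```python
import torch

# One pre-computed mask per attention type; each layer just looks up the mask for its type.
layer_types = ["sliding_attention", "full_attention", "sliding_attention"]

full_mask = torch.ones(1, 1, 4, 4, dtype=torch.bool).tril()              # standard causal
sliding_mask = full_mask & ~torch.ones(4, 4, dtype=torch.bool).tril(-2)  # causal, window of 2

causal_mask_mapping = {"full_attention": full_mask, "sliding_attention": sliding_mask}

for i, layer_type in enumerate(layer_types):
    mask = causal_mask_mapping[layer_type]  # no per-layer mask computation
    print(f"layer {i} ({layer_type}): {int(mask.sum())} visible positions")
```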
Resolves CI errors such as this one: https://github.com/huggingface/peft/actions/runs/15481482956/job/43588020111#step:5:53182

After resolving that error, other errors can occur, but they're unrelated and investigated independently.

After the transformers change in huggingface/transformers#37866, it can happen that:

> Models using different types of attention in different layers (i.e. gemma3) will now have a dict returned by `prepare_inputs_for_generation` (one dict entry per attention type)

As PEFT operates on the attention mask for prompt learning methods, we need to adjust the code for the possibility of the attention_mask being a dict. Right now, I simply extract the single value if the dict has just one element. For other sizes, I just raise an error, as I don't know how to deal with that. For our tests, this is enough, but we might need to find a better solution in the future.
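A minimal sketch of the kind of handling that commit describes (the helper name is hypothetical, not PEFT's actual code):

```python
def unwrap_attention_mask(attention_mask):
    # Hypothetical helper: after huggingface/transformers#37866, hybrid models may hand back
    # a dict of masks (one entry per attention type) instead of a single tensor.
    if isinstance(attention_mask, dict):
        if len(attention_mask) == 1:
            return next(iter(attention_mask.values()))
        raise ValueError(
            f"Cannot handle an attention mask dict with {len(attention_mask)} entries."
        )
    return attention_mask
```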
To be fair, I doubt attention mask calculation has that much impact on performance for most models. I have implemented a ring-buffer-based KV cache that needs a very different way of calculating the mask, and that mask calculation, while redundant, happens at the attention layer. I have not observed any significant amount of time spent there. Although for the block mask in flex attention, I think you might be right. That one is non-trivial.
How so? I do understand though that transformers is not exactly providing building blocks for model authoring, so from that perspective composability and modularity have limited value, I suppose.
@Cyrilvallez Let's follow up in #38646
## Purpose ##
* Fix tracing for model definitions introduced as part of `transformers==4.53`
* Resolves #1603

## Background ##
In the latest transformers release, this change landed which changed the name of the function which generates the causal mask: huggingface/transformers#37866

## Changes ##
* Extend the list of function names to ignore during tracing, specifically targeting functions which create causal masks
* Update debugger tool to use ignore list from `DatasetArguments`
* Update Tracer to skip masking functions as part of autowrapping any functions which were not caught by the autowrapper

## Testing ##
* `tests/llmcompressor/transformers/tracing/test_models.py` now passes with the latest `transformers==4.53`

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
What does this PR do?
As per the title. The goal is to properly separate masking logic from modeling code itself, to continue our objective of simplifying the library.