Allow passing 2D attention mask #27640

Open
UniverseFly opened this issue Nov 21, 2023 · 13 comments
Labels
Feature request Request for a new feature

Comments

@UniverseFly

Feature request

Allow passing a 2D attention mask in model.forward.

Motivation

With this feature, it would be much easier to avoid cross-context contamination during pretraining and supervised finetuning when packing the sequences together for more efficient training.

An example use case is discussed in huggingface/trl#805.

Your contribution

After looking into the source code, I found that the current logic for initializing attention masks is mostly a fixed code snippet hard-coded in each model:

        if getattr(self.config, "_flash_attn_2_enabled", False):
            # 2d mask is passed through the layers
            attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
        else:
            # 4d mask is passed through the layers
            attention_mask = _prepare_4d_causal_attention_mask(
                attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
            )

Enabling this behavior may require modifying each model individually. I should be able to handle some of them and submit a draft PR. But before that, I want to know whether this feature request is reasonable.

@ArthurZucker added the Feature request label Nov 22, 2023
@ArthurZucker
Collaborator

Hey, the model's forward already supports passing a 2D attention mask; it is just expanded to 4D because that is the format required by the attention implementation.
Would you mind elaborating on what you cannot currently do? (Might be related to #27539?)
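
As a rough illustration of the expansion mentioned above, here is a minimal pure-Python sketch of turning a `[batch_size, seq_length]` padding mask into a `[batch_size, 1, seq_length, seq_length]` additive causal mask. The names `expand_padding_mask` and `NEG_INF` are hypothetical, and using negative infinity for blocked positions is an assumption made for readability. This is not the library's `_prepare_4d_causal_attention_mask`, only a sketch of the idea.

```python
NEG_INF = float("-inf")  # assumption: blocked positions are marked with -inf


def expand_padding_mask(attention_mask):
    """Expand a [batch_size, seq_length] 0/1 padding mask to 4D.

    Returns nested lists shaped [batch_size, 1, seq_length, seq_length].
    Entry [b][0][q][k] is 0.0 when query q may attend to key k
    (k is not padding and k <= q), and NEG_INF otherwise.
    """
    expanded = []
    for row in attention_mask:
        seq_length = len(row)
        matrix = [
            [0.0 if (k <= q and row[k] == 1) else NEG_INF for k in range(seq_length)]
            for q in range(seq_length)
        ]
        expanded.append([matrix])  # singleton dimension shared by all heads
    return expanded


# One sequence of three tokens where the last token is padding.
mask_4d = expand_padding_mask([[1, 1, 0]])
```

Note that the 2D input can only say which key positions are padding. It cannot say that two tokens in the same row belong to different packed sequences.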

@UniverseFly
Author

> Hey, the model's forward already supports passing a 2D attention mask; it is just expanded to 4D because that is the format required by the attention implementation.
> Would you mind elaborating on what you cannot currently do? (Might be related to #27539?)

Yeah, I may not have made it clear. The current "2D" masks are [batch_size, num_tokens]. What I suggested was [batch_size, num_tokens, num_tokens], so that each sequence in the batch has its own matrix that explicitly defines which tokens each token should attend to. #27539 seems relevant.
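
To make the proposed shape concrete, here is a minimal pure-Python sketch that builds a `[batch_size, num_tokens, num_tokens]` boolean mask for packed rows. The function names and the `sequence_ids` input (one integer per token saying which original sequence it came from) are hypothetical and chosen only for illustration. Nothing here is an existing `transformers` API.

```python
def block_diagonal_causal_mask(sequence_ids):
    """Build a [num_tokens, num_tokens] boolean mask for one packed row.

    sequence_ids[i] identifies the original sequence that token i belongs to.
    mask[q][k] is True when query token q may attend to key token k, i.e.
    both tokens come from the same sequence and k is not in the future.
    """
    num_tokens = len(sequence_ids)
    return [
        [sequence_ids[q] == sequence_ids[k] and k <= q for k in range(num_tokens)]
        for q in range(num_tokens)
    ]


def batched_block_diagonal_mask(batch_sequence_ids):
    """Stack per-row masks into a [batch_size, num_tokens, num_tokens] list."""
    return [block_diagonal_causal_mask(row) for row in batch_sequence_ids]


# One packed row holding a 3-token sequence followed by a 2-token sequence.
mask = batched_block_diagonal_mask([[0, 0, 0, 1, 1]])
```

In this example the fourth token (the first token of the second sequence) attends only to itself, which is the cross-context isolation the request is after.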

@jwkirchenbauer

Just chiming in, here is some more context (also very interested in this feature). From what I understand, this is not trivial to implement in general...

As one current example, the axolotl finetuning harness implements efficient sample packing with correct block diagonal attention masking through a series of monkey patches for the underlying huggingface model definitions for a few of the very popular models like llama and mistral. Though I have not looked through the code in detail, I believe it leverages the fact that the flash attention api supports the masking required to implement this scheme.
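
For context, variable-length attention kernels are commonly told where packed sequences begin and end through cumulative sequence lengths rather than through a dense mask. The sketch below only computes those boundaries in pure Python. The function name is hypothetical, and whether axolotl's patches use exactly this representation is an assumption, not something confirmed in this thread.

```python
from itertools import accumulate


def cumulative_sequence_lengths(seq_lens):
    """Return boundaries [0, l0, l0 + l1, ...] for sequences packed in one row.

    Sequence i occupies token positions [cu[i], cu[i + 1]) in the packed row.
    """
    return [0] + list(accumulate(seq_lens))


# Three sequences of lengths 3, 2, and 4 packed into a 9-token row.
cu_seqlens = cumulative_sequence_lengths([3, 2, 4])
# Token ranges recovered from the boundaries, one (start, end) pair per sequence.
ranges = list(zip(cu_seqlens[:-1], cu_seqlens[1:]))
```

A kernel that receives these boundaries can restrict attention to each range, which is equivalent to a block diagonal mask without materializing a `num_tokens x num_tokens` matrix.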

It is relevant for efficient finetuning (the reason it's incorporated into axolotl), and general wisdom (and whispers from inside large corps) suggests that this type of block diagonal masking is better for large-scale training code.

(#27539 is relevant, but it looks like the focus may be on the beam search/speculative decoding use case, not this slightly more general use case. Also here's a relevant hf forum post https://discuss.huggingface.co/t/the-correct-attention-mask-for-examples-packing/52909/2)

@meliksahturker

Packing is indeed a good use-case for supporting 2D attention mask for huggingface models.

@ArthurZucker
Collaborator

Packing is planned

@thincal

thincal commented Jun 5, 2024

> Packing is planned

Hello, is there a detailed schedule for supporting this feature? Many thanks.

@ArthurZucker
Collaborator

Most probably not next release, but the one after that!

@shashwat14

Looking forward to this feature!

@ArthurZucker
Collaborator

#31446 for packing

@insujang
Contributor

Hi @ArthurZucker, does #31446 include packing? It seems to be just a refactoring of flash attention, which is a prerequisite for packing, not packing itself.

@ArthurZucker
Collaborator

Yep, it's planned, not done yet. I was gonna do both but ended up splitting!

@ArthurZucker
Collaborator

cc @Cyrilvallez

@ArthurZucker
Collaborator

#33932 is related for the packing as well!

7 participants