Llama: fix custom 4D masks #29930
Conversation
if attention_mask is not None:
    causal_mask = causal_mask.clone()  # copy to contiguous memory for in-place edit
    if attention_mask.dim() == 2:
if attention_mask is not None and attention_mask.dim() == 4:
reordered the logic: custom 4D masks are now a superset of the default mask, so we don't need to create the default mask first :)
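To illustrate the idea, here is a minimal, self-contained sketch (made-up sizes, not the actual modeling_llama.py code): a full custom 4D mask already contains every row the model could need, so the forward pass can simply slice out the rows for the current query positions instead of building the default causal mask first.

```python
import torch

full_len, sequence_length = 8, 3                    # e.g. 5 cached tokens + 3 new queries
cache_position = torch.arange(5, 5 + sequence_length)

# Hypothetical full custom 4D mask over the whole sequence (here: plain causal, 1 = attend)
custom_4d_mask = torch.tril(torch.ones(full_len, full_len))[None, None, :, :]

# Slice out only the rows for the tokens processed in this forward pass
offset = cache_position[0]
mask_slice = custom_4d_mask[..., offset : offset + sequence_length, :]
causal_mask = mask_slice
print(causal_mask.shape)                            # torch.Size([1, 1, 3, 8])
```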
offset = cache_position[0]
mask_slice = mask_slice[..., offset : offset + sequence_length, :]
causal_mask = mask_slice
else:
This `else` has no changes. Only the `if attention_mask is not None and attention_mask.dim() == 4:` line is different.
tests/test_modeling_utils.py (outdated)
self.assertEqual(decoded_0, decoded_1b)

# Case 2: we pass a 4D attention mask regarding the full sequence length (i.e. [..., full_len, full_len])
Added this test case (we can now pass full custom 4D attention masks)
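Roughly what this case exercises, as a hedged sketch with a tiny randomly initialized Llama and invented sizes (not the actual test code): a full `[..., full_len, full_len]` causal mask, passed in the inverted/additive form the tests use (0 = attend, large negative = masked), should reproduce the plain causal forward pass.

```python
import torch
from transformers import LlamaConfig, LlamaForCausalLM

config = LlamaConfig(
    vocab_size=128, hidden_size=64, intermediate_size=128,
    num_hidden_layers=2, num_attention_heads=4, num_key_value_heads=4,
)
model = LlamaForCausalLM(config).eval()             # tiny random weights, CPU is fine

input_ids = torch.tensor([[3, 5, 7, 11, 13]])
full_len = input_ids.shape[1]

# Full [1, 1, full_len, full_len] causal mask in additive form (0 = attend, min = masked)
keep = torch.tril(torch.ones(full_len, full_len, dtype=torch.bool))[None, None]
mask_4d = torch.where(keep, 0.0, torch.finfo(model.dtype).min)

with torch.no_grad():
    logits_default = model(input_ids).logits
    logits_custom = model(input_ids, attention_mask=mask_4d).logits

print(torch.allclose(logits_default, logits_custom, atol=1e-4))  # expected: True (up to tiny kernel differences)
```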
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks, LGTM! I just want to always trigger the tests.
Thanks @gante!
@poedator would you like to open a PR with that? As a user, you'll probably have cool examples in mind!
Will try, but not this week...
@@ -735,3 +736,138 @@ def test_model_7b_logits(self):
        ]
        infilling = tokenizer.batch_decode(generated_ids)
        self.assertEqual(infilling, EXPECTED_INFILLING)

    @slow
This set of slow tests was moved to the Llama test file -> if we run the slow Llama tests, which we often request, these will now be triggered.
@@ -4027,6 +4027,101 @@ def test_flash_attn_2_from_config(self):

        self.assertFalse(fa2_correctly_converted)

    def _get_custom_4d_mask_test_data(self):
This set of tests is now:
- part of the mixin, so they are run on all push commits
- a fast test, using `model = model_class(config)` from the test config -- triggered by `model_class._supports_cache_class == True`. Recent LLMs (llama, cohere, gemma, mistral, mixtral, starcoder2, ...) have this attribute set to `True` and are 4D mask-compatible. Older models are often not compatible. Over time, as we spread the cache refactor, this test will be run on those classes as well 👀
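For illustration, a rough, self-contained skeleton of that pattern (this is not the actual ModelTesterMixin code; the class name, config sizes, and assertion below are invented): build a tiny model directly from a config and skip classes that don't set `_supports_cache_class`.

```python
import unittest
import torch
from transformers import LlamaConfig, LlamaForCausalLM

class Custom4DMaskSketchTest(unittest.TestCase):
    model_class = LlamaForCausalLM
    config = LlamaConfig(
        vocab_size=64, hidden_size=32, intermediate_size=64,
        num_hidden_layers=2, num_attention_heads=2, num_key_value_heads=2,
    )

    def test_custom_4d_attention_mask(self):
        # Gate on the attribute mentioned above: models that haven't gone through
        # the cache refactor are skipped rather than failed.
        if not getattr(self.model_class, "_supports_cache_class", False):
            self.skipTest("model class does not support the new cache format")

        model = self.model_class(self.config).eval()   # tiny random weights -> fast test
        input_ids = torch.tensor([[1, 2, 3, 4]])
        full_len = input_ids.shape[1]

        # Custom 4D mask in additive form (0 = attend, large negative = masked)
        keep = torch.tril(torch.ones(full_len, full_len, dtype=torch.bool))[None, None]
        mask_4d = torch.where(keep, 0.0, torch.finfo(model.dtype).min)

        with torch.no_grad():
            logits = model(input_ids, attention_mask=mask_4d).logits
        self.assertEqual(tuple(logits.shape), (1, full_len, self.config.vocab_size))

if __name__ == "__main__":
    unittest.main()
```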
@ArthurZucker ready for a re-review (test rework) -- we now have on-push tests for all recent models + custom 4D masks :)
@gante, I made the cache longer than the masks and padded the masks to the cache length. Is this the correct way?
Sorry for the delay, let's rebase on main as well
Very good! Let's rebase on main now that #30047 was merged, and run the slow tests!
Please, please merge this PR - I need it for my speculative decoding paper project! The 4D masks are essential for it.
Sorry, just got back to GitHub 😓 could you rebase?
I rebased this PR into a new one, #30348, and added a few important changes.
Closing in favor of #30348
What does this PR do?
Fixes the issue raised by @poedator in this comment.
The causal mask is now of shape `[..., seq_len, full_len]`, as opposed to `[..., full_len, full_len]`. This means a custom 4D attention mask now covers the whole causal mask, so we don't need a sliced copy -- we can copy the whole thing :)
This PR also expands the support for custom 4D attention masks: we can pass either the full mask (`[..., full_len, full_len]`) or the partial mask (`[..., seq_len, full_len]`).
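As a concrete illustration of what this enables, here is a hedged, torch-only sketch in the spirit of the new tests (token values and sizes are invented): the shared-prefix / packed-candidates pattern that custom 4D masks are typically used for, built in the full `[..., full_len, full_len]` form with the inverted convention (0 = attend, large negative = masked).

```python
import torch

# Three 4-token candidates that share a 3-token prefix, packed into one sequence:
#   [10, 11, 12, 13], [10, 11, 12, 14], [10, 11, 12, 15]  ->  [10, 11, 12, 13, 14, 15]
input_ids = torch.tensor([[10, 11, 12, 13, 14, 15]])
position_ids = torch.tensor([[0, 1, 2, 3, 3, 3]])     # all three endings sit at position 3

allow = torch.tensor(
    [
        [1, 0, 0, 0, 0, 0],
        [1, 1, 0, 0, 0, 0],
        [1, 1, 1, 0, 0, 0],
        [1, 1, 1, 1, 0, 0],  # token 13 sees the prefix and itself
        [1, 1, 1, 0, 1, 0],  # token 14 sees the prefix and itself, but not 13
        [1, 1, 1, 0, 0, 1],  # token 15 sees the prefix and itself, but not 13 or 14
    ],
    dtype=torch.bool,
)[None, None]                                          # shape [1, 1, full_len, full_len]

mask_4d = torch.where(allow, 0.0, torch.finfo(torch.float32).min)
# -> pass model(input_ids, attention_mask=mask_4d, position_ids=position_ids)
```

The partial form would simply be the last rows of this matrix, e.g. `mask_4d[..., -num_new_tokens:, :]` (with `num_new_tokens` being however many tokens are fed in the current step), paired with the matching `cache_position`.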