Split the graphs to run with flash_attention on 1x #75
Conversation
```
@@ -676,6 +677,9 @@ def forward(
        next_decoder_cache = () if not use_new_cache else None

        for layer_idx, decoder_layer in enumerate(self.layers):
            if torch.distributed.is_initialized() == False:
                htcore.mark_step()
```
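For context, a minimal sketch of the mechanism (assuming a Gaudi device with the Habana PyTorch bridge installed; the tensors and loop below are stand-ins, not the PR's code): in lazy mode, HPU ops are only recorded until `htcore.mark_step()` is called, which compiles and launches the accumulated ops as one graph (a "recipe"). Calling it once per decoder layer therefore produces many small recipes instead of a single large one.

```python
# Minimal sketch, not the PR's code: mark_step() as a graph boundary in lazy mode.
import torch
import habana_frameworks.torch.core as htcore

x = torch.randn(8, 1024, device="hpu")
w = torch.randn(1024, 1024, device="hpu")

for _ in range(4):          # stand-in for the decoder-layer loop
    htcore.mark_step()      # ops recorded so far compile into their own recipe
    x = torch.tanh(x @ w)   # stand-in for one decoder layer's compute

print(x.sum().item())       # .item() forces execution of the remaining ops
```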
@kalyanjk what's the impact for input/output sizes that don't run into OOM? Should we add a command-line argument for this in text-generation?
@kalyanjk, why only mark_step() for 1x?
For 8x, the mark_step will be introduced through a collective call.
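A rough sketch of that point, assuming `torch.distributed` was initialized with the HCCL backend on HPU (the helper name and the reduction are illustrative, not the model's actual code): on a multi-card run the per-layer all_reduce is already a synchronization point, and, per the comment above, that collective is what introduces the mark_step, so no explicit call is needed there.

```python
# Illustrative only: on 8x the collective call is where the graph break comes
# from, so the explicit htcore.mark_step() is restricted to single-card runs.
import torch
import torch.distributed as dist

def reduce_layer_output(hidden_states: torch.Tensor) -> torch.Tensor:
    # Hypothetical helper, not part of the model code.
    if dist.is_initialized() and dist.get_world_size() > 1:
        dist.all_reduce(hidden_states, op=dist.ReduceOp.SUM)  # collective call
    return hidden_states
```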
> @kalyanjk what's the impact for input/output sizes that don't run into OOM? Should we add a command-line argument for this in text-generation?

The issue is not OOM. The real issue is that the recipe size becomes too large and the compilation time is too high.
Please update as below
if lazy_mode and (torch.distributed.is_initialized() is False or torch.distributed.get_world_size() == 1):
Wait, we should not put the mark_step after the start of the loop. It will create more graphs and perf will be lower.
Verified the change; it is required for faster recipe compilation.
```
@@ -23,6 +23,7 @@
    _gaudi_prepare_4d_causal_attention_mask,
)

import habana_frameworks.torch.core as htcore
```
If you rebase to the latest, this htcore import is not required, as it is already part of PR#65.
lgtm
* Split the graphs to run with flash_attention on 1x
* Added lazy_mode check and removed additional htcore import

Co-authored-by: Kalyan <kkumar@habana.ai>
This PR solves the actual issue #126.
With flash attention enabled for larger batch sizes, the recipe ARC HBM memory size exceeds the QueueComputeScal ARC HBM memory. Hence, split the graph on 1x so that each decoder layer compiles into a smaller recipe.
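Putting the thread together, a rough reconstruction of the per-layer hook after the review comments (a sketch assuming `lazy_mode` is available inside `forward()` and `htcore` is imported at module level as in PR#65; this is not a verbatim copy of the merged diff):

```python
# Reconstruction from the review thread, not the exact merged code.
for layer_idx, decoder_layer in enumerate(self.layers):
    if lazy_mode and (
        not torch.distributed.is_initialized()
        or torch.distributed.get_world_size() == 1
    ):
        # Single-card lazy mode: cut the graph here so each decoder layer
        # compiles into its own, smaller recipe.
        htcore.mark_step()

    # Placeholder call; the real signature also takes attention masks,
    # position ids, the KV cache, and so on.
    hidden_states = decoder_layer(hidden_states)
```

On multi-card runs the guard is skipped because, as noted in the discussion, the collective calls already introduce the graph breaks.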