Gemma: enabled HPU Graphs and Flash Attention #1173
Conversation
@dsmertin can you rebase this PR onto main?
Hi @dsmertin
Thank you for this PR.
Could you please address the following and push the changes?
- Please rebase/sync on top of OH main. This branch is currently 72 commits behind.
- Please make sure to run make style.
- Please share the Gemma results before/after these changes.
- We need to test/update the CI tests with these changes.
q_len = query_layer.size(-2)
q_tiles = (q_len // q_block_size) if (q_len % q_block_size == 0) else math.ceil(q_len / q_block_size)
q_padding = q_tiles * q_block_size - q_len
query_layer = F.pad(query_layer, (0, 0, 0, q_padding), "constant", 0)
make style complains here that F is not defined.
Please take a look.
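For reference, here is a minimal self-contained sketch of this padding step with the missing import added; it assumes F refers to torch.nn.functional, and the tensor shape and block size below are made up for illustration only:

import math

import torch
import torch.nn.functional as F  # the import make style reports as missing

# Illustrative shapes: (batch, num_heads, seq_len, head_dim) and an arbitrary block size.
query_layer = torch.randn(1, 8, 3000, 128)
q_block_size = 1024

q_len = query_layer.size(-2)
# Number of q_block_size tiles needed to cover q_len.
q_tiles = (q_len // q_block_size) if (q_len % q_block_size == 0) else math.ceil(q_len / q_block_size)
# Right-pad the sequence dimension with zeros so its length becomes a multiple of q_block_size.
q_padding = q_tiles * q_block_size - q_len
query_layer = F.pad(query_layer, (0, 0, 0, q_padding), "constant", 0)
print(query_layer.shape)  # torch.Size([1, 8, 3072, 128])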
q_padding = q_tiles * q_block_size - q_len
query_layer = F.pad(query_layer, (0, 0, 0, q_padding), "constant", 0)
if attention_mask is not None:
    attention_mask = F.pad(attention_mask, (0, 0, 0, q_padding), "constant", -10000.0)
Same comment here about make style.
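For context on this hunk: the attention mask is padded with a large negative value so that, after softmax, the padded positions receive effectively zero attention weight. A minimal sketch under the same assumptions as above (illustrative shapes, F = torch.nn.functional, and an additive mask where 0 means attend):

import torch
import torch.nn.functional as F

q_block_size = 1024
query_layer = torch.randn(1, 8, 3000, 128)      # (batch, num_heads, q_len, head_dim)
attention_mask = torch.zeros(1, 1, 3000, 3000)  # additive mask: 0 = attend, large negative = masked

q_len = query_layer.size(-2)
q_padding = -q_len % q_block_size  # same value as the tile-based computation above
query_layer = F.pad(query_layer, (0, 0, 0, q_padding), "constant", 0)
if attention_mask is not None:
    # Pad the query dimension of the mask with -10000.0 so the extra rows are ignored by softmax.
    attention_mask = F.pad(attention_mask, (0, 0, 0, q_padding), "constant", -10000.0)
print(attention_mask.shape)  # torch.Size([1, 1, 3072, 3000])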
I did a bit more testing here and added a few more comments.
This PR also has merge conflicts with main.
Please make sure to address them during the rebase, and run make style.
Thanks.
- add new arg flash_attention_recompute
"""
if "padding_mask" in kwargs:
    warnings.warn(
make style complains here that warnings is not defined.
Shouldn't this be a logger.warning_once?
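A minimal sketch of the logger.warning_once alternative, assuming the transformers logging utilities used elsewhere in the modeling code; the function body and message text are illustrative, not taken from the file:

from transformers.utils import logging

logger = logging.get_logger(__name__)

def forward(hidden_states, **kwargs):
    if "padding_mask" in kwargs:
        # Review suggestion: use the transformers logger so the notice is emitted only once per process.
        logger.warning_once(
            "Passing padding_mask is deprecated, please use attention_mask instead."  # illustrative message
        )
    return hidden_states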
I fixed some issues, updated the code with make style, and added --use_flash_attention to the Gemma CI test.
Please apply the attached fix with git am < 0001* and push.
Thanks.
0001-fix-gemma-make-style.-minor-fixes.patch
@dsmertin can you address the comments?
@yafshar @imangohari1
@dsmertin Please work through the changes I have suggested here and push them so we can test this further on the RC. Thanks.
The branch was force-pushed from 686bf2d to 1f71df0.
I've updated the branch with the rebase and your patch, @imangohari1.
LGTM!
@regisss please take a look. Thanks.
The code quality check failed, please run make style.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
CI failed with:
What does this PR do?
This PR fixes HPU Graphs usage and Flash Attention for the Gemma model.
The changes are based on the Starcoder 2 and Qwen 2 implementations.
Before submitting