
Falcon optimization #974

Merged
merged 18 commits into main from sasarkar_falcon_opt on Jun 6, 2024
Conversation

@libinta (Collaborator) commented May 10, 2024

What does this PR do?

  1. Add use_flash_attention, flash_attention_recompute, and flash_attention_causal_mask arguments (see the sketch right after this list).
  2. Add a mark step per decoder layer.
  3. Add FusedSDPA FP8 support.
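For illustration, a minimal, hypothetical sketch of how these flags could be threaded through a generate() call; only the flag names come from this PR, the surrounding model and config objects are stand-ins:

```python
# Hypothetical sketch: forwarding the three new flags via a generation config.
# Only the flag names are taken from this PR; model/config objects are stand-ins.
def generate_with_flash_attention(model, input_ids, generation_config, max_new_tokens=128):
    generation_config.use_flash_attention = True          # use the fused SDPA attention path
    generation_config.flash_attention_recompute = True    # recompute attention to reduce memory use
    generation_config.flash_attention_causal_mask = True  # apply the causal mask inside the kernel
    return model.generate(
        input_ids,
        generation_config=generation_config,
        max_new_tokens=max_new_tokens,
    )
```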

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

libinta and others added 3 commits May 9, 2024 00:34
1. add new args: use_flash_attention, flash_attention_recompute, flash_attention_causal_mask
2. add an extra mark_step per decoder layer
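As a rough illustration of the mark_step-per-decoder-layer change (a sketch, not the PR's actual code; `layers` and `hidden_states` are placeholders for the model's real attributes):

```python
# Sketch only: insert an HPU graph break after each decoder layer in lazy mode.
import habana_frameworks.torch.core as htcore

def run_decoder_layers(layers, hidden_states, lazy_mode=True):
    for layer in layers:
        hidden_states = layer(hidden_states)
        if lazy_mode:
            # Flush the accumulated lazy graph so each layer executes as its own segment.
            htcore.mark_step()
    return hidden_states
```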
@libinta requested a review from mandy-li as a code owner May 10, 2024 23:11
@libinta requested a review from a user May 10, 2024 23:11
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ssarkar2 (Collaborator) commented:

measurement:
QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1 --warm 0 --n_iter 1 --flash_attention_recompute --flash_attention_causal_mask --use_flash_attention

actual run:
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path tiiuae/falcon-40b --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 2048 --batch_size 4 --bf16 --reuse_cache --bucket_internal --fp8 --max_input_tokens 2048 --warm 2 --n_iter 2 --flash_attention_recompute --flash_attention_causal_mask --use_flash_attention

For Falcon-40B, 2048 input tokens -> 2048 output tokens:

With batch size 8:
without PR: 185 tps with 84.2 GB memory
with PR: 187 tps with 72.9 GB memory

With batch size 12:
without PR: OOM
with PR: 219 tps with 90.63 GB memory

ssarkar2 and others added 3 commits May 17, 2024 05:30
…um_kv_heads, not num_attention_heads.

Improved performance by removing the explicit broadcast, since the HPU can handle broadcasting inside FusedSDPA (see the sketch below).
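Roughly, the idea looks like the following sketch (not the PR's actual code; the fused kernel's call signature shown here is an assumption):

```python
# Sketch: skip the explicit KV-head broadcast and let the fused SDPA kernel
# broadcast num_kv_heads -> num_attention_heads internally.
import torch
import torch.nn.functional as F

def attention(query, key, value, attn_mask, fused_sdpa=None):
    # query: (batch, num_attention_heads, q_len, head_dim)
    # key/value: (batch, num_kv_heads, kv_len, head_dim), left un-expanded
    if fused_sdpa is not None:
        # Assumed signature: (q, k, v, attn_mask, dropout_p, is_causal, scale)
        return fused_sdpa.apply(query, key, value, attn_mask, 0.0, False, None)
    # Reference path: materialize the broadcast explicitly (the pattern removed on HPU)
    n_rep = query.shape[1] // key.shape[1]
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```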
@libinta force-pushed the sasarkar_falcon_opt branch from 306cc7e to 04618b3 on May 26, 2024 06:14
@libinta requested a review from regisss as a code owner May 26, 2024 17:54
@libinta added the run-test (Run CI for PRs from external contributors) and synapse1.16 labels May 28, 2024
@libinta added the synapse 1.16_dependency (synapse 1.16 dependency) label May 31, 2024
@regisss merged commit 5b30679 into main Jun 6, 2024
5 of 8 checks passed
@regisss deleted the sasarkar_falcon_opt branch June 6, 2024 23:25
@ssarkar2 mentioned this pull request Jun 11, 2024
imangohari1 pushed a commit to imangohari1/optimum-habana that referenced this pull request Jun 13, 2024
Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>