
Falcon optimization #974

Merged
merged 18 commits into main from sasarkar_falcon_opt on Jun 6, 2024
Conversation

@libinta (Collaborator) commented May 10, 2024

What does this PR do?

  1. Add use_flash_attention, flash_attention_recompute, and flash_attention_causal_mask arguments (see the sketch right after this list).
  2. Add a mark step per decoder layer.
  3. Add FusedSDPA FP8 support.
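For illustration, a minimal, hypothetical sketch of how these flags could be threaded through a generate() call; only the flag names come from this PR, the surrounding model and config objects are stand-ins:

```python
# Hypothetical sketch: forwarding the three new flags via a generation config.
# Only the flag names are taken from this PR; model/config objects are stand-ins.
def generate_with_flash_attention(model, input_ids, generation_config, max_new_tokens=128):
    generation_config.use_flash_attention = True          # use the fused SDPA attention path
    generation_config.flash_attention_recompute = True    # recompute attention to reduce memory use
    generation_config.flash_attention_causal_mask = True  # apply the causal mask inside the kernel
    return model.generate(
        input_ids,
        generation_config=generation_config,
        max_new_tokens=max_new_tokens,
    )
```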

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

libinta and others added 3 commits May 9, 2024 00:34
1. add new args: use_flash_attention, flash_attention_recompute, flash_attention_causal_mask
2. add an extra mark_step per decoder layer
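As a rough illustration of the mark_step-per-decoder-layer change (a sketch, not the PR's actual code; `layers` and `hidden_states` are placeholders for the model's real attributes):

```python
# Sketch only: insert an HPU graph break after each decoder layer in lazy mode.
import habana_frameworks.torch.core as htcore

def run_decoder_layers(layers, hidden_states, lazy_mode=True):
    for layer in layers:
        hidden_states = layer(hidden_states)
        if lazy_mode:
            # Flush the accumulated lazy graph so each layer executes as its own segment.
            htcore.mark_step()
    return hidden_states
```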
@libinta requested a review from mandy-li as a code owner May 10, 2024 23:11
@libinta requested a review from a user May 10, 2024 23:11
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ssarkar2 (Collaborator) commented:

measurement:
QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1 --warm 0 --n_iter 1 --flash_attention_recompute --flash_attention_causal_mask --use_flash_attention

actual run:
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path tiiuae/falcon-40b --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 2048 --batch_size 4 --bf16 --reuse_cache --bucket_internal --fp8 --max_input_tokens 2048 --warm 2 --n_iter 2 --flash_attention_recompute --flash_attention_causal_mask --use_flash_attention

For Falcon-40B, 2048 input tokens -> 2048 output tokens:

With batch size 8:
without PR: 185 tps with 84.2 GB memory
with PR: 187 tps with 72.9 GB memory

With batch size 12:
without PR: OOM
with PR: 219 tps with 90.63 GB memory

ssarkar2 and others added 3 commits May 17, 2024 05:30
…um_kv_heads, not num_attention_heads.

Improved performance by removing the explicit broadcast, since the HPU can handle broadcasting inside FusedSDPA (see the sketch below).
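Roughly, the idea looks like the following sketch (not the PR's actual code; the fused kernel's call signature shown here is an assumption):

```python
# Sketch: skip the explicit KV-head broadcast and let the fused SDPA kernel
# broadcast num_kv_heads -> num_attention_heads internally.
import torch
import torch.nn.functional as F

def attention(query, key, value, attn_mask, fused_sdpa=None):
    # query: (batch, num_attention_heads, q_len, head_dim)
    # key/value: (batch, num_kv_heads, kv_len, head_dim), left un-expanded
    if fused_sdpa is not None:
        # Assumed signature: (q, k, v, attn_mask, dropout_p, is_causal, scale)
        return fused_sdpa.apply(query, key, value, attn_mask, 0.0, False, None)
    # Reference path: materialize the broadcast explicitly (the pattern removed on HPU)
    n_rep = query.shape[1] // key.shape[1]
    key = key.repeat_interleave(n_rep, dim=1)
    value = value.repeat_interleave(n_rep, dim=1)
    return F.scaled_dot_product_attention(query, key, value, attn_mask=attn_mask)
```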
@libinta force-pushed the sasarkar_falcon_opt branch from 306cc7e to 04618b3 on May 26, 2024 06:14
@libinta requested a review from regisss as a code owner May 26, 2024 17:54
@libinta added the run-test (Run CI for PRs from external contributors) and synapse1.16 labels May 28, 2024
@libinta added the synapse 1.16_dependency (synapse 1.16 dependency) label May 31, 2024
@regisss merged commit 5b30679 into main Jun 6, 2024
5 of 8 checks passed
@regisss deleted the sasarkar_falcon_opt branch June 6, 2024 23:25
@ssarkar2 mentioned this pull request Jun 11, 2024
imangohari1 pushed a commit to imangohari1/optimum-habana that referenced this pull request Jun 13, 2024
Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>