
add fp8 related changes to mistral for text-generation #918

Merged: 23 commits merged into main from skaulintel/mistral_fp8 on May 7, 2024

Conversation

@skaulintel skaulintel (Collaborator) commented Apr 23, 2024

What does this PR do?

Initial fp8 changes for Mistral text-generation.

Command lines (a hedged sketch of the full measure-then-quantize flow follows the last set of results below):

  1. 128x128 (input tokens x output tokens)

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13250.825658116784 tokens/second
Number of HPU graphs = 85
Memory allocated = 38.37 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.98284676099138 seconds

  2. 2048x128

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1362.8371789032228 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.82 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.72206230499432 seconds

  3. 2048x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --fp8 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3105.9817365063354 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 414.38635561900446 seconds

  4. 128x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 7738.114888711109 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.97 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 405.53613558399957 seconds
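
For context, the QUANT_CONFIG-driven fp8 path is normally a two-pass flow: a calibration pass that records maxabs statistics, followed by the quantized pass shown in the commands above. The sketch below illustrates that flow for the 128x128 shape, reusing the flags verbatim from the first command; the measurement config path (./quantization_config/maxabs_measure.json) and the exact set of flags needed during calibration are assumptions for illustration, not something this PR specifies.

# Pass 1 (assumed): run in measurement mode to collect maxabs calibration statistics.
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

# Pass 2 (from the PR): rerun with the maxabs quantization config so generation executes in fp8 using the scales gathered in pass 1.
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

As a rough sanity check on the first result, and assuming the reported throughput counts only newly generated tokens across the whole batch, 896 x 128 = 114688 tokens at 13250.8 tokens/second works out to roughly 8.7 seconds per generation batch.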

@skaulintel skaulintel requested a review from regisss as a code owner April 23, 2024 17:54
@skaulintel skaulintel requested review from libinta and jiminha April 23, 2024 17:56
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@skaulintel skaulintel marked this pull request as draft April 23, 2024 22:00
@skaulintel skaulintel changed the title from "donotmerge: add fp8 related changes to mistral for text-generation" to "add fp8 related changes to mistral for text-generation" Apr 25, 2024
@skaulintel skaulintel marked this pull request as ready for review April 25, 2024 22:34
@libinta libinta added the run-test label (Run CI for PRs from external contributors) Apr 29, 2024
@regisss regisss merged commit 9f6eba3 into main May 7, 2024
9 checks passed
@regisss regisss deleted the skaulintel/mistral_fp8 branch May 7, 2024 22:16
ccrhx4 pushed a commit to ccrhx4/ccrhx4.optimum-habana that referenced this pull request May 11, 2024
Co-authored-by: Jimin Ha <jha@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
@mandy-li mandy-li requested review from schoi-habana and removed request for schoi-habana May 17, 2024 16:28
Labels: run-test (Run CI for PRs from external contributors), synapse1.16