add fp8 related changes to mistral for text-generation #918

skaulintel · 2024-04-23T17:54:02Z

What does this PR do?

Initial fp8 changes for Mistral text generation.

Command Lines:

128x128xbs4

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13250.825658116784 tokens/second
Number of HPU graphs = 85
Memory allocated = 38.37 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.98284676099138 seconds

2048x128

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1362.8371789032228 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.82 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.72206230499432 seconds

2048x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --fp8 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3105.9817365063354 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 414.38635561900446 seconds

128x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 7738.114888711109 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.97 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 405.53613558399957 seconds
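
The four command lines above differ only in batch size, token limits, and whether the bucketing flags are present. As a rough sketch, they could be generated from a small table rather than typed by hand. The names `COMMON_FLAGS`, `RUNS`, and `build_command` are hypothetical helpers, not part of `run_generation.py`, and the rule "add `--bucket_internal --bucket_size 128` when `--max_new_tokens` is 2048" is only an observation about these four commands, not a documented requirement.

```python
# Flags shared by all four commands quoted in this PR description.
COMMON_FLAGS = [
    "--attn_softmax_bf16",
    "--use_hpu_graphs",
    "--trim_logits",
    "--use_kv_cache",
    "--reuse_cache",
    "--bf16",
]

# (max_input_tokens, max_new_tokens, batch_size), copied from the runs above.
RUNS = [
    (128, 128, 896),
    (2048, 128, 120),
    (2048, 2048, 44),
    (128, 2048, 120),
]


def build_command(max_input_tokens, max_new_tokens, batch_size):
    """Return one benchmark command line as a single shell string."""
    args = [
        "python", "run_generation.py",
        "--model_name_or_path", "mistralai/Mistral-7B-Instruct-v0.2",
        *COMMON_FLAGS,
        "--batch_size", str(batch_size),
        "--fp8",
        "--max_new_tokens", str(max_new_tokens),
        "--max_input_tokens", str(max_input_tokens),
    ]
    # Assumption: in the commands above, the bucketing flags appear only
    # on the runs that generate 2048 new tokens.
    if max_new_tokens == 2048:
        args += ["--bucket_internal", "--bucket_size", "128"]
    args.append("--limit_hpu_graphs")
    # QUANT_CONFIG is passed as an environment variable, as in the PR.
    return "QUANT_CONFIG=./quantization_config/maxabs_quant.json " + " ".join(args)


if __name__ == "__main__":
    for run in RUNS:
        print(build_command(*run))
```

Printing the commands rather than executing them keeps the sketch runnable on a machine without Gaudi hardware; the output can be pasted into a shell from the text-generation example directory.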

HuggingFaceDocBuilderDev · 2024-04-23T17:59:27Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

…ce/optimum-habana into skaulintel/mistral_fp8

remove padding_mask warning

optimum/habana/transformers/models/mistral/modeling_mistral.py

Co-authored-by: Jimin Ha <jha@habana.ai> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

add fp8 related changes to mistral for text-generation

2549290

skaulintel requested a review from regisss as a code owner April 23, 2024 17:54

add KVCache object

513aa10

skaulintel requested review from libinta and jiminha April 23, 2024 17:56

skaulintel marked this pull request as draft April 23, 2024 22:00

jiminha and others added 4 commits April 23, 2024 16:15

Fix layer_idx warning issue

be79f0f

Style fix

e955c20

add reuse_cache and some other arguments to mistral inputs

ffaaa1d

Merge branch 'skaulintel/mistral_fp8' of https://github.com/huggingfa…

7b950fe

…ce/optimum-habana into skaulintel/mistral_fp8

skaulintel changed the title ~~donotmerge: add fp8 related changes to mistral for text-generation~~ add fp8 related changes to mistral for text-generation Apr 25, 2024

skaulintel added 3 commits April 25, 2024 17:48

style reformat

0ee0339

Update modeling_mistral.py

dc7afb4

remove padding_mask warning

style fix

2e20a3f

jiminha approved these changes Apr 25, 2024

skaulintel marked this pull request as ready for review April 25, 2024 22:34

Merge branch 'main' into skaulintel/mistral_fp8

e661aac

libinta added the run-test label (Run CI for PRs from external contributors) Apr 29, 2024

This was referenced Apr 30, 2024

Add support for Mistral fp8 #935

Closed

Added Mistral fp8 support HabanaAI/optimum-habana-fork#185

Merged

regisss reviewed May 2, 2024

skaulintel added 7 commits May 2, 2024 10:59

move logging to before model definition

3e1dcdc

adjust use cache logic

c30d43e

adjust use cache logic

aebeeca

change mistral decoder layer init

eb174e3

remove unnecessary update_sincos_cache

234840e

small change in forward pass if use_cache

2130194

style fix

46aa95e

regisss mentioned this pull request May 4, 2024

Support Mistral 32K input token #931

Merged

skaulintel added 3 commits May 6, 2024 18:19

remove use_fused_rope

176c5f6

start to add mistral ci tests

224559c

add mistral CI tests

a6669cf

libinta added the synapse1.16 label May 7, 2024

skaulintel and others added 3 commits May 7, 2024 14:06

Merge branch 'main' into skaulintel/mistral_fp8

71e7fe7

Fix

08f1505

Factorize code

bcdcfdd

regisss approved these changes May 7, 2024

regisss merged commit 9f6eba3 into main May 7, 2024
9 checks passed

regisss deleted the skaulintel/mistral_fp8 branch May 7, 2024 22:16

skaulintel mentioned this pull request May 7, 2024

Skaulintel/ci mistral fp8 #924

Closed

ccrhx4 pushed a commit to ccrhx4/ccrhx4.optimum-habana that referenced this pull request May 11, 2024

Add fp8 related changes to mistral for text-generation (huggingface#918)

49cedf4

Co-authored-by: Jimin Ha <jha@habana.ai> Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>

mandy-li requested review from schoi-habana and removed request for schoi-habana May 17, 2024 16:28

This was referenced Jun 12, 2024

update kvcache mistral HabanaAI/optimum-habana-fork#145

Merged

move kv cache object back modeling_mistral.py HabanaAI/optimum-habana-fork#254

Closed
