
Add mark step and inplace residual add in llama model code to reduce memory consumption #65

Merged
7 commits merged into HabanaAI:habana-main on Feb 29, 2024

Conversation

puneeshkhanna

Mark step helps reduce workspace memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.

In-place add helps reduce persistent tensor memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.
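
Roughly, the in-place residual change can be pictured as below; this is only an illustrative sketch with made-up tensor names and shapes, not the actual diff in the modeling file.

```python
import torch

# Illustrative shapes only; the real tensors are (BS, seq_len, hidden_dim) activations.
residual = torch.randn(2, 8, 16)
hidden_states = torch.randn(2, 8, 16)

# Out-of-place residual add: allocates a brand new tensor of the same size.
out_of_place = residual + hidden_states

# In-place residual add: writes the sum into the existing residual buffer,
# avoiding that extra allocation.
residual.add_(hidden_states)
hidden_states = residual
```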

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Mark step helps reduce workspace memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.

In-place add helps reduce persistent tensor memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
@puneeshkhanna requested a review from a user on February 23, 2024 at 11:46
@puneeshkhanna
Author

@dvarshney-habana - please review.
@libinta - Can you please check fine-tuning once?

@vivekgoe left a comment

Please add the mark_step calls under the lazy mode flag. The same modeling file is also used for torch.compile mode, where mark_step is not relevant.
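
For illustration, a minimal sketch of the kind of guard being requested; the helper name and the `lazy_mode` flag plumbing are assumptions, while `htcore.mark_step()` is the standard Habana lazy-mode API.

```python
import habana_frameworks.torch.core as htcore

def maybe_mark_step(lazy_mode: bool) -> None:
    # In lazy mode, mark_step flushes the accumulated graph so workspace
    # memory can be freed between decoder layers; in torch.compile mode the
    # call is not relevant and is skipped.
    if lazy_mode:
        htcore.mark_step()
```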

@vivekgoe requested a review from hlahkar on February 26, 2024 at 11:56
@puneeshkhanna
Author

puneeshkhanna commented Feb 26, 2024

@MrGeva - You may want to review this. Accuracy seems fine. However, I still need to address the mark step comment from Vivek and to check the fine-tuning script.

@puneeshkhanna
Author

@mandy-li - this PR is very important from a memory usage perspective for Llama inference.

As an example, for a config of BS=172, seq len=2048, hidden dim=8191 (tensor size is ~5.3 GB) for Llama-70B on 8x:
Max memory usage (without flash attention) reduced from ~86 GB to ~66 GB.
Max memory usage (with flash attention) reduced from ~70 GB to ~59 GB.
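
For reference, a quick back-of-the-envelope check of the quoted tensor size, assuming bf16 activations (2 bytes per element; the dtype is an assumption):

```python
# One (BS, seq_len, hidden_dim) activation tensor in bf16 (assumed dtype).
BS, seq_len, hidden_dim = 172, 2048, 8191
size_gib = BS * seq_len * hidden_dim * 2 / 1024**3
print(f"{size_gib:.2f} GiB")  # ~5.37 GiB, consistent with the ~5.3 GB quoted above
```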

@puneeshkhanna
Author

@schoi-habana - Can you check fine-tuning once with this PR?

@puneeshkhanna
Author

@vivekgoe - lazy mode flag and check added.

@vivekgoe

> @vivekgoe - lazy mode flag and check added.

LGTM.

@puneeshkhanna
Author

The in-place add was causing a loss divergence issue during training, so the PR has been updated to perform the in-place add operation only in inference.
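
As a sketch of what that gating might look like (not the actual modeling code; the class and tensor names are illustrative, and `self.training` is the standard torch.nn.Module flag):

```python
import torch
from torch import nn

class DecoderLayerSketch(nn.Module):
    def forward(self, hidden_states: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Training: keep the out-of-place add so autograd sees an unmodified residual
            # (the in-place version showed loss divergence, as noted above).
            return residual + hidden_states
        # Inference: in-place add avoids allocating another (BS, seq_len, hidden_dim) tensor.
        residual.add_(hidden_states)
        return residual
```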

Ran the below command without any of the fixes in this PR:
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lora_clm.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf --deepspeed llama2_ds_zero3_config.json --dataset_name tatsu-lab/alpaca --bf16 True --output_dir ./lora_out --num_train_epochs 1 --max_seq_len 2048 --per_device_train_batch_size 10 --per_device_eval_batch_size 10 --gradient_checkpointing --evaluation_strategy epoch --eval_delay 2 --save_strategy no --learning_rate 0.0018 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --dataset_concatenation --attn_softmax_bf16 True --do_train --do_eval --use_habana --use_lazy_mode --pipelining_fwd_bwd --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --validation_split_percentage 4 --use_flash_attention True
{
"epoch": 1.0,
"eval_accuracy": 0.791171470444553,
"eval_loss": 0.7647133469581604,
"eval_runtime": 27.0564,
"eval_samples": 125,
"eval_samples_per_second": 4.62,
"eval_steps_per_second": 0.074,
"max_memory_allocated (GB)": 81.61,
"memory_allocated (GB)": 26.91,
"perplexity": 2.148378447163791,
"total_memory_available (GB)": 94.62,
"train_loss": 0.8714751173288394,
"train_runtime": 1321.3399,
"train_samples_per_second": 2.628,
"train_steps_per_second": 0.033
}

Ran the below command with the updated changes in this PR (only the mark_step fix applies to fine-tuning):
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lora_clm.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf --deepspeed llama2_ds_zero3_config.json --dataset_name tatsu-lab/alpaca --bf16 True --output_dir ./lora_out --num_train_epochs 1 --max_seq_len 2048 --per_device_train_batch_size 10 --per_device_eval_batch_size 10 --gradient_checkpointing --evaluation_strategy epoch --eval_delay 2 --save_strategy no --learning_rate 0.0018 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --dataset_concatenation --attn_softmax_bf16 True --do_train --do_eval --use_habana --use_lazy_mode --pipelining_fwd_bwd --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --validation_split_percentage 4 --use_flash_attention True
{
"epoch": 1.0,
"eval_accuracy": 0.7912496336101612,
"eval_loss": 0.7647190690040588,
"eval_runtime": 26.8381,
"eval_samples": 125,
"eval_samples_per_second": 4.658,
"eval_steps_per_second": 0.075,
"max_memory_allocated (GB)": 81.58,
"memory_allocated (GB)": 26.91,
"perplexity": 2.148390740319044,
"total_memory_available (GB)": 94.62,
"train_loss": 0.8714751173288394,
"train_runtime": 1319.6586,
"train_samples_per_second": 2.672,
"train_steps_per_second": 0.034
}

@libinta, @schoi-habana - FYI.
log_lora_without_fixes.txt
log_lora_with_fixes.txt

@ghost merged commit 725a6a3 into HabanaAI:habana-main on Feb 29, 2024
schoi-habana pushed a commit that referenced this pull request Mar 1, 2024
…memory consumption (#65)

* Add mark step and inplace add.

Mark step helping in reducing workspace memory by
approx twice of (BS,seq len, hidden dim).

Inplace add helping in reducing persistent tensors by
approx twice of (BS, seq len, hidden dim).

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>

* Add lazy mode parameter

* Move mark step within the loop

* Move mark step before the loop

* Fix indentation

* update in place add only for inference

---------

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
puneeshkhanna pushed a commit to puneeshkhanna/optimum-habana-fork that referenced this pull request Mar 25, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 19, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 22, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 24, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 24, 2024
@astachowiczhabana

huggingface#833

astachowiczhabana pushed a commit that referenced this pull request Jan 17, 2025
ShengYang1 added a commit that referenced this pull request Jan 20, 2025
xinyu-intel pushed a commit that referenced this pull request Mar 4, 2025