
Add mark step and inplace residual add in llama model code to reduce memory consumption #65

Merged
7 commits merged into HabanaAI:habana-main on Feb 29, 2024

Conversation

puneeshkhanna

Mark step helps reduce workspace memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.

In-place add helps reduce persistent tensor memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.
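
Roughly, the in-place residual change can be pictured as below; this is only an illustrative sketch with made-up tensor names and shapes, not the actual diff in the modeling file.

```python
import torch

# Illustrative shapes only; the real tensors are (BS, seq_len, hidden_dim) activations.
residual = torch.randn(2, 8, 16)
hidden_states = torch.randn(2, 8, 16)

# Out-of-place residual add: allocates a brand new tensor of the same size.
out_of_place = residual + hidden_states

# In-place residual add: writes the sum into the existing residual buffer,
# avoiding that extra allocation.
residual.add_(hidden_states)
hidden_states = residual
```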

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Mark step helps reduce workspace memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.

In-place add helps reduce persistent tensor memory by approximately twice the size of a (BS, seq len, hidden dim) tensor.

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
@puneeshkhanna requested a review from a user on February 23, 2024 at 11:46
@puneeshkhanna
Author

@dvarshney-habana - please review.
@libinta - Can you please check fine-tuning once?

@vivekgoe left a comment

Please add the mark_step calls under the lazy mode flag. The same modeling file is also used for torch.compile mode, where mark_step is not relevant.
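
For illustration, a minimal sketch of the kind of guard being requested; the helper name and the `lazy_mode` flag plumbing are assumptions, while `htcore.mark_step()` is the standard Habana lazy-mode API.

```python
import habana_frameworks.torch.core as htcore

def maybe_mark_step(lazy_mode: bool) -> None:
    # In lazy mode, mark_step flushes the accumulated graph so workspace
    # memory can be freed between decoder layers; in torch.compile mode the
    # call is not relevant and is skipped.
    if lazy_mode:
        htcore.mark_step()
```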

@vivekgoe requested a review from hlahkar on February 26, 2024 at 11:56
@puneeshkhanna
Author

puneeshkhanna commented Feb 26, 2024

@MrGeva - You may want to review this. Accuracy seems fine. However, I still need to address the mark step comment from Vivek and to check the fine-tuning script.

@puneeshkhanna
Author

@mandy-li - this PR is very important from a memory usage perspective for Llama inference.

As an example, for a config of BS=172, seq len=2048, hidden dim=8191 (tensor size is ~5.3 GB) for Llama-70B on 8x:
Max memory usage (without flash attention) reduced from ~86 GB to ~66 GB.
Max memory usage (with flash attention) reduced from ~70 GB to ~59 GB.
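
For reference, a quick back-of-the-envelope check of the quoted tensor size, assuming bf16 activations (2 bytes per element; the dtype is an assumption):

```python
# One (BS, seq_len, hidden_dim) activation tensor in bf16 (assumed dtype).
BS, seq_len, hidden_dim = 172, 2048, 8191
size_gib = BS * seq_len * hidden_dim * 2 / 1024**3
print(f"{size_gib:.2f} GiB")  # ~5.37 GiB, consistent with the ~5.3 GB quoted above
```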

@puneeshkhanna
Author

@schoi-habana - Can you check fine-tuning once with this PR?

@puneeshkhanna
Author

@vivekgoe - lazy mode flag and check added.

@vivekgoe

> @vivekgoe - lazy mode flag and check added.

LGTM.

@puneeshkhanna
Author

The in-place add was causing a loss divergence issue during training, so the PR has been updated to perform the in-place add operation only in inference.
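
As a sketch of what that gating might look like (not the actual modeling code; the class and tensor names are illustrative, and `self.training` is the standard torch.nn.Module flag):

```python
import torch
from torch import nn

class DecoderLayerSketch(nn.Module):
    def forward(self, hidden_states: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
        if self.training:
            # Training: keep the out-of-place add so autograd sees an unmodified residual
            # (the in-place version showed loss divergence, as noted above).
            return residual + hidden_states
        # Inference: in-place add avoids allocating another (BS, seq_len, hidden_dim) tensor.
        residual.add_(hidden_states)
        return residual
```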

Ran the below command without any of the fixes in this PR:
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lora_clm.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf --deepspeed llama2_ds_zero3_config.json --dataset_name tatsu-lab/alpaca --bf16 True --output_dir ./lora_out --num_train_epochs 1 --max_seq_len 2048 --per_device_train_batch_size 10 --per_device_eval_batch_size 10 --gradient_checkpointing --evaluation_strategy epoch --eval_delay 2 --save_strategy no --learning_rate 0.0018 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --dataset_concatenation --attn_softmax_bf16 True --do_train --do_eval --use_habana --use_lazy_mode --pipelining_fwd_bwd --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --validation_split_percentage 4 --use_flash_attention True
{
"epoch": 1.0,
"eval_accuracy": 0.791171470444553,
"eval_loss": 0.7647133469581604,
"eval_runtime": 27.0564,
"eval_samples": 125,
"eval_samples_per_second": 4.62,
"eval_steps_per_second": 0.074,
"max_memory_allocated (GB)": 81.61,
"memory_allocated (GB)": 26.91,
"perplexity": 2.148378447163791,
"total_memory_available (GB)": 94.62,
"train_loss": 0.8714751173288394,
"train_runtime": 1321.3399,
"train_samples_per_second": 2.628,
"train_steps_per_second": 0.033
}

Ran the below command with the updated changes in this PR (only the mark_step fix applies to fine-tuning):
PT_HPU_MAX_COMPOUND_OP_SIZE=10 DEEPSPEED_HPU_ZERO3_SYNC_MARK_STEP_REQUIRED=1 python3 ../gaudi_spawn.py --use_deepspeed --world_size 8 run_lora_clm.py --model_name_or_path /mnt/weka/data/pytorch/llama2/Llama-2-70b-hf --deepspeed llama2_ds_zero3_config.json --dataset_name tatsu-lab/alpaca --bf16 True --output_dir ./lora_out --num_train_epochs 1 --max_seq_len 2048 --per_device_train_batch_size 10 --per_device_eval_batch_size 10 --gradient_checkpointing --evaluation_strategy epoch --eval_delay 2 --save_strategy no --learning_rate 0.0018 --warmup_ratio 0.03 --lr_scheduler_type "cosine" --logging_steps 1 --dataset_concatenation --attn_softmax_bf16 True --do_train --do_eval --use_habana --use_lazy_mode --pipelining_fwd_bwd --throughput_warmup_steps 3 --lora_rank 4 --lora_target_modules "q_proj" "v_proj" "k_proj" "o_proj" --validation_split_percentage 4 --use_flash_attention True
{
"epoch": 1.0,
"eval_accuracy": 0.7912496336101612,
"eval_loss": 0.7647190690040588,
"eval_runtime": 26.8381,
"eval_samples": 125,
"eval_samples_per_second": 4.658,
"eval_steps_per_second": 0.075,
"max_memory_allocated (GB)": 81.58,
"memory_allocated (GB)": 26.91,
"perplexity": 2.148390740319044,
"total_memory_available (GB)": 94.62,
"train_loss": 0.8714751173288394,
"train_runtime": 1319.6586,
"train_samples_per_second": 2.672,
"train_steps_per_second": 0.034
}

@libinta, @schoi-habana - FYI.
log_lora_without_fixes.txt
log_lora_with_fixes.txt

@ghost merged commit 725a6a3 into HabanaAI:habana-main on Feb 29, 2024
schoi-habana pushed a commit that referenced this pull request Mar 1, 2024
…memory consumption (#65)

* Add mark step and inplace add.

Mark step helping in reducing workspace memory by
approx twice of (BS,seq len, hidden dim).

Inplace add helping in reducing persistent tensors by
approx twice of (BS, seq len, hidden dim).

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>

* Add lazy mode parameter

* Move mark step within the loop

* Move mark step before the loop

* Fix indentation

* update in place add only for inference

---------

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
puneeshkhanna pushed a commit to puneeshkhanna/optimum-habana-fork that referenced this pull request Mar 25, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 19, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 22, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 24, 2024
astachowiczhabana pushed a commit that referenced this pull request Apr 24, 2024
@astachowiczhabana

huggingface#833

astachowiczhabana pushed a commit that referenced this pull request Jan 17, 2025
ShengYang1 added a commit that referenced this pull request Jan 20, 2025
xinyu-intel pushed a commit that referenced this pull request Mar 4, 2025