Fix for OOM Errors during Ultrachat200k Finetuning #2180

Satrat · 2024-03-13T20:41:06Z

The following SparseZoo recipes were causing OOM errors even with FSDP:

"zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned40"
"zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned40_quantized"

The fix was to offload weights to CPU when gathering FSDP parameters during modifier initialization and finalization.
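The offloading described above can be sketched with PyTorch's `FullyShardedDataParallel.summon_full_params` context manager, which accepts an `offload_to_cpu` flag. This is a minimal illustration and not the code from this PR: the helper name `gather_full_params` is hypothetical, and the sketch assumes the root model is itself the FSDP wrapper.

```python
from contextlib import nullcontext

try:
    from torch.distributed.fsdp import FullyShardedDataParallel
except Exception:  # torch missing or built without distributed support
    FullyShardedDataParallel = None


def gather_full_params(model, offload_to_cpu=True):
    """Hypothetical helper: return a context manager exposing full parameters.

    For an FSDP-wrapped model, the unsharded parameters are gathered and,
    when offload_to_cpu is True, placed in CPU memory instead of on the GPU.
    For any other object, a no-op context is returned.
    """
    if FullyShardedDataParallel is not None and isinstance(
        model, FullyShardedDataParallel
    ):
        return FullyShardedDataParallel.summon_full_params(
            model, offload_to_cpu=offload_to_cpu
        )
    return nullcontext()
```

A modifier's initialization or finalization step could then wrap its parameter access in `with gather_full_params(model): ...`, so the gathered weights do not have to fit in GPU memory next to the sharded copy.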

Test

Tested on six 48 GB GPUs. The example runs only a few training samples for ease of testing.

from sparseml.transformers import compress, SparseAutoModelForCausalLM, SparseAutoTokenizer

model = SparseAutoModelForCausalLM.from_pretrained("zoo:llama2-7b-ultrachat200k_llama2_pretrain-base")
teacher = SparseAutoModelForCausalLM.from_pretrained("zoo:llama2-7b-ultrachat200k_llama2_pretrain-base")
tokenizer = SparseAutoTokenizer.from_pretrained("zoo:llama2-7b-ultrachat200k_llama2_pretrain-base")
dataset = "open_platypus"
MODEL_STUB = "zoo:llama2-7b-ultrachat200k_llama2_pretrain-pruned40"

compress(
    model=model,
    distill_teacher=teacher,
    tokenizer=tokenizer,
    dataset=dataset,
    recipe=MODEL_STUB,
    output_dir="./output",
    gradient_checkpointing=True,
    num_train_epochs=0.02,
)

Run with FSDP: accelerate launch --config_file fsdp_config.yaml test.py

* testing fix
* get rid of repeated log
* revert yaml

Sara Adkins added 3 commits March 13, 2024 19:00

testing fix

8bee1a5

get rid of repeated log

8f8e579

revert yaml

e555504

Satrat marked this pull request as ready for review March 13, 2024 20:46

Satrat requested review from dbogunowicz, bfineran, dsikka, rahul-tuli and horheynm March 13, 2024 20:46

Merge branch 'main' into oom_debug

96b0625

rahul-tuli approved these changes Mar 13, 2024


bfineran approved these changes Mar 13, 2024


Satrat merged commit 965bdfa into main Mar 14, 2024

Satrat deleted the oom_debug branch March 14, 2024 03:05

Satrat pushed a commit that referenced this pull request Mar 14, 2024

Fix for OOM Errors during Ultrachat200k Finetuning (#2180)

1320669


Satrat pushed a commit that referenced this pull request Mar 14, 2024

Fix for OOM Errors during Ultrachat200k Finetuning (#2180) (#2181)

3bf79ad

