Significantly increased VRAM usage for Mixtral qlora training compared to 4.36.2? #28339
Comments
Hey! Thanks for the report, here are potential PRs that I would suspect:
This may not be relevant to you, but I found this recent change to Axolotl has made a significant difference to VRAM usage. Previously I could just squeeze in a LoRA on a 34B model on my 3x3090s at batch size 2, seq length 4096; now it OOMs immediately. I undid the change and it fits again.
Hmm, it's certainly possible, since that commit landed between my initial train and the run where I had to drop the batch size. Unfortunately I don't have a training instance up right now, so I'd have to test it the next time I try to train.
I've determined that the cause of the increased VRAM usage was indeed axolotl changing the default for use_reentrant to False for gradient checkpointing. Going to go ahead and close the issue.
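For anyone hitting the same regression: recent transformers releases (4.35 and later) let you select the checkpointing variant explicitly through gradient_checkpointing_kwargs. A minimal sketch of forcing the reentrant path back on (the lower-VRAM behavior described above); the output directory and surrounding setup are placeholders, not taken from the actual training setup:

```python
# Minimal sketch (not the axolotl code path): explicitly selecting the
# reentrant gradient-checkpointing variant via transformers' TrainingArguments.
# output_dir is a placeholder.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="out",
    gradient_checkpointing=True,
    # use_reentrant=True restores the older, lower-VRAM behavior discussed above;
    # use_reentrant=False is the default axolotl had switched to.
    gradient_checkpointing_kwargs={"use_reentrant": True},
)

# The same option can also be set directly on a loaded model:
# model.gradient_checkpointing_enable(
#     gradient_checkpointing_kwargs={"use_reentrant": True}
# )
```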
thanks for sharing the solution! 🤗
System Info
The environment is a Runpod container with Python 3.10, a single A100 80GB, and transformers 4.37.0dev (3cefac1), using the axolotl training script (https://github.com/OpenAccess-AI-Collective/axolotl).
Who can help?
No response
Reproduction
Hello, just tried doing a training run on the dev version of transformers (as of 3cefac1) via the common training repository axolotl (https://github.com/OpenAccess-AI-Collective/axolotl) and noticed that I went OOM using the same configuration that I had previously used successfully with transformers 4.36.2 stable. And not even just a small difference - I had to reduce my batch size by 4x to make the training fit in VRAM.
I was previously able to fit 8192 ctx, batch size 4, grad accum steps 2 without difficulty, but I found that I now had to reduce my batch size to 1 to avoid OOM. The relevant training hyperparameters are the ones just noted: 8192 sequence length, batch size 4, grad accum steps 2.
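A rough, purely illustrative sketch of a qlora setup with those values is below; the model id, LoRA rank/alpha, target modules, and optimizer are assumptions made for the sake of the example, not taken from the actual axolotl config:

```python
# Illustrative qlora sketch only, not the actual axolotl config used for this run.
# Values carried over from the description above: 8192 ctx, batch size 4, grad accum 2.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

model_id = "mistralai/Mixtral-8x7B-v0.1"  # assumed base model

# 4-bit NF4 quantization, the usual qlora setup
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# LoRA adapter; rank, alpha, and target modules are illustrative guesses
lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

training_args = TrainingArguments(
    output_dir="out",
    per_device_train_batch_size=4,   # the batch size that previously fit
    gradient_accumulation_steps=2,   # grad accum steps 2
    gradient_checkpointing=True,
    gradient_checkpointing_kwargs={"use_reentrant": True},  # the lower-VRAM path
    bf16=True,
    optim="paged_adamw_8bit",        # illustrative optimizer choice
)
# The 8192-token context would be enforced when tokenizing / packing the dataset.
```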
Would appreciate any insights into what caused the massive increase in memory usage. I noticed that ehartford's latest dolphin 2.7 qlora used a batch size of 3 per device at 16k ctx on an A100 80GB, so surely I'm missing something here?
Expected behavior
The training run should use roughly the same amount of VRAM as it did previously with the same config.