
Gradient overflow when training 13B Llama model on 7 A100s #19

Open
@awrd2019

Description

[image attachment]

Getting gradient overflow and a skipped step every two or so steps while training the 13B Llama model on 7 A100s with a context window of 512. Below is the command line run. When I tried to configure ZeRO stage 3, or tried to get rid of gradient accumulation steps, the GPUs ran out of memory when loading the model at the start of training. Any suggestions on how to get rid of the gradient overflow issue, or on how to partition the model and load parts of it onto multiple GPUs at the start of training? Would be super grateful for help!

deepspeed --num_gpus=7 run_clm.py \
  --deepspeed ds_config_stage2.json \
  --model_name_or_path decapoda-research/llama-13b-hf \
  --train_file train.csv \
  --validation_file validation.csv \
  --do_train \
  --do_eval \
  --bf16 \
  --overwrite_cache \
  --evaluation_strategy=steps \
  --output_dir finetuned \
  --num_train_epochs 1 \
  --eval_steps 400 \
  --gradient_accumulation_steps 3 \
  --per_device_train_batch_size 2 \
  --use_fast_tokenizer False \
  --learning_rate 5e-06 \
  --warmup_steps 10 \
  --save_total_limit 1 \
  --save_steps 400 \
  --save_strategy steps \
  --load_best_model_at_end=True \
  --block_size=512 \
  --report_to=wandb
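
For context, the ds_config_stage2.json referenced above is not included in this issue. A minimal ZeRO stage 2 config of the kind this command expects, using the Hugging Face Trainer's "auto" placeholders, might look like the sketch below. This is an assumption about the file's shape, not the author's actual config (JSON does not allow comments, so all caveats are stated here).

{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}

Switching a config like this to ZeRO stage 3 (which shards the model parameters themselves across GPUs, i.e. the partitioning asked about above) would mean setting "stage": 3; whether that fits in memory in this setup is exactly the open question in this issue.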
