
Gradient overflow when training 13B Llama model on 7 A100s #19

Open
@awrd2019

Description

[image attachment]

Getting gradient overflow and a skipped step every two or so steps while training the 13B Llama model on 7 A100s with a context window of 512. Below is the command line run. When I tried to configure ZeRO stage 3, or tried to get rid of gradient accumulation steps, the GPUs ran out of memory when loading the model at the start of training. Any suggestions on how to get rid of the gradient overflow issue, or on how to partition the model and load parts of it onto multiple GPUs at the start of training? Would be super grateful for help!

deepspeed --num_gpus=7 run_clm.py \
  --deepspeed ds_config_stage2.json \
  --model_name_or_path decapoda-research/llama-13b-hf \
  --train_file train.csv \
  --validation_file validation.csv \
  --do_train \
  --do_eval \
  --bf16 \
  --overwrite_cache \
  --evaluation_strategy=steps \
  --output_dir finetuned \
  --num_train_epochs 1 \
  --eval_steps 400 \
  --gradient_accumulation_steps 3 \
  --per_device_train_batch_size 2 \
  --use_fast_tokenizer False \
  --learning_rate 5e-06 \
  --warmup_steps 10 \
  --save_total_limit 1 \
  --save_steps 400 \
  --save_strategy steps \
  --load_best_model_at_end=True \
  --block_size=512 \
  --report_to=wandb
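
For context, the ds_config_stage2.json referenced above is not included in this issue. A minimal ZeRO stage 2 config of the kind this command expects, using the Hugging Face Trainer's "auto" placeholders, might look like the sketch below. This is an assumption about the file's shape, not the author's actual config (JSON does not allow comments, so all caveats are stated here).

{
  "bf16": { "enabled": "auto" },
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": "auto"
  },
  "gradient_accumulation_steps": "auto",
  "gradient_clipping": "auto",
  "train_micro_batch_size_per_gpu": "auto",
  "train_batch_size": "auto"
}

Switching a config like this to ZeRO stage 3 (which shards the model parameters themselves across GPUs, i.e. the partitioning asked about above) would mean setting "stage": 3; whether that fits in memory in this setup is exactly the open question in this issue.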
