When trying to start a full fine-tune of Llama 7B on a 4×V100 instance (using this config without bf16; I also tried other variations, e.g. with fsdp_transformer_layer_cls_to_wrap: LlamaDecoderLayer) with accelerate, CPU RAM fills up until the process is terminated.
I thought that #25107 should have solved this, but whatever I do, I can't get it to work. Could the Volta arch be a reason for this?
Information
The official example scripts
My own modified scripts
Tasks
An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
Hello, can you share the training code? Please make sure that the torch distributed process group is already initialized before loading the pretrained model. When using Trainer, make sure the TrainingArguments object is created before loading the pretrained model, as it initializes the torch distributed process group.
from transformers import AutoModelForCausalLM, TrainingArguments

# Create TrainingArguments first: this initializes the torch.distributed process group.
training_arguments = TrainingArguments(
    ...
)
# Only then load the pretrained model.
model = AutoModelForCausalLM.from_pretrained(...)
...
This is because we want only the main process to load the pretrained model while all other processes are initialized with empty weights. For this to happen, the process group needs to be initialized via torch.distributed.init_process_group, which happens when the TrainingArguments object is created. See the check here needed for RAM-efficient FSDP loading.
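For concreteness, here is a minimal sketch of that ordering with Trainer under accelerate launch. The checkpoint name, output_dir, and train_dataset are assumed placeholders, and the FSDP settings are assumed to come from the accelerate config rather than being passed explicitly:

from transformers import AutoModelForCausalLM, AutoTokenizer, Trainer, TrainingArguments

# 1) Create TrainingArguments before touching the model: when launched with
#    `accelerate launch` (or torchrun), this initializes the torch.distributed
#    process group that the RAM-efficient loading path checks for.
training_args = TrainingArguments(
    output_dir="llama-7b-fsdp",          # hypothetical output path
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    num_train_epochs=1,
)

# 2) Only now load the pretrained weights, so that only the main process has to
#    materialize the full checkpoint in CPU RAM while the other ranks start with
#    empty weights.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,         # assumed to be prepared earlier
    tokenizer=tokenizer,
)
trainer.train()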
System Info
transformers version: 4.34.0.dev0
Who can help?
@pacman100
Reproduction
accelerate config:
seq_len 2048, Llama-2-7B, happens with different datasets, 4×V100, 173 GB RAM
Expected behavior
Model loads and fine-tuning works with FSDP.