Description
System Info
transformers version 4.43.1; other package versions are listed here: https://github.com/allenai/open-instruct/blob/main/requirements.txt
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Running the following command using open-instruct:
unset CUDA_LAUNCH_BLOCKING && accelerate launch --mixed_precision bf16 --num_machines 2 --num_processes 16 --machine_rank $BEAKER_REPLICA_RANK --main_process_ip $BEAKER_LEADER_REPLICA_HOSTNAME --main_process_port 29400 --use_deepspeed --deepspeed_config_file configs/ds_configs/stage3_no_offloading_accelerate.conf --deepspeed_multinode_launcher standard open_instruct/finetune.py --model_name_or_path meta-llama/Meta-Llama-3.1-8B --tokenizer_name meta-llama/Meta-Llama-3.1-8B --use_slow_tokenizer --dataset_name allenai/tulu-v2-sft-mixture --use_flash_attn --max_seq_length 4096 --preprocessing_num_workers 16 --per_device_train_batch_size 1 --gradient_accumulation_steps 8 --learning_rate 5e-6 --lr_scheduler_type linear --warmup_ratio 0.03 --weight_decay 0. --num_train_epochs 2 --output_dir /output/ --with_tracking --report_to tensorboard --logging_steps 1 --reduce_loss sum
After updating to transformers 4.43.1 to support Llama 3.1 finetuning, we encounter this error on the first step of training:
2024-07-23T21:19:48.544516135Z /opt/miniconda3/lib/python3.10/site-packages/transformers/data/data_collator.py:656: UserWarning: Creating a tensor from a list of numpy.ndarrays is extremely slow. Please consider converting the list to a single numpy.ndarray with numpy.array() before converting to a tensor. (Triggered internally at ../torch/csrc/utils/tensor_new.cpp:274.)
2024-07-23T21:19:48.544518524Z batch["labels"] = torch.tensor(batch["labels"], dtype=torch.int64)
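(Side note: the UserWarning above is the data collator's performance hint, not the crash itself. A minimal sketch of the pattern it refers to, assuming equal-length label rows:)

```python
import numpy as np
import torch

labels = [np.array([1, 2, 3]), np.array([4, 5, 6])]

# Slow path flagged by the warning: a tensor built directly from a list of np.ndarrays.
slow = torch.tensor(labels, dtype=torch.int64)

# Path suggested in the warning text: stack into a single ndarray first.
fast = torch.tensor(np.array(labels), dtype=torch.int64)
```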
2024-07-23T21:19:49.155378393Z [rank2]: Traceback (most recent call last):
2024-07-23T21:19:49.155406373Z [rank2]: File "/stage/open_instruct/finetune.py", line 683, in <module>
2024-07-23T21:19:49.155409168Z [rank2]: main()
2024-07-23T21:19:49.155410556Z [rank2]: File "/stage/open_instruct/finetune.py", line 602, in main
2024-07-23T21:19:49.155412476Z [rank2]: outputs = model(**batch, use_cache=False)
2024-07-23T21:19:49.155413980Z [rank2]: File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-07-23T21:19:49.155415839Z [rank2]: return self._call_impl(*args, **kwargs)
2024-07-23T21:19:49.155417058Z [rank2]: File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
2024-07-23T21:19:49.155418501Z [rank2]: return forward_call(*args, **kwargs)
2024-07-23T21:19:49.155419655Z [rank2]: File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
2024-07-23T21:19:49.155421076Z [rank2]: ret_val = func(*args, **kwargs)
2024-07-23T21:19:49.155422228Z [rank2]: File "/opt/miniconda3/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1846, in forward
2024-07-23T21:19:49.155423640Z [rank2]: loss = self.module(*inputs, **kwargs)
2024-07-23T21:19:49.155424827Z [rank2]: File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
2024-07-23T21:19:49.155440561Z [rank2]: return self._call_impl(*args, **kwargs)
2024-07-23T21:19:49.155441869Z [rank2]: File "/opt/miniconda3/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1582, in _call_impl
2024-07-23T21:19:49.155443280Z [rank2]: result = forward_call(*args, **kwargs)
2024-07-23T21:19:49.155444498Z [rank2]: File "/opt/miniconda3/lib/python3.10/site-packages/transformers/models/llama/modeling_llama.py", line 1168, in forward
2024-07-23T21:19:49.155446074Z [rank2]: shift_logits = shift_logits.view(-1, self.config.vocab_size)
2024-07-23T21:19:49.155447329Z [rank2]: RuntimeError: shape '[-1, 0]' is invalid for input of size 41041920
Any idea what's going on? We're not sure whether other packages need to be updated, whether this is a known issue, or something else.
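For what it's worth, the failing line reshapes the shifted logits with self.config.vocab_size, and the '[-1, 0]' in the message suggests vocab_size is resolving to 0 at loss-computation time (41041920 = 320 * 128256, so the logits themselves look well-formed for Llama 3.1's 128256-token vocabulary). A minimal sketch that reproduces the same RuntimeError under that assumption:

```python
import torch

# Hypothetical repro: vocab_size assumed to be 0, as the '[-1, 0]' in the error implies.
logits = torch.randn(1, 321, 128256)             # [batch, seq_len, vocab] as in Llama 3.1
shift_logits = logits[..., :-1, :].contiguous()  # 320 * 128256 = 41041920 elements
vocab_size = 0
shift_logits.view(-1, vocab_size)                # RuntimeError: shape '[-1, 0]' is invalid for input of size 41041920
```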
Expected behavior
Llama 3.1 finetuning to run successfully