[BUG] Tensors are on different devices when model.step() #5422
Describe the bug
The behavior is the same as reported in #4565. When calling model.step() with ZeRO stage 3, tensors end up on different devices. I modified stage3.py#L2117 to self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item()), and with that change it seems to work correctly. I am not sure whether this is a bug in DeepSpeed or a mistake in my training script.
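To illustrate the workaround in isolation, here is a minimal sketch and not the actual DeepSpeed code. It assumes the gradient partition lives on the CPU (my config file name suggests offload is enabled) while combined_scale is a 0-dim tensor that may live on the GPU. The helper name unscale_grad_ and the tensor values are made up for the example.

```python
import torch


def unscale_grad_(grad: torch.Tensor, combined_scale: torch.Tensor) -> torch.Tensor:
    """Divide grad in place by a 0-dim scale tensor (hypothetical helper)."""
    # .item() copies the scale into a Python float, so the in-place multiply
    # only involves the device that grad already lives on.
    return grad.mul_(1.0 / combined_scale.item())


# Assumption: the scale sits on the GPU when one is available.
scale_device = "cuda" if torch.cuda.is_available() else "cpu"

grad = torch.full((4,), 8.0)  # stays on the CPU, like an offloaded partition
combined_scale = torch.tensor(4.0, device=scale_device)

unscale_grad_(grad, combined_scale)
```

Note that .item() forces a device-to-host synchronization when the scale is on the GPU, so this trades a small sync cost for avoiding the device mismatch.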
My code
My training code is taken from https://github.com/liucongg/ChatGLM-Finetuning/blob/master/train.py, and I run it with:
CUDA_VISIBLE_DEVICES=0 deepspeed train.py --train_path data/spo_0.json --model_name_or_path ./opensource --per_device_train_batch_size 8 --max_len 4096 --max_src_len 2048 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_step 4 --warmup_ratio 0.1 --mode glm3 --train_type all --seed 1234 --ds_file ds_zero3_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm3
Thanks very much for your precious time!