[BUG] Tensors are on different devices when model.step() #5422

Closed
@yuezhao238

Description

Describe the bug
The behavior is the same as what is reported in #4565. When calling model.step() with ZeRO stage 3 (zero3), tensors end up on different devices. I modified stage3.py#L2117 to self.fp32_partitioned_groups_flat[sub_group_id].grad.mul_(1. / combined_scale.item()), and that seems to fix it. I am not sure whether this is a bug or a mistake in my training script.
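
For readers hitting the same error, here is a minimal sketch of why the .item() change might help. It assumes the mismatch comes from an fp32 gradient partition living on the CPU (as the offload config name suggests) while combined_scale lives on the GPU; the issue itself does not confirm this. The helper unscale_grad_ is hypothetical and is not part of DeepSpeed, and the original line at stage3.py#L2117 is not reproduced here. PyTorch typically rejects an in-place multiply of a CPU tensor by a CUDA tensor, whereas .item() first copies the scale to the host as a plain Python float.

```python
import torch


def unscale_grad_(grad: torch.Tensor, combined_scale) -> torch.Tensor:
    """Divide `grad` in place by `combined_scale` (hypothetical helper, not DeepSpeed API)."""
    # .item() copies a one-element tensor to the host as a Python float, so the
    # multiply below involves only `grad` and a scalar, not a second device.
    if isinstance(combined_scale, torch.Tensor):
        combined_scale = combined_scale.item()
    return grad.mul_(1.0 / combined_scale)


# Stand-in for an offloaded fp32 gradient partition kept on the CPU (assumption).
cpu_grad = torch.full((4,), 8.0)

# Place the scale on the GPU when one is available, to mimic the mismatch.
scale_device = "cuda" if torch.cuda.is_available() else "cpu"
scale = torch.tensor(4.0, device=scale_device)

unscale_grad_(cpu_grad, scale)
print(cpu_grad)  # tensor([2., 2., 2., 2.])
```

Note that .item() forces a device-to-host synchronization, so this trades a small stall for correctness. Whether that is the right fix inside DeepSpeed itself is a question for the maintainers.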

My code
My training code is taken from https://github.com/liucongg/ChatGLM-Finetuning/blob/master/train.py, and I run it with:

CUDA_VISIBLE_DEVICES=0 deepspeed train.py --train_path data/spo_0.json --model_name_or_path ./opensource --per_device_train_batch_size 8 --max_len 4096 --max_src_len 2048 --learning_rate 1e-4 --weight_decay 0.1 --num_train_epochs 2 --gradient_accumulation_step 4 --warmup_ratio 0.1 --mode glm3 --train_type all --seed 1234 --ds_file ds_zero3_offload.json --gradient_checkpointing --show_loss_step 10 --output_dir ./output-glm3

Thanks very much for your precious time!

Labels: bug, training