[bug]AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group' #525

Open
@qingchu123

My training environment is a Docker image pulled from deepspeed/deepspeed:v072_torch112_cu117.
I run it inside an overlay Docker network with:
docker run -it --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --network train-net --name fuyx-work -v /home/fuyx/big_disk_1000/DeepSpeedExamples/applications/DeepSpeed-Chat:/root/DeepSpeed-Chat b1d
After completing the previous two steps, I ran the last step with:
python train.py --actor-model facebook/opt-1.3b --reward-model facebook/opt-350m --deployment-type multi_node --step 3
My hostfile is:

jes-work slots=1
fuyx-work slots=1

and I get this error:

jes-work: Traceback (most recent call last):
jes-work:   File "main.py", line 522, in <module>
jes-work:     main()
jes-work:   File "main.py", line 390, in main
jes-work:     rlhf_engine = DeepSpeedRLHFEngine(
jes-work:   File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 48, in __init__
jes-work:     self.actor = self._init_actor(
jes-work:   File "/root/DeepSpeed-Chat/training/step3_rlhf_finetuning/rlhf_engine.py", line 119, in _init_actor
jes-work:     actor_engine, *_ = deepspeed.initialize(model=actor_model,
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/__init__.py", line 153, in initialize
jes-work:     engine = DeepSpeedHybridEngine(args=args,
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 52, in __init__
jes-work:     self.create_inference_module()
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 359, in create_inference_module
jes-work:     self.create_inference_containers(self.module)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 308, in create_inference_containers
jes-work:     self.create_inference_containers(child, layer_id=layer_id)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 288, in create_inference_containers
jes-work:     self._inference_containers.append(self.inference_policies[child.__class__][0](
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/hybrid_engine.py", line 107, in new_inference_container
jes-work:     _container.set_tensor_parallel_config(self._config.hybrid_engine.inference_tp_size, self.mp_group)
jes-work:   File "/opt/conda/lib/python3.8/site-packages/deepspeed/runtime/engine.py", line 461, in __getattr__
jes-work:     raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
jes-work: AttributeError: 'DeepSpeedHybridEngine' object has no attribute 'mp_group'
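
For readers unfamiliar with this failure mode: the last two frames show the error being raised from a custom `__getattr__` in engine.py, which Python only calls after normal attribute lookup has already failed. The sketch below is a minimal illustration of that mechanism, not DeepSpeed's code. The class names, the `set_mp_group` flag, and the fallback-to-module behavior are assumptions made for the example.

```python
class Engine:
    """Hypothetical stand-in for an engine with a fallback __getattr__."""

    def __init__(self, module, set_mp_group=False):
        self.module = module
        if set_mp_group:
            # Assumption for illustration: the attribute is only assigned
            # on some code paths, so it may not exist when it is first read.
            self.mp_group = None

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup has failed.
        module = self.__dict__.get("module")
        if module is not None and hasattr(module, name):
            return getattr(module, name)
        raise AttributeError(
            f"'{type(self).__name__}' object has no attribute '{name}'")


class HybridEngine(Engine):
    def use_mp_group(self):
        # Reading an attribute that was never assigned ends up in __getattr__.
        return self.mp_group


def try_use(engine):
    """Return 'ok' or the AttributeError message."""
    try:
        engine.use_mp_group()
        return "ok"
    except AttributeError as exc:
        return str(exc)
```

In this sketch, `try_use(HybridEngine(object()))` returns the same style of message as the traceback, while passing `set_mp_group=True` returns `"ok"`. This suggests, without proving it, that `mp_group` was read before it was assigned in this configuration.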

The deepspeed command is below. I have not changed anything except reducing some batch sizes to ease the GPU memory pressure:

deepspeed --master_port 12346 \
   --hostfile=hostfile \
   main.py \
   --data_path Dahoas/rm-static \
   --data_split 2,4,4 \
   --actor_model_name_or_path $ACTOR_MODEL_PATH \
   --critic_model_name_or_path $CRITIC_MODEL_PATH \
   --num_padding_at_beginning 1 \
   --per_device_train_batch_size 1 \
   --per_device_mini_train_batch_size 1 \
   --generation_batch_numbers 1 \
   --ppo_epochs 1 \
   --max_answer_seq_len 256 \
   --max_prompt_seq_len 256 \
   --actor_learning_rate ${Actor_Lr} \
   --critic_learning_rate ${Critic_Lr} \
   --actor_weight_decay 0.1 \
   --critic_weight_decay 0.1 \
   --num_train_epochs 1 \
   --lr_scheduler_type cosine \
   --gradient_accumulation_steps 1 \
   --num_warmup_steps 100 \
   --deepspeed --seed 1234 \
   --enable_hybrid_engine \
   --inference_tp_size 8 \
   --tp_gather_partition_size 4 \
   --actor_zero_stage $ACTOR_ZERO_STAGE \
   --critic_zero_stage $CRITIC_ZERO_STAGE \
   --actor_gradient_checkpointing \
   --disable_actor_dropout \
   --actor_lora_dim 128 \
   --actor_lora_module_name decoder.layers. \
   --output_dir $OUTPUT \
    &> $OUTPUT/training.log
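
One detail that may be worth checking: the hostfile above declares one slot per host, while the command passes `--inference_tp_size 8`. Whether this mismatch is related to the error is not established here. The following is only a small sketch for counting the slots declared in a hostfile of the `hostname slots=N` form; the helper name `total_slots` is hypothetical and not part of DeepSpeed.

```python
def total_slots(hostfile_text):
    """Sum the slots=N values of a hostfile in 'hostname slots=N' format."""
    total = 0
    for line in hostfile_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        _host, slots = line.split()
        total += int(slots.split("=", 1)[1])
    return total


# Hostfile contents as posted in this issue.
HOSTFILE = """\
jes-work slots=1
fuyx-work slots=1
"""

# Value passed via --inference_tp_size in the command above.
INFERENCE_TP_SIZE = 8

gpus = total_slots(HOSTFILE)
print(gpus, INFERENCE_TP_SIZE <= gpus)  # prints: 2 False
```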

Labels: bug, deespeed chat, hybrid engine
