Static graph: GradientMergeOptimizer conflicts with main_program.clone(for_test=True)  #43571

Open
@ZHUI

Description

Describe the Bug

[2022-06-16 11:27:32,284] [    INFO] - The training meta optimizer is/are ['GradientMergeOptimizer', 'AMPOptimizer']
W0616 11:27:33.298504  6074 gpu_context.cc:278] Please NOTE: device: 1, GPU Compute Capability: 7.0, Driver API Version: 10.2, Runtime API Version: 10.2
W0616 11:27:33.305351  6074 gpu_context.cc:306] device: 1, cuDNN Version: 7.6.
Traceback (most recent call last):
  File "run_pretrain_static.py", line 677, in <module>
    do_train(config)
  File "run_pretrain_static.py", line 489, in do_train
    test_program = main_program.clone(for_test=True)
  File "/ssd2/zhonghui03/anaconda3/envs/py37/lib/python3.7/site-packages/paddle/fluid/framework.py", line 5419, in clone
    self.desc)
RuntimeError: (NotFound) The origin sub block id is not found in pruned_progin_block_id_map
  [Hint: Expected sub_idx != -1, but received sub_idx:-1 == -1:-1.] (at /paddle/paddle/fluid/framework/prune.cc:511)

INFO 2022-06-16 11:27:48,019 launch_utils.py:343] terminate all the procs
INFO 2022-06-16 11:27:48,019 launch_utils.py:343] terminate all the procs
ERROR 2022-06-16 11:27:48,019 launch_utils.py:642] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.
ERROR 2022-06-16 11:27:48,019 launch_utils.py:642] ABORT!!! Out of all 1 trainers, the trainer process with rank=[0] was aborted. Please check its log.

INFO 2022-06-16 11:27:52,023 launch_utils.py:343] terminate all the procs
INFO 2022-06-16 11:27:52,023 launch_utils.py:343] terminate all the procs
INFO 2022-06-16 11:27:52,023 launch.py:402] Local processes completed.
INFO 2022-06-16 11:27:52,023 launch.py:402] Local processes completed.

Additional Supplementary Information

Code location:
https://github.com/PaddlePaddle/PaddleNLP/tree/develop/applications/neural_search/recall/domain_adaptive_pretraining
Reproduction script:

python -u -m paddle.distributed.launch \
    --gpus "1" \
    --log_dir "output/$task_name/log" \
    run_pretrain_static.py \
    --model_type "ernie" \
    --model_name_or_path "ernie-1.0-base-zh" \
    --input_dir "./data" \
    --split 8,1,1 \
    --output_dir "output/$task_name" \
    --max_seq_len 128 \
    --micro_batch_size 32 \
    --global_batch_size 64 \
    --sharding_degree 1 \
    --dp_degree 1 \
    --use_sharding false \
    --use_amp true \
    --use_recompute false \
    --max_lr 0.0001 \
    --min_lr 0.00001 \
    --max_steps 2000 \
    --save_steps 100000 \
    --checkpoint_steps 5000 \
    --decay_steps 1980 \
    --weight_decay 0.01 \
    --warmup_rate 0.01 \
    --grad_clip 1.0 \
    --num_workers 2 \
    --logging_freq 20 \
    --eval_freq 1000 \
    --device "gpu"

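Here global_batch_size (64) = micro_batch_size (32) * 2, so the script automatically enables gradient accumulation (GradientMergeOptimizer). With gradient accumulation enabled, main_program.clone(for_test=True) fails with the error shown above.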
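To make the trigger condition concrete, here is a minimal sketch of how the number of accumulation steps can be derived from the flags in the reproduction command. The helper name `accumulate_steps` and the exact formula are assumptions for illustration; I have not confirmed that run_pretrain_static.py computes it this way, only that the values above (64 and 32 on a single GPU) result in gradient accumulation being used.

```python
def accumulate_steps(global_batch_size, micro_batch_size,
                     dp_degree=1, sharding_degree=1):
    """Hypothetical helper: number of micro-batches merged per optimizer step.

    Assumption: the global batch is split evenly across the data-parallel
    and sharding ranks, and each rank processes micro_batch_size samples
    per forward/backward pass.
    """
    samples_per_pass = micro_batch_size * dp_degree * sharding_degree
    if global_batch_size % samples_per_pass != 0:
        raise ValueError(
            "global_batch_size must be a multiple of "
            "micro_batch_size * dp_degree * sharding_degree"
        )
    return global_batch_size // samples_per_pass


# Values taken from the reproduction command in this issue.
steps = accumulate_steps(global_batch_size=64, micro_batch_size=32,
                         dp_degree=1, sharding_degree=1)
print(steps)  # 2, i.e. more than one step, so gradient accumulation is needed
```

Under this assumption, setting --global_batch_size 32 with the same --micro_batch_size would give a single step and should not need gradient accumulation, which may help narrow down whether the failure is specific to GradientMergeOptimizer.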
