
[HybridParallel] Add gpt example using dygraph hybrid parallel #986

Merged (10 commits, merged Sep 9, 2021)

Conversation

@ForFishes (Member) commented on Sep 8, 2021

PR types

Others

PR changes

Others

Description

[HybridParallel]Add gpt example using dygraph hybrid parallel
Adds a GPT-3 code example using dygraph hybrid parallelism. Accuracy is aligned with single-card training.

Performance data with global batch size = 256 on 8 machines (64 GPUs):

| Model config (layers, hidden_size, heads) | Precision | Strategy | Megatron | Paddle dygraph | Paddle static graph (fake data) |
|---|---|---|---|---|---|
| 7B_16_6144_128 | fp32 | dp1_pp8_mp8 | 11647 | 11485 (-1.4%) | 11593 |
| 7B_16_6144_128 | fp16 | dp1_pp8_mp8 | 47801 | 47300 (-1%) | 40644 |
| 14B_32_6144_128 | fp32 | dp1_pp8_mp8 | OOM | OOM | 6634 |
| 14B_32_6144_128 | fp32 | dp1_pp8_mp8_recompute | 5156 | 5865 (+13.7%) | 5787 |
| 14B_32_6144_128 | fp16 | dp1_pp8_mp8 | 27911 | 27429 (-1.8%) | 23261 |
| 14B_32_6144_128 | fp16 | dp1_pp8_mp8_recompute | 21751 | 22592 (+3.8%) | 17864 |

@ZHUI (Collaborator) left a comment:

Great work 👍🏻

# )

parser.add_argument(
"--local_batch_size",
Collaborator:

Why local_batch_size?

Member Author:

global_batch_size should equal local_batch_size * dp_degree.
micro_batch refers to splitting local_batch_size into several smaller batches for pipeline efficiency in pp training: local_batch_size = micro_batch_size * accumulate_step.
global_batch_size = local_batch_size * dp_degree

Collaborator:

global_batch_size = micro_batch_size * accumulate_step * dp_degree (* sharding_degree)
Actually, if you set global_batch_size here, accumulate_step can be derived from it and the accumulation done automatically.

One benefit of setting global_batch_size is that it makes resuming training convenient: as long as global_batch_size stays the same, the state saved at a given step is identical whether you train on one machine or many.

Member Author:

Yes, agreed. Fixed: added global_batch_size and added checks on these relationships, so global_batch_size can now be specified without specifying local_batch_size.
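A minimal sketch of the relationship check described above; the argument and variable names (args.*, dp_degree, sharding_degree) are illustrative assumptions, not necessarily the PR's exact code.

```python
# Sketch only: derive local_batch_size / accumulate_steps from global_batch_size
# and validate the configuration.
def resolve_batch_sizes(args, dp_degree, sharding_degree=1):
    data_parallel = dp_degree * sharding_degree
    if args.local_batch_size is None:
        # global_batch_size = local_batch_size * dp_degree (* sharding_degree)
        assert args.global_batch_size % data_parallel == 0, \
            "global_batch_size must be divisible by dp_degree * sharding_degree"
        args.local_batch_size = args.global_batch_size // data_parallel
    else:
        assert args.global_batch_size == args.local_batch_size * data_parallel, \
            "global_batch_size, local_batch_size and dp_degree are inconsistent"
    # local_batch_size = micro_batch_size * accumulate_steps
    assert args.local_batch_size % args.micro_batch_size == 0
    accumulate_steps = args.local_batch_size // args.micro_batch_size
    return args.local_batch_size, accumulate_steps
```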

num_samples_ = sample_idx.shape[0] - 1
shuffle_idx = _build_shuffle_idx(num_samples_, sample_idx.shape[0] - 1,
np_rng)
if paddle.distributed.get_rank() % 8 == 0:
Collaborator:

8 is hard-coded here; you can pass local_rank in instead, as in #930.

Member Author:

ok, will fix it
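A sketch of that suggestion; the helper name and signature are hypothetical, assuming the caller already knows the per-node rank.

```python
# Hypothetical helper: write the cached index files from only the first process
# on each node, instead of relying on get_rank() % 8 (which assumes exactly
# 8 GPUs per machine).
import numpy as np

def save_index_files(sample_idx, sample_idx_filename,
                     shuffle_idx, shuffle_idx_filename, local_rank):
    if local_rank == 0:
        np.save(sample_idx_filename, sample_idx)
        np.save(shuffle_idx_filename, shuffle_idx)
```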

eos_id=eos_id,
seed=args.seed)

batch_sampler = paddle.io.DistributedBatchSampler(
Collaborator:

If we don't use shuffle, we can use

from paddlenlp.utils.batch_sampler import DistributedBatchSampler

here, as in https://github.com/PaddlePaddle/PaddleNLP/pull/930/files.

paddlenlp.utils.batch_sampler can save more memory.

Member Author:

done, thx
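A minimal sketch of that swap, assuming the PaddleNLP sampler accepts the same constructor arguments as paddle.io.DistributedBatchSampler; treat the argument names as assumptions.

```python
# Sketch only: use the memory-saving PaddleNLP sampler when shuffle is not needed.
from paddlenlp.utils.batch_sampler import DistributedBatchSampler

batch_sampler = DistributedBatchSampler(
    train_dataset,                       # dataset built earlier in run_pretrain.py
    batch_size=args.local_batch_size,
    shuffle=False,
    drop_last=True)
```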

sample_idx_filename = _filename + '_sample_idx.npy'
shuffle_idx_filename = _filename + '_shuffle_idx.npy'

# support multi-machines
Collaborator:

How about saving the seed and recovering it later?
I can't easily verify the multi-machine case myself, so could you try this seed-based approach first?
I'm worried about breaking your work if I change it later.
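A rough sketch of that idea; the file name and helper are hypothetical. The point is to persist the shuffle seed next to the cached index files so every machine can rebuild the same shuffle_idx deterministically.

```python
# Hypothetical sketch: save/restore the shuffle seed so multi-machine runs
# rebuild identical shuffle_idx arrays instead of copying .npy files around.
import json
import numpy as np

def get_shuffle_rng(seed, seed_filename):
    try:
        with open(seed_filename) as f:
            seed = json.load(f)["seed"]   # recover the seed written earlier
    except FileNotFoundError:
        with open(seed_filename, "w") as f:
            json.dump({"seed": seed}, f)  # first run: record the seed
    return np.random.RandomState(seed)
```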

@@ -0,0 +1,57 @@
#wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/train.data.json_ids.npz
Collaborator:

delete

Member Author:

done

#mkdir data
#mv train.data.json_ids.npz data

export DATA_DIR=./data
Collaborator:

delete

Member Author:

done

"micro_batch_size": args.micro_batch_size
}

fleet.init(is_collective=True, strategy=strategy)
Collaborator:

Could this big block of seed setup be wrapped into a helper utility function for easier reuse?

Member Author:

ok, done
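A minimal sketch of such a helper; the function name, seed offsets, and the use of fleet's RNG state tracker are assumptions, not necessarily what the PR landed.

```python
# Sketch: one reusable helper that seeds Python/NumPy/Paddle per hybrid-parallel
# rank. Offsets are illustrative; the tracker call assumes Paddle's dygraph
# hybrid-parallel RNG state tracker is available.
import random
import numpy as np
import paddle
from paddle.distributed.fleet.meta_parallel import get_rng_state_tracker


def set_hybrid_seed(basic_seed, dp_rank, mp_rank):
    # Seeds shared by every rank in the same data-parallel group.
    global_seed = basic_seed + dp_rank
    random.seed(global_seed)
    np.random.seed(global_seed)
    paddle.seed(global_seed)

    # A distinct seed per model-parallel rank, e.g. for dropout inside mp layers.
    local_seed = basic_seed + 1024 + mp_rank * 10 + dp_rank * 1000
    tracker = get_rng_state_tracker()
    tracker.add("global_seed", global_seed)
    tracker.add("local_seed", local_seed)
```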

args.output_dir, "train_log",
"{}_globalbsz_{}_amp_{}_recompute_{}_card_{}".format(
args.model_name_or_path, default_global_batch_size, args.use_amp,
False, worker_index).lower())
Collaborator:

worker_index = dp_rank was set earlier. For mp ranks that belong to the same dp group, won't this log be written twice?

Member Author:

Yes, I'll fix it.
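A small sketch of one possible fix; the hcg accessor names and the visualdl LogWriter backend are assumptions. Only one rank per data-parallel group creates the log writer.

```python
# Sketch only: write the train log from a single rank per dp group so that
# mp/pp ranks sharing the same dp_rank do not produce duplicate files.
from paddle.distributed import fleet
from visualdl import LogWriter  # assumption: visualdl is the logging backend

hcg = fleet.get_hybrid_communicate_group()
if hcg.get_model_parallel_rank() == 0 and hcg.get_stage_id() == 0:
    log_writer = LogWriter(logdir=log_writer_path)  # path built as in the snippet above
```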

model_to_save = model._layers
else:
model_to_save = model
logger.info("Save model to %s" % output_dir)
Collaborator:

Is saving in the MP case a problem here? Each mp shard should probably be saved under a different name?

Member Author:

Yes, this needs fixing: save each shard under a different name/folder.
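A minimal sketch of saving each shard to its own directory; the hcg accessors and directory layout are assumptions, while output_dir and model_to_save come from the snippet above.

```python
# Sketch only: give every mp/pp shard its own output directory so the
# state dicts do not overwrite each other.
import os
import paddle
from paddle.distributed import fleet

hcg = fleet.get_hybrid_communicate_group()
mp_rank = hcg.get_model_parallel_rank()
pp_rank = hcg.get_stage_id()

shard_dir = os.path.join(output_dir, "mp_{:02d}_pp_{:02d}".format(mp_rank, pp_rank))
os.makedirs(shard_dir, exist_ok=True)
paddle.save(model_to_save.state_dict(),
            os.path.join(shard_dir, "model_state.pdparams"))
```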


parser.add_argument(
"--scale_loss",
type=float,
default=128,
Collaborator:

Change this to the default value that is actually used.

Member Author:

ok, done


# just for performance

#nsys profile --stats=true -t cuda python -m paddle.distributed.launch --log_dir dp2_pp1_mp4 --gpus "0,1,2,3,4,5,6,7" run_pretrain.py \
Collaborator:

delete

test_data_loader = test_data_loader()

for step, batch in enumerate(train_data_loader()):
global_step += 1
Collaborator:

If there is an accumulate_step, global_step here should only be incremented once every accumulate_step micro-steps.
Correspondingly, lr_scheduler.step() should only run when global_step changes.

Member Author:

Hmm, the current logic reads one local_batch_size per step and lets the framework split it into micro batches; this matches the static-graph implementation.

optimizer.step()

if lr_scheduler is not None:
lr_scheduler.step()
Collaborator:

As above, lr_scheduler should follow global_step and not be affected by accumulate_step.
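A short sketch of the stepping pattern the reviewer describes; names such as accumulate_steps and micro_step are illustrative, not the PR's code. The optimizer, lr_scheduler, and global_step advance once per accumulate_steps micro-batches.

```python
# Sketch only: gradient-accumulation-aware step counting.
global_step = 0
for micro_step, batch in enumerate(train_data_loader(), start=1):
    loss = model(*batch)                   # forward pass; assumes the model returns the loss
    (loss / accumulate_steps).backward()   # scale so accumulated grads average

    if micro_step % accumulate_steps == 0:
        optimizer.step()
        if lr_scheduler is not None:
            lr_scheduler.step()            # lr follows global_step, not micro steps
        optimizer.clear_grad()
        global_step += 1
```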


rm -rf dp2_pp2_mp2
export NCCL_DEBUG=INFO
#export NCCL_DEBUG_SUBSYS=ALL
Collaborator:

delete

Member Author:

done

@@ -0,0 +1,51 @@
export PYTHONPATH=$PYTHONPATH:../../../../
Collaborator:

delete

Member Author:

done

@ZHUI (Collaborator) left a comment:

LGTM

@ZeyuChen (Member) left a comment:

LGTM!

@ZeyuChen ZeyuChen merged commit 714ca2c into PaddlePaddle:develop Sep 9, 2021
@ForFishes ForFishes deleted the add_gpt3_in_dygraph branch September 9, 2021 05:23