[HybridParallel]Add gpt example using dygraph hybrid parallel #986
Conversation
Great work 👍🏻
# )

parser.add_argument(
    "--local_batch_size",
Why `local_batch_size`?
`global_batch_size` should equal `local_batch_size * dp_degree`. `micro_batch` means that, for pipeline performance in PP training, `local_batch_size` is split into multiple small batches: `local_batch_size = micro_batch * accumulate_step`, and `global_batch_size = local_batch_size * dp_degree`.
global_batch_size = micro_batch_size * accumulate_step * dp_degree (* sharding_degree)
Actually, if we set `global_batch_size` here, `accumulate_step` can be derived from it and accumulation done automatically. One benefit of setting `global_batch_size` is that resuming training becomes convenient: as long as `global_batch_size` is the same, the state saved at a given step is identical whether training on a single machine or on multiple machines.
Yes, agreed. Fixed: added `global_batch_size` and added checks on these relationships, so `global_batch_size` can now be specified without specifying `local_batch_size`.
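The relation discussed above can be sketched as a small check that derives `accumulate_step` from `global_batch_size` (a minimal sketch; the function and argument names are hypothetical illustrations, not the PR's actual code):

```python
def derive_accumulate_step(global_batch_size, micro_batch_size,
                           dp_degree, sharding_degree=1):
    # global_batch_size = micro_batch_size * accumulate_step
    #                     * dp_degree (* sharding_degree)
    denom = micro_batch_size * dp_degree * sharding_degree
    assert global_batch_size % denom == 0, (
        "global_batch_size must be divisible by "
        "micro_batch_size * dp_degree * sharding_degree")
    return global_batch_size // denom
```

For example, with `global_batch_size=256`, `micro_batch_size=8` and `dp_degree=2`, this yields `accumulate_step = 16`.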
num_samples_ = sample_idx.shape[0] - 1
shuffle_idx = _build_shuffle_idx(num_samples_, sample_idx.shape[0] - 1,
                                 np_rng)
if paddle.distributed.get_rank() % 8 == 0:
The `8` is hard-coded here. You can pass `local_rank` instead, as in #930.
OK, will fix it.
eos_id=eos_id,
seed=args.seed)

batch_sampler = paddle.io.DistributedBatchSampler(
If we don't use shuffle, we can use `from paddlenlp.utils.batch_sampler import DistributedBatchSampler` here, as in https://github.com/PaddlePaddle/PaddleNLP/pull/930/files. `paddlenlp.utils.batch_sampler` could save more memory.
done, thx
sample_idx_filename = _filename + '_sample_idx.npy'
shuffle_idx_filename = _filename + '_shuffle_idx.npy'

# support multi-machines
How about saving the seed and recovering it later? The multi-machine case may be hard for me to verify; could you try this seed-based approach first? I'm worried about breaking your changes later.
@@ -0,0 +1,57 @@
#wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/train.data.json_ids.npz
delete
done
#mkdir data
#mv train.data.json_ids.npz data

export DATA_DIR=./data
delete
done
"micro_batch_size": args.micro_batch_size
}

fleet.init(is_collective=True, strategy=strategy)
Could this big block of seed setup be wrapped into a helper utility function for reuse?
ok, done
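The seed setup under discussion could be factored roughly like this (a hypothetical sketch of the seed-derivation idea, not the actual helper that was committed; the offsets and names are assumptions):

```python
def derive_hybrid_seeds(basic_seed, dp_rank, mp_rank):
    # Data-parallel replicas may want distinct data-shuffling seeds,
    # while model-parallel ranks need distinct local (e.g. dropout) seeds.
    global_seed = basic_seed + dp_rank          # e.g. for data shuffling
    local_seed = basic_seed + 1024 + mp_rank    # e.g. for dropout
    return global_seed, local_seed
```

In the training script these derived seeds would then be fed to `paddle.seed`, `numpy.random.seed` and `random.seed` on each rank so runs are reproducible.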
args.output_dir, "train_log",
"{}_globalbsz_{}_amp_{}_recompute_{}_card_{}".format(
    args.model_name_or_path, default_global_batch_size, args.use_amp,
    False, worker_index).lower())
`worker_index = dp_rank` was already set earlier. For MP ranks belonging to the same DP group, won't this log be written twice?
Yes, I'll fix it.
model_to_save = model._layers
else:
    model_to_save = model
logger.info("Save model to %s" % output_dir)
What about saving in the MP case; is there a problem here? Each MP partition should probably be saved under a different name?
Yes, this needs fixing: save each partition to a different name/folder.
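One way to do this is to suffix the save path with the model-parallel rank so partitions don't overwrite each other (a minimal sketch; the directory naming scheme is an assumption for illustration, not the fix that was actually committed):

```python
import os

def mp_save_dir(output_dir, mp_rank):
    # Each model-parallel partition gets its own subfolder,
    # e.g. <output_dir>/mp_00, <output_dir>/mp_01, ...
    return os.path.join(output_dir, "mp_{:02d}".format(mp_rank))
```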
parser.add_argument(
    "--scale_loss",
    type=float,
    default=128,
Please change this to the default value actually used in practice.
ok, done
# just for performance

#nsys profile --stats=true -t cuda python -m paddle.distributed.launch --log_dir dp2_pp1_mp4 --gpus "0,1,2,3,4,5,6,7" run_pretrain.py \
delete
test_data_loader = test_data_loader()

for step, batch in enumerate(train_data_loader()):
    global_step += 1
If there is an `accumulate_step`, `global_step` here should only be incremented once per `accumulate_step` micro-batches. Correspondingly, `lr_scheduler.step()` should only be called when `global_step` changes.
Hmm, the current logic reads `local_batch_size` each time, and the framework then splits it into micro-batches of `micro_batch_size`; this matches the static-graph implementation.
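The `global_step` / `lr_scheduler` interaction being discussed can be sketched as follows (a generic illustration of gradient accumulation with hypothetical callback names, not the PR's actual loop):

```python
def train_loop(micro_batches, accumulate_step, run_micro_step, lr_step):
    global_step = 0
    for micro_step, batch in enumerate(micro_batches, start=1):
        run_micro_step(batch)  # forward/backward on one micro-batch
        if micro_step % accumulate_step == 0:
            global_step += 1   # one optimizer update per accumulate_step
            lr_step()          # LR schedule follows global_step only
    return global_step
```

With 8 micro-batches and `accumulate_step=4`, the scheduler steps twice, once per optimizer update, rather than once per micro-batch.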
optimizer.step()

if lr_scheduler is not None:
    lr_scheduler.step()
As above, `lr_scheduler` should follow `global_step` and not be affected by `accumulate_step`.
rm -rf dp2_pp2_mp2
export NCCL_DEBUG=INFO
#export NCCL_DEBUG_SUBSYS=ALL
delete
done
@@ -0,0 +1,51 @@
export PYTHONPATH=$PYTHONPATH:../../../../
delete
done
LGTM
LGTM!
PR types
Others
PR changes
Others
Description
[HybridParallel]Add gpt example using dygraph hybrid parallel
Adds a GPT-3 example using dygraph hybrid parallelism. Accuracy is aligned with single-card training.
Performance data with global batch size = 256 on 8 machines (64 cards):