[HybridParallel]Add gpt example using dygraph hybrid parallel #986
Conversation
Great work 👍🏻
# )

parser.add_argument(
    "--local_batch_size",
Why `local_batch_size`?
`global_batch_size` should equal `local_batch_size * dp_degree`. `micro_batch` means that, for pipeline performance in PP training, `local_batch_size` is split into multiple small batches: `local_batch_size = micro_batch * accumulate_step`, and `global_batch_size = local_batch_size * dp_degree`.
global_batch_size = micro_batch_size * accumulate_step * dp_degree (* sharding_degree)
Actually, if we set `global_batch_size` here, `accumulate_step` can be derived from it and accumulation done automatically. One benefit of setting `global_batch_size` is that resuming training becomes convenient: as long as `global_batch_size` is the same, the state saved at a given step is identical whether training on a single machine or on multiple machines.
Yes, agreed. Fixed: added `global_batch_size` and added checks on these relationships, so `global_batch_size` can now be specified without specifying `local_batch_size`.
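The relation discussed above can be sketched as a small check that derives `accumulate_step` from `global_batch_size` (a minimal sketch; the function and argument names are hypothetical illustrations, not the PR's actual code):

```python
def derive_accumulate_step(global_batch_size, micro_batch_size,
                           dp_degree, sharding_degree=1):
    # global_batch_size = micro_batch_size * accumulate_step
    #                     * dp_degree (* sharding_degree)
    denom = micro_batch_size * dp_degree * sharding_degree
    assert global_batch_size % denom == 0, (
        "global_batch_size must be divisible by "
        "micro_batch_size * dp_degree * sharding_degree")
    return global_batch_size // denom
```

For example, with `global_batch_size=256`, `micro_batch_size=8` and `dp_degree=2`, this yields `accumulate_step = 16`.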
num_samples_ = sample_idx.shape[0] - 1
shuffle_idx = _build_shuffle_idx(num_samples_, sample_idx.shape[0] - 1,
                                 np_rng)
if paddle.distributed.get_rank() % 8 == 0:
The `8` is hard-coded here. You can pass `local_rank` instead, as in #930.
OK, will fix it.
eos_id=eos_id,
seed=args.seed)

batch_sampler = paddle.io.DistributedBatchSampler(
If we don't use shuffle, we can use `from paddlenlp.utils.batch_sampler import DistributedBatchSampler` here, as in https://github.com/PaddlePaddle/PaddleNLP/pull/930/files. `paddlenlp.utils.batch_sampler` could save more memory.
done, thx
sample_idx_filename = _filename + '_sample_idx.npy'
shuffle_idx_filename = _filename + '_shuffle_idx.npy'

# support multi-machines
How about saving the seed and recovering it later? The multi-machine case may be hard for me to verify; could you try this seed-based approach first? I'm worried about breaking your changes later.
@@ -0,0 +1,57 @@
#wget https://paddlenlp.bj.bcebos.com/models/transformers/gpt/train.data.json_ids.npz
delete
done
#mkdir data
#mv train.data.json_ids.npz data

export DATA_DIR=./data
delete
done
"micro_batch_size": args.micro_batch_size
}

fleet.init(is_collective=True, strategy=strategy)
Could this big block of seed setup be wrapped into a helper utility function for reuse?
ok, done
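The seed setup under discussion could be factored roughly like this (a hypothetical sketch of the seed-derivation idea, not the actual helper that was committed; the offsets and names are assumptions):

```python
def derive_hybrid_seeds(basic_seed, dp_rank, mp_rank):
    # Data-parallel replicas may want distinct data-shuffling seeds,
    # while model-parallel ranks need distinct local (e.g. dropout) seeds.
    global_seed = basic_seed + dp_rank          # e.g. for data shuffling
    local_seed = basic_seed + 1024 + mp_rank    # e.g. for dropout
    return global_seed, local_seed
```

In the training script these derived seeds would then be fed to `paddle.seed`, `numpy.random.seed` and `random.seed` on each rank so runs are reproducible.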
args.output_dir, "train_log",
"{}_globalbsz_{}_amp_{}_recompute_{}_card_{}".format(
    args.model_name_or_path, default_global_batch_size, args.use_amp,
    False, worker_index).lower())
`worker_index = dp_rank` was already set earlier. For MP ranks belonging to the same DP group, won't this log be written twice?
Yes, I'll fix it.
model_to_save = model._layers
else:
    model_to_save = model
logger.info("Save model to %s" % output_dir)
What about saving in the MP case; is there a problem here? Each MP partition should probably be saved under a different name?
Yes, this needs fixing: save each partition to a different name/folder.
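One way to do this is to suffix the save path with the model-parallel rank so partitions don't overwrite each other (a minimal sketch; the directory naming scheme is an assumption for illustration, not the fix that was actually committed):

```python
import os

def mp_save_dir(output_dir, mp_rank):
    # Each model-parallel partition gets its own subfolder,
    # e.g. <output_dir>/mp_00, <output_dir>/mp_01, ...
    return os.path.join(output_dir, "mp_{:02d}".format(mp_rank))
```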
parser.add_argument(
    "--scale_loss",
    type=float,
    default=128,
Please change this to the default value actually used in practice.
ok, done
# just for performance

#nsys profile --stats=true -t cuda python -m paddle.distributed.launch --log_dir dp2_pp1_mp4 --gpus "0,1,2,3,4,5,6,7" run_pretrain.py \
delete
test_data_loader = test_data_loader()

for step, batch in enumerate(train_data_loader()):
    global_step += 1
If there is an `accumulate_step`, `global_step` here should only be incremented once per `accumulate_step` micro-batches. Correspondingly, `lr_scheduler.step()` should only be called when `global_step` changes.
Hmm, the current logic reads `local_batch_size` each time, and the framework then splits it into micro-batches of `micro_batch_size`; this matches the static-graph implementation.
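The `global_step` / `lr_scheduler` interaction being discussed can be sketched as follows (a generic illustration of gradient accumulation with hypothetical callback names, not the PR's actual loop):

```python
def train_loop(micro_batches, accumulate_step, run_micro_step, lr_step):
    global_step = 0
    for micro_step, batch in enumerate(micro_batches, start=1):
        run_micro_step(batch)  # forward/backward on one micro-batch
        if micro_step % accumulate_step == 0:
            global_step += 1   # one optimizer update per accumulate_step
            lr_step()          # LR schedule follows global_step only
    return global_step
```

With 8 micro-batches and `accumulate_step=4`, the scheduler steps twice, once per optimizer update, rather than once per micro-batch.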
optimizer.step()

if lr_scheduler is not None:
    lr_scheduler.step()
As above, `lr_scheduler` should follow `global_step` and not be affected by `accumulate_step`.
rm -rf dp2_pp2_mp2
export NCCL_DEBUG=INFO
#export NCCL_DEBUG_SUBSYS=ALL
delete
done
@@ -0,0 +1,51 @@
export PYTHONPATH=$PYTHONPATH:../../../../
delete
done
LGTM
LGTM!
PR types
Others
PR changes
Others
Description
[HybridParallel]Add gpt example using dygraph hybrid parallel
Adds a GPT-3 example using dygraph hybrid parallelism. Accuracy is aligned with single-card training.
Performance data with global batch size = 256 on 8 machines (64 cards):