[CUDAGraph] GPT3-175B Pipeline Parallel Training with CUDAGraph using PipelineParallelMicroStepCallback #65634
Conversation
Your PR was submitted successfully. Thank you for contributing to the open-source project!
Force-pushed from bd6366e to ed69d5f
You must have one RD (phlrain or luotao1 or Aurelius84) approval
Sorry to inform you that ed69d5f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Force-pushed from 1fe672f to 434c7eb
```diff
@@ -111,7 +111,7 @@ def _build_pod_with_args(self):
                 "POD_IP": self.ctx.node.ip,
             }
             e.update(_gloo_envs)
-            log_file = f"workerlog.{i}"
+            log_file = f"workerlog.{i + trainer_rank_offset}"
```
Does this log naming format have to change?
Some log monitoring and analysis programs on the cluster side depend on the log names ending in [0 ~ 7]; if this part needs updating, both sides have to be aligned at the same time.
With multiple nodes, these logs get piled on top of one another, which makes debugging unreadable, so I'd like to change this numbering. Single-node behavior should be the same as before, so it is unaffected.
The CI-coverage here isn't sufficient; quite a lot of the uncovered code seems to be in this workerlog part, which probably was never tested to begin with.
@tianshuo78520a, is the insufficient CI-coverage issue above also something you handle?
> @tianshuo78520a, is the insufficient CI-coverage issue above also something you handle?

Already handled.
> With multiple nodes, these logs get piled on top of one another, which makes debugging unreadable, so I'd like to change this numbering. Single-node behavior should be the same as before, so it is unaffected.

How should "multi-node" be understood here? Does this change alter how logs are saved?
Multi-node means multiple machines. With multiple machines, each with multiple cards, device 0 on every machine gets the same log number, so the logs from the different machines pile on top of one another.
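A minimal sketch of the resulting naming scheme, assuming trainer_rank_offset is node_rank times the per-node process count (the actual launcher may compute the offset differently):

```python
# Illustration only: assumes trainer_rank_offset = node_rank * nprocs_per_node;
# the real launcher may derive the offset differently.
nnodes, nprocs_per_node = 2, 8

for node_rank in range(nnodes):
    trainer_rank_offset = node_rank * nprocs_per_node
    for i in range(nprocs_per_node):
        old_name = f"workerlog.{i}"                        # same on every node
        new_name = f"workerlog.{i + trainer_rank_offset}"  # globally unique
        print(f"node {node_rank}, device {i}: {old_name} -> {new_name}")
```

On node 0 the offset is 0, so single-node names stay workerlog.0 through workerlog.7, matching the behavior described above.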
Sorry to inform you that 434c7eb's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
Force-pushed from 434c7eb to 45b9ed9
@sneaxiy, @JZ-LIANG, @ForFishes, @tianshuo78520a
LGTM
… PipelineParallelMicroStepCallback (PaddlePaddle#65634)
* CUDAGraph: PP hook and workerlog.rank
* fix header
* change logging.info to print
* fix pp hook
* fix logger
---------
Co-authored-by: Frank Lin (Engrg-Hardware 1) <fralin@nvidia.com>
PR Category
Distributed Strategy
PR Types
New features
Description
This PR introduces enhancements and fixes that improve the functionality and debugging capabilities of pipeline parallel training in PaddlePaddle. The primary addition is the PipelineParallelMicroStepCallback, which allows for better management of hooks within pipeline parallel processes. This update is crucial for supporting CUDA Graph pipeline parallel training and includes several other improvements.
Key Features
PipelineParallelMicroStepCallback:
- Registers user hooks at four micro-step points: forward_begin, forward_end, backward_begin, and backward_end (a usage sketch follows this list).
- Works with PipelineParallel, including the interleaved case where layers are divided into multiple chunks.

Support for CUDA Graph Pipeline Parallel Training:
- The micro-step hooks provide the integration points needed to drive CUDA Graph capture and replay during pipeline parallel training.

Worker Log Adjustment:
- Worker log files are now suffixed with the global trainer rank (workerlog.{i + trainer_rank_offset}) instead of the node-local device index, so logs from different nodes no longer collide; single-node naming is unchanged.

Debug Tools and Fixes:

Minor Fixes:
- See the squashed commit message above: header fix, logging.info changed to print, pipeline-parallel hook and logger fixes.
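The sketch below illustrates the micro-step callback pattern described above. It is a toy illustration, not the actual Paddle API: the class and method names here are invented, and only the four hook-point names come from this PR's description.

```python
# Toy illustration of a micro-step hook registry (NOT the Paddle API).
# Only the four hook-point names come from the PR description.
from collections import defaultdict

class MicroStepCallback:
    HOOK_POINTS = ("forward_begin", "forward_end", "backward_begin", "backward_end")

    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, point, fn):
        # Attach a user hook to one of the four micro-step points.
        if point not in self.HOOK_POINTS:
            raise ValueError(f"unknown hook point: {point}")
        self._hooks[point].append(fn)

    def notify(self, point, micro_step, chunk_id=None):
        # The pipeline runner would call this at the matching place in each
        # micro-step, e.g. to capture or replay a CUDA Graph.
        for fn in self._hooks[point]:
            fn(micro_step=micro_step, chunk_id=chunk_id)

cb = MicroStepCallback()
cb.register(
    "forward_begin",
    lambda micro_step, chunk_id: print(f"forward_begin: step={micro_step}, chunk={chunk_id}"),
)
cb.notify("forward_begin", micro_step=0, chunk_id=0)
```

A pipeline runner invoking such hooks at the same point of every micro-step (and, in the interleaved case, per chunk) is what makes per-micro-step CUDA Graph capture and replay possible.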
For more information, please check NVIDIA/TransformerEngine#957
See also #65092.