
[CUDAGraph] GPT3-175B Pipeline Parallel Training with CUDAGraph using PipelineParallelMicroStepCallback #65634


Merged: 5 commits merged into PaddlePaddle:develop on Jul 29, 2024

Conversation

eee4017
Contributor

@eee4017 eee4017 commented Jul 2, 2024

PR Category

Distributed Strategy

PR Types

New features

Description

This PR introduces enhancements and fixes that improve the functionality and debuggability of pipeline parallel training in PaddlePaddle. The primary addition is the PipelineParallelMicroStepCallback, which enables registering hooks at micro-step boundaries within pipeline parallel execution. This update is required to support CUDA Graph pipeline parallel training and includes several other improvements.

Key Features

  1. PipelineParallelMicroStepCallback:

    • This new feature facilitates enhanced management of hooks within pipeline parallel processes.
    • It allows for registering callbacks at specific pipeline stages: forward_begin, forward_end, backward_begin, and backward_end.
    • This functionality is particularly important for PipelineParallel, where layers are divided into multiple chunks.
    • The addition supports various tasks, such as logging, monitoring, and dynamic parameter adjustments during pipeline execution.
    • This feature aligns with PaddlePaddle's design principles and addresses the specific needs of pipeline parallelism.
  2. Support for CUDA Graph Pipeline Parallel Training:

    • The update is essential for enabling efficient pipeline parallel training with CUDA Graph.
    • It allows training of large models, such as GPT-3 175B, on 64 H100 GPUs using hybrid parallelism (Pipeline Parallel + Tensor Parallel + Sequence Parallel), achieving a 1.18x speedup in training performance.
  3. Worker Log Adjustment:

    • Updated worker log filenames to use the global trainer rank instead of the node-local device index. This ensures that worker logs from different nodes no longer collide under the same filename, which aids debugging and clarity.
  4. Debug Tools and Fixes:

    • Added several debug tools and fixed issues in the CUDA graphed layer, enhancing the overall debugging experience and reliability of CUDA Graph pipeline training.
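The hook stages described in item 1 can be sketched as a small registry. This is an illustrative sketch only, not the actual PaddlePaddle API; the class and method names here (`MicroStepCallbackRegistry`, `register`, `on`) are assumptions for demonstration.

```python
# Minimal sketch of a micro-step hook registry mirroring the four stages
# named above: forward_begin, forward_end, backward_begin, backward_end.
# Names and interface are illustrative, not the real PaddlePaddle API.
from collections import defaultdict

STAGES = ("forward_begin", "forward_end", "backward_begin", "backward_end")

class MicroStepCallbackRegistry:
    def __init__(self):
        self._hooks = defaultdict(list)

    def register(self, stage, fn):
        # Only the four documented micro-step stages are accepted.
        if stage not in STAGES:
            raise ValueError(f"unknown stage: {stage}")
        self._hooks[stage].append(fn)

    def on(self, stage, **kwargs):
        # Called by the pipeline runner at each micro-step boundary;
        # useful for logging, monitoring, or parameter adjustments.
        for fn in self._hooks[stage]:
            fn(**kwargs)

# Example: record forward events for chunk 0 across two micro-batches.
events = []
registry = MicroStepCallbackRegistry()
registry.register("forward_begin", lambda step, chunk: events.append(("fb", step, chunk)))
registry.register("forward_end", lambda step, chunk: events.append(("fe", step, chunk)))

for step in range(2):
    registry.on("forward_begin", step=step, chunk=0)
    registry.on("forward_end", step=step, chunk=0)
```

Because PipelineParallel splits layers into multiple chunks, passing the chunk index to each hook (as above) lets a single callback distinguish which stage of the model is executing.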

Minor Fixes

  • Adjusted the worker log to rank-based logging.
  • Improved debug tools and fixed issues in the CUDA graphed layer.

For more information, please check NVIDIA/TransformerEngine#957 and #65092.


paddle-bot bot commented Jul 2, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI firstly. See Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Jul 2, 2024
@jeng1220 jeng1220 added the NVIDIA label Jul 2, 2024
@eee4017 eee4017 force-pushed the cudagraph_175b_github_submit branch from bd6366e to ed69d5f Compare July 4, 2024 04:39
@jeng1220
Collaborator

You must have one RD (phlrain or luotao1 or Aurelius84) approval


paddle-ci-bot bot commented Jul 12, 2024

Sorry to inform you that ed69d5f's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@eee4017 eee4017 force-pushed the cudagraph_175b_github_submit branch 2 times, most recently from 1fe672f to 434c7eb Compare July 15, 2024 05:34
@@ -111,7 +111,7 @@ def _build_pod_with_args(self):
             "POD_IP": self.ctx.node.ip,
         }
         e.update(_gloo_envs)
-        log_file = f"workerlog.{i}"
+        log_file = f"workerlog.{i + trainer_rank_offset}"
Contributor


Does this log naming format have to change?
Some cluster-side log monitoring and analysis programs depend on the log filename ending in [0–7]. If this needs to be updated, both sides must be aligned at the same time.

Contributor Author

@eee4017 eee4017 Jul 18, 2024


With multiple nodes, these logs pile on top of each other and become unreadable when debugging, so I'd like to change this numbering. Single-node behavior should be the same as before and is unaffected.

Contributor Author

@eee4017 eee4017 Jul 18, 2024


The CI-coverage check does not cover enough here; much of the uncovered code seems to be in this workerlog part, which apparently was never tested in the first place.

Collaborator


@tianshuo78520a,
is the CI-coverage shortfall mentioned above also something you would handle?

Collaborator


@tianshuo78520a, is the CI-coverage shortfall mentioned above also something you would handle?

Already handled.

Member


With multiple nodes, these logs pile on top of each other and become unreadable when debugging, so I'd like to change this numbering. Single-node behavior should be the same as before and is unaffected.

What does "multiple nodes" mean here? Does this change alter the log-saving behavior?

Contributor Author

@eee4017 eee4017 Jul 25, 2024


"Multiple nodes" means multiple machines. With multiple machines and multiple cards, device 0 on every machine gets the same log number, so the logs from different machines pile on top of each other.
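The collision described above can be sketched in a few lines. This is an illustrative sketch only; the assumption that `trainer_rank_offset` equals `node_index * gpus_per_node` is mine, for demonstration, and `worker_logs` is a hypothetical helper, not code from the PR.

```python
# Sketch of the filename change: with node-local indices every node emits
# workerlog.0 .. workerlog.{k-1}, so names collide across nodes; adding a
# per-node rank offset (assumed here to be node_index * gpus_per_node)
# makes the names globally unique.
def worker_logs(num_nodes, gpus_per_node, rank_based):
    names = []
    for node in range(num_nodes):
        offset = node * gpus_per_node if rank_based else 0
        for i in range(gpus_per_node):
            names.append(f"workerlog.{i + offset}")
    return names

old = worker_logs(2, 8, rank_based=False)  # 16 files, only 8 distinct names
new = worker_logs(2, 8, rank_based=True)   # 16 files, 16 distinct names
```

Note that with a single node the offset is zero, so single-node log names are unchanged, which matches the author's claim that single-node behavior is unaffected (though cluster-side tooling that expects suffixes 0–7 would still need alignment for multi-node runs).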


paddle-ci-bot bot commented Jul 23, 2024

Sorry to inform you that 434c7eb's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.

@eee4017 eee4017 force-pushed the cudagraph_175b_github_submit branch from 434c7eb to 45b9ed9 Compare July 24, 2024 15:56
@jeng1220
Collaborator

@sneaxiy , @JZ-LIANG , @ForFishes , @tianshuo78520a
All tests have passed. Can this PR be merged now?

Member

@ForFishes ForFishes left a comment


LGTM

@ForFishes ForFishes merged commit c430ee4 into PaddlePaddle:develop Jul 29, 2024
31 checks passed
lixcli pushed a commit to lixcli/Paddle that referenced this pull request Aug 5, 2024
… PipelineParallelMicroStepCallback (PaddlePaddle#65634)

* CUDAGraph: PP hook and workerlog.rank

* fix header

* change logging.info to print

* fix pp hook

* fix logger

---------

Co-authored-by: Frank Lin (Engrg-Hardware 1) <fralin@nvidia.com>
Labels
contributor External developers NVIDIA

5 participants