support sublayer and layer share weight with SharedLayerDesc #79336

Open: AlAuAu wants to merge 1 commit into PaddlePaddle:develop from AlAuAu:support_mtp_reuse

AlAuAu (Contributor) commented Jun 18, 2026

PR Category

Distributed Strategy

PR Types

New features

Description

Support weight sharing between a sublayer and a layer with SharedLayerDesc.

Does this cause accuracy changes (是否引起精度变化)

PaddlePaddle-bot left a comment

🤖 Paddle-CI-Agent | pr_review | 2026-06-18 23:38:20

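📋 Review Summary

PR overview: Adds the ability for SharedLayerDesc to share weights between a submodule and a layer, and adds a pipeline shared-weight test.
Change scope: python/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py, test/collective/fleet/
Impact tags: [Distributed Strategy] [Communication Library]

Issues

| Severity | File | Summary |
| --- | --- | --- |
| 🔴 Bug | python/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py:1084 | When shared_weight_attr returns named_parameters(), the color-marking logic for PP sync_param/sync_moment was not updated to match; enabling either sync option makes model construction fail |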

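📝 PR Convention Check

The title is missing an official tag; the description structure and the accuracy-change field conform to the template.

Suggested title (can be copied directly):

  • [Distributed Strategy] support sublayer and layer share weight with SharedLayerDesc
Suggested PR description (can be copied directly):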
### PR Category
Distributed Strategy

### PR Types
New features

### Description
Support sublayer and layer weight sharing with `SharedLayerDesc`: add `shared_submodule_weight_only`, allow `shared_weight_attr` to expose `named_parameters()` for a shared submodule, and add a collective pipeline test for the aliasing path.

### 是否引起精度变化

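Overall Assessment

The direction of the change fits the Distributed Strategy scenario, but the current implementation adapts only the broadcast/allreduce paths to iterable parameters. The color-marking logic used for PP parameter and optimizer-state synchronization was not adapted along with them. With the existing sync_param/sync_moment options enabled, this directly blocks use of the new feature, so a fix is recommended before merging.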

if isinstance(obj, paddle.Tensor):
    obj.is_firstly_shared = True
else:
    for _, param in obj:


🔴 Bug: shared_submodule_weight_only allows shared_weight_attr to return named_parameters(), but the PP parameter/optimizer-state synchronization path still treats that attribute as a single Tensor.

Here a non-Tensor obj is handled as an iterable of (name, param) pairs, but the same weight_attr also reaches the sync_param/sync_moment branch of _construct_shared_comm(): shared_param = getattr(..., weight_attr); shared_param.color = .... Layer.named_parameters() returns a generator, and a generator cannot have color set on it. As soon as a user enables strategy.hybrid_configs["pp_configs"].sync_param = True or sync_moment = True, as the existing shared-weight test cases do, model construction raises AttributeError, and the subsequent HybridParallelOptimizer cannot obtain p.color["broadcast_group"] either.
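
The failure mode described above can be reproduced without Paddle. This is a minimal sketch, not the code in pp_layers.py: `Param`, `Layer`, and `try_set_color` are hypothetical stand-ins, and the only assumption carried over from the review is that `named_parameters()` returns a generator.

```python
class Param:
    """Hypothetical stand-in for a parameter tensor; accepts arbitrary attributes."""


class Layer:
    """Hypothetical stand-in for a layer that owns a shared submodule's parameters."""

    def __init__(self):
        self._params = {"weight": Param(), "bias": Param()}

    def named_parameters(self):
        # A generator function: calling it returns a generator object.
        for name, param in self._params.items():
            yield name, param


def try_set_color(obj):
    """Mimic `shared_param.color = ...` and report whether the assignment worked."""
    try:
        obj.color = {"broadcast_group": "group-0"}  # illustrative value only
        return True
    except AttributeError:
        return False


layer = Layer()
single_param_ok = try_set_color(layer._params["weight"])  # plain object: works
generator_ok = try_set_color(layer.named_parameters())    # generator: rejected
print(single_param_ok, generator_ok)
```

Generator objects have no instance `__dict__`, so the attribute assignment is rejected before any parameter is visited.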

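Suggested fix:
In _construct_shared_comm(), factor the color-setting logic into a parameter-iteration helper consistent with the one used here: a Tensor is handled as a single param, named_parameters() is handled param by param, and each parameter gets a shared_weight_name that contains a unique parameter name (for example f"{weight_attr}.{name}"). Also extend the new test to cover the shared_submodule_weight_only scenario with sync_param=True and sync_moment=True.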
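
One possible shape for such a helper is sketched below in plain Python so that it runs without Paddle. Everything here is illustrative: `Param` stands in for `paddle.Tensor`, `iter_shared_params` and `mark_color` are hypothetical names, and the keys in the `color` dict are assumptions modelled on the names mentioned in this review rather than the actual structure used by pp_layers.py.

```python
class Param:
    """Hypothetical stand-in for paddle.Tensor so the sketch runs without Paddle."""


def iter_shared_params(obj, weight_attr):
    """Yield (shared_weight_name, param) for a single param or (name, param) pairs."""
    if isinstance(obj, Param):
        yield weight_attr, obj
    else:
        for name, param in obj:
            # Prefix with the attribute name so every parameter gets a unique name.
            yield f"{weight_attr}.{name}", param


def mark_color(obj, weight_attr, group):
    """Set `color` on every parameter and return the names that were marked."""
    marked = []
    for shared_weight_name, param in iter_shared_params(obj, weight_attr):
        # The dict keys are assumptions, not the real pp_layers.py layout.
        param.color = {
            "shared_weight_name": shared_weight_name,
            "broadcast_group": group,
        }
        marked.append(shared_weight_name)
    return marked


def named_parameters(params):
    """Generator of (name, param) pairs, like the iterable case in the review."""
    yield from params.items()


single = Param()
sub = {"weight": Param(), "bias": Param()}
single_names = mark_color(single, "embedding_weight", group="group-0")
sub_names = mark_color(named_parameters(sub), "shared_head", group="group-0")
print(single_names, sub_names)
```

Routing both cases through one iterator keeps the color-marking path and the broadcast/allreduce path in agreement about which objects count as shared parameters.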

PaddlePaddle-bot commented Jun 21, 2026

🤖 Paddle-CI-Agent | ci_status_monitor | 2026-06-24 00:45:23 UTC+08:00

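CI report generated from the following code (updated every 30 minutes):
PR commit: b0840e60 | Merge base: 5d803b0 (branch: develop)


1 Required jobs: 45/48 passed

| Total runs (reruns) | Total jobs | ✅ Passed | ❌ Failed | ⏳ Running | ⏸️ Waiting | Skipped |
| --- | --- | --- | --- | --- | --- | --- |
| 144 (65) | 79 | 76 | 3 | 0 | 0 | 0 |

| Job | Error type | Confidence | Log |
| --- | --- | --- | --- |
| Coverage test | PR issue | High | Job |
| Check | Environment issue | High | Job |
| Fleet Unit test (single card) | Flaky issue | Medium | Job |

2 Failure details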

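🔴 Coverage test: PR issue (confidence: high)

Error type: PR issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed cases: merged by root-cause cluster

| Case | Error summary |
| --- | --- |
| Coverage diff | Python Diff Coverage is 73.2%, below the 90% threshold |

Key log: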

Summary coverage rate:
  lines......: 78.6% (526 of 669 lines)
Assert Python Diff Coverage
expected >= 90.0 %, actual 73.2 %, failed
Coverage check failed, unit tests have all passed, please do not rerun
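  • Root cause summary: insufficient coverage of the new pp_layers.py branches
    The log shows that coverage diff extracted only /paddle/python/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py, with a Python diff coverage of 73.2%, below the 90% threshold. This PR adds shared_submodule_weight_only and modifies the shared-weight synchronization, gradient allreduce, alias, and build branches; the newly added tests do not cover enough of the new lines.

Suggested fixes:

  1. Add unit tests for python/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py that cover shared_submodule_weight_only=True, the _alias_shared_layer success path, the missing and shape-mismatch assert paths, and the broadcast/allreduce branches taken when shared_weight_attr returns named_parameters().
  2. Focus on pp_layers.py:114-121, pp_layers.py:741-819, pp_layers.py:984-1026, and pp_layers.py:1075-1091. The existing new test case mainly verifies that the aliased parameters are the same object, which is not enough to satisfy diff coverage.

Related changes: python/paddle/distributed/fleet/meta_parallel/parallel_layers/pp_layers.py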

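🔴 Check: environment issue (confidence: high)

Error type: environment issue | Confidence: high
Analyzer: generic analysis (fallback)
Failed cases: merged by root-cause cluster

| Case | Error summary |
| --- | --- |
| tools/CheckPRTemplate.py | httpx/httpcore ReadTimeout while reading the PR information |

Key log: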

httpcore.ReadTimeout: The read operation timed out
File "tools/CheckPRTemplate.py", line 325, in get_a_pull
    response = httpx.request(
httpx.ReadTimeout: The read operation timed out
##[error]Process completed with exit code 1.
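  • Root cause summary: the PR template check timed out on its request to GitHub
    The failure occurred when CheckPRTemplate.py called httpx.request to fetch the PR information, before the script reached template content validation. The log shows no PR template field validation failure, and the failure is unrelated to this PR's code changes.

Suggested fixes:

  1. Environment issue; please rerun. If it recurs repeatedly, the CI side could add retries or a longer timeout to the GitHub request in tools/CheckPRTemplate.py.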
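
The retry suggestion could look roughly like the sketch below. It uses only the standard library so that it runs anywhere; `call_with_retry` and its parameters are hypothetical and do not come from tools/CheckPRTemplate.py. In the real script, `request_fn` would presumably wrap the existing `httpx.request(...)` call, and `retry_on` would be `httpx.ReadTimeout`, the exception shown in the log.

```python
import time


def call_with_retry(request_fn, attempts=3, base_delay=1.0,
                    retry_on=(TimeoutError,), sleep=time.sleep):
    """Call request_fn(); on a retryable error, back off exponentially and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return request_fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))


# Demo with a fake request that times out twice and then succeeds.
calls = {"n": 0}


def flaky_request():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("simulated read timeout")
    return {"status": 200}


delays = []  # record the back-off delays instead of actually sleeping
result = call_with_retry(flaky_request, base_delay=0.5, sleep=delays.append)
print(result, delays)
```

Injecting `sleep` keeps the demo instant and makes the back-off schedule easy to inspect; a production version would leave the default `time.sleep` in place.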

Related changes: none directly related

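🔴 Fleet Unit test (single card): flaky issue (confidence: medium)

Error type: flaky issue | Confidence: medium
Analyzer: generic analysis (fallback)
Failed cases: merged by root-cause cluster

| Case | Error summary |
| --- | --- |
| tests/single_card_tests/test_autocudagraph.py::TestEndToEndPerformance::test_resnext50_accuracy_and_speed | CUDAGraph (26.03s) is slower than Eager (22.73s) |

Key log: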

[Performance Benchmark] ResNeXt50 (1000 steps)
Eager Time:      22.7319 s
CUDAGraph Time:  26.0317 s
Speedup Ratio:   0.87x
AssertionError: 26.0317077729851 not less than 22.731850761920214 : Performance Regression! CUDAGraph (26.03s) is slower than Eager (22.73s).
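  • Root cause summary: fluctuation in the CUDAGraph performance benchmark
    The failure is an end-to-end performance assertion that requires the CUDAGraph time to be less than the Eager time; in this run CUDAGraph was instead about 14.5% slower. The PR's changes are concentrated in Fleet pipeline shared weights and the new collective/fleet tests and do not touch the autocudagraph path, so the current evidence is more consistent with CI GPU load or benchmark fluctuation.

Suggested fixes:

  1. Known flaky/performance-fluctuation issue; please rerun. If it recurs repeatedly, hand it over to the CUDAGraph maintainers to investigate a performance regression.

Related changes: none found that are directly related to the files modified in this PR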

@codecov-commenter


Codecov Report

❌ Patch coverage is 73.17073% with 22 lines in your changes missing coverage. Please review.
⚠️ Please upload report for BASE (develop@5d803b0). Learn more about missing BASE report.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| ...d/fleet/meta_parallel/parallel_layers/pp_layers.py | 73.17% | 22 Missing ⚠️ |

❌ Your patch status has failed because the patch coverage (73.17%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files
@@            Coverage Diff             @@
##             develop   #79336   +/-   ##
==========================================
  Coverage           ?   73.17%           
==========================================
  Files              ?        1           
  Lines              ?       82           
  Branches           ?        0           
==========================================
  Hits               ?       60           
  Misses             ?       22           
  Partials           ?        0           

