[AutoParallel] support sharding tensor-fusion save&load #69823


Merged: 13 commits, Dec 12, 2024

Conversation

AndSonder
Contributor

@AndSonder AndSonder commented Nov 29, 2024

PR Category

Auto Parallel

PR Types

New features

Description

Support the save&load strategy for the sharding tensor-fusion scenario.

Overview of the tensor-fusion save&load adaptation

1. Background

To reduce communication overhead, tensor fusion merges multiple small tensors into one larger tensor. Tensors that would otherwise be sharded independently are thus merged into a single whole, and when this fused tensor is distributed across devices, individual parameters may end up unevenly sliced. Save&load therefore needs to be adapted after tensor fusion.
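As a toy illustration of this uneven slicing (hypothetical sizes, plain NumPy rather than Paddle): evenly splitting a fused buffer across devices cuts individual parameters at arbitrary offsets.

```python
import numpy as np

# Two parameters of sizes 5 and 3, fused into a single flat buffer of size 8.
p0 = np.arange(5)
p1 = np.arange(5, 8)
fused = np.concatenate([p0, p1])

# Sharding across 2 devices splits the fused buffer evenly (4 + 4):
# device 0 holds all but the last element of p0; device 1 holds p0's last
# element plus all of p1 -- neither parameter is sliced evenly by itself.
shard0, shard1 = fused[:4], fused[4:]
print(shard0.tolist())  # [0, 1, 2, 3]
print(shard1.tolist())  # [4, 5, 6, 7]
```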

2. Design

The main idea is to process the parameters at save and load time. On save, inside the state_dict function, the unevenly sliced optimizer parameters are gathered (all_gather) back into a global-view tensor according to the grouping information, and each card then keeps its own portion according to the sharding axis. In other words, state_dict ends up in the same state as when tensor-fusion is disabled.

On load, since the saved parameters are evenly sliced, we need to convert them back to the unevenly sliced state.

2.1 Saving the optimizer state

When saving the optimizer state, the optimizer parameters may already be sliced across multiple devices. To keep the saved state identical to the non-tensor-fusion case, the optimizer parameters on each device are processed as follows:

  • Gather the sliced parameters: for unevenly sliced parameters, collect each device's portion back into the global view via distributed communication (e.g. all_gather).
  • Re-slice per device: from the gathered tensor, extract the portion each device should hold according to the sharding axis, so that the saved state is consistent across devices.
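The save-time steps above can be sketched as follows. This is a minimal NumPy sketch, not Paddle's actual API: `normalize_for_save` and its arguments are hypothetical names, and a plain concatenation of the per-rank slices stands in for a real `all_gather`.

```python
import numpy as np

def normalize_for_save(local_uneven_slices, rank, world_size, shard_axis=0):
    """Rebuild the global tensor from uneven slices, then keep the even
    slice this rank would own if tensor fusion were disabled."""
    # Stand-in for all_gather: assume the list already holds every
    # rank's uneven portion, in rank order.
    global_tensor = np.concatenate(local_uneven_slices, axis=shard_axis)
    # Re-slice evenly along the sharding axis.
    even_parts = np.array_split(global_tensor, world_size, axis=shard_axis)
    return even_parts[rank]

# Example: two ranks hold uneven pieces of sizes 3 and 5 of an
# 8-element tensor; after normalization each rank saves an even half.
uneven = [np.arange(3), np.arange(3, 8)]
print(normalize_for_save(uneven, rank=0, world_size=2).tolist())  # [0, 1, 2, 3]
print(normalize_for_save(uneven, rank=1, world_size=2).tolist())  # [4, 5, 6, 7]
```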

2.2 Loading the optimizer state

On load, the saved parameters are in an evenly sliced state, so parameters that were unevenly sliced before saving must be restored to their original uneven slicing. The steps are:

  • Re-slice the parameters: using the sharding axis information, restore the evenly sliced saved state to the uneven slicing each device uses under tensor fusion.
  • Keep device slicing consistent: each device loads only the portion it needs.
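The load-time inverse can be sketched the same way (again hypothetical names, not Paddle's API): rebuild the global tensor from the saved even slices, then cut out the uneven range this rank holds under tensor fusion.

```python
import numpy as np

def restore_uneven(even_slices, uneven_offsets, rank, shard_axis=0):
    """Rebuild the global tensor from even slices, then extract the
    uneven [start, end) range this rank owns under tensor fusion."""
    global_tensor = np.concatenate(even_slices, axis=shard_axis)
    start, end = uneven_offsets[rank]
    return np.take(global_tensor, np.arange(start, end), axis=shard_axis)

# Saved state: two even slices of 4 elements each; the fused layout
# assigns rank 0 elements [0, 3) and rank 1 elements [3, 8).
even = [np.arange(4), np.arange(4, 8)]
offsets = [(0, 3), (3, 8)]
print(restore_uneven(even, offsets, rank=0).tolist())  # [0, 1, 2]
print(restore_uneven(even, offsets, rank=1).tolist())  # [3, 4, 5, 6, 7]
```

Round-tripping through `normalize_for_save` at save time and this helper at load time leaves each rank with exactly the uneven slice it started with.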

Pcard-76459


paddle-bot bot commented Nov 29, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Nov 29, 2024
@AndSonder AndSonder force-pushed the tensor-fusion-save-load branch from 2e2d6ce to e14cc9f on December 10, 2024 12:57
jeff41404
jeff41404 previously approved these changes Dec 11, 2024
Contributor

@jeff41404 jeff41404 left a comment


LGTM

winter-wang
winter-wang previously approved these changes Dec 11, 2024
Contributor

@winter-wang winter-wang left a comment


LGTM

@AndSonder AndSonder dismissed stale reviews from winter-wang and jeff41404 via 55bd4f4 December 11, 2024 04:43
Contributor

@winter-wang winter-wang left a comment


LGTM

Member

@SigureMo SigureMo left a comment


LGTMeow 🐾 for API annotation change

@winter-wang winter-wang merged commit 5c2ded5 into PaddlePaddle:develop Dec 12, 2024
27 of 28 checks passed
5 participants