[AutoParallel] support sharding tensor-fusion save&load #69823


Merged: 13 commits, Dec 12, 2024

Conversation

AndSonder
Contributor

@AndSonder AndSonder commented Nov 29, 2024

PR Category

Auto Parallel

PR Types

New features

Description

Support the save&load strategy for the sharding tensor-fusion scenario.

Overview of the tensor-fusion save&load adaptation

1. Background

To reduce communication overhead, tensor fusion merges multiple small tensors into one larger tensor. Tensors that would otherwise be sharded independently are thus merged into a single whole, and when this fused tensor is distributed across devices, individual parameters may end up unevenly sliced. Save&load therefore needs to be adapted after tensor fusion.
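As a toy illustration of this uneven slicing (hypothetical sizes, plain NumPy rather than Paddle): evenly splitting a fused buffer across devices cuts individual parameters at arbitrary offsets.

```python
import numpy as np

# Two parameters of sizes 5 and 3, fused into a single flat buffer of size 8.
p0 = np.arange(5)
p1 = np.arange(5, 8)
fused = np.concatenate([p0, p1])

# Sharding across 2 devices splits the fused buffer evenly (4 + 4):
# device 0 holds all but the last element of p0; device 1 holds p0's last
# element plus all of p1 -- neither parameter is sliced evenly by itself.
shard0, shard1 = fused[:4], fused[4:]
print(shard0.tolist())  # [0, 1, 2, 3]
print(shard1.tolist())  # [4, 5, 6, 7]
```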

2. Design

The main idea is to process the parameters at save and load time. On save, inside the state_dict function, the unevenly sliced optimizer parameters are gathered (all_gather) back into a global-view tensor according to the grouping information, and each card then keeps its own portion according to the sharding axis. In other words, state_dict ends up in the same state as when tensor-fusion is disabled.

On load, since the saved parameters are evenly sliced, we need to convert them back to the unevenly sliced state.

2.1 Saving the optimizer state

When saving the optimizer state, the optimizer parameters may already be sliced across multiple devices. To keep the saved state identical to the non-tensor-fusion case, the optimizer parameters on each device are processed as follows:

  • Gather the sliced parameters: for unevenly sliced parameters, collect each device's portion back into the global view via distributed communication (e.g. all_gather).
  • Re-slice per device: from the gathered tensor, extract the portion each device should hold according to the sharding axis, so that the saved state is consistent across devices.
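The save-time steps above can be sketched as follows. This is a minimal NumPy sketch, not Paddle's actual API: `normalize_for_save` and its arguments are hypothetical names, and a plain concatenation of the per-rank slices stands in for a real `all_gather`.

```python
import numpy as np

def normalize_for_save(local_uneven_slices, rank, world_size, shard_axis=0):
    """Rebuild the global tensor from uneven slices, then keep the even
    slice this rank would own if tensor fusion were disabled."""
    # Stand-in for all_gather: assume the list already holds every
    # rank's uneven portion, in rank order.
    global_tensor = np.concatenate(local_uneven_slices, axis=shard_axis)
    # Re-slice evenly along the sharding axis.
    even_parts = np.array_split(global_tensor, world_size, axis=shard_axis)
    return even_parts[rank]

# Example: two ranks hold uneven pieces of sizes 3 and 5 of an
# 8-element tensor; after normalization each rank saves an even half.
uneven = [np.arange(3), np.arange(3, 8)]
print(normalize_for_save(uneven, rank=0, world_size=2).tolist())  # [0, 1, 2, 3]
print(normalize_for_save(uneven, rank=1, world_size=2).tolist())  # [4, 5, 6, 7]
```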

2.2 Loading the optimizer state

On load, the saved parameters are in an evenly sliced state, so parameters that were unevenly sliced before saving must be restored to their original uneven slicing. The steps are:

  • Re-slice the parameters: using the sharding axis information, restore the evenly sliced saved state to the uneven slicing each device uses under tensor fusion.
  • Keep device slicing consistent: each device loads only the portion it needs.
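The load-time inverse can be sketched the same way (again hypothetical names, not Paddle's API): rebuild the global tensor from the saved even slices, then cut out the uneven range this rank holds under tensor fusion.

```python
import numpy as np

def restore_uneven(even_slices, uneven_offsets, rank, shard_axis=0):
    """Rebuild the global tensor from even slices, then extract the
    uneven [start, end) range this rank owns under tensor fusion."""
    global_tensor = np.concatenate(even_slices, axis=shard_axis)
    start, end = uneven_offsets[rank]
    return np.take(global_tensor, np.arange(start, end), axis=shard_axis)

# Saved state: two even slices of 4 elements each; the fused layout
# assigns rank 0 elements [0, 3) and rank 1 elements [3, 8).
even = [np.arange(4), np.arange(4, 8)]
offsets = [(0, 3), (3, 8)]
print(restore_uneven(even, offsets, rank=0).tolist())  # [0, 1, 2]
print(restore_uneven(even, offsets, rank=1).tolist())  # [3, 4, 5, 6, 7]
```

Round-tripping through `normalize_for_save` at save time and this helper at load time leaves each rank with exactly the uneven slice it started with.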

Pcard-76459


paddle-bot bot commented Nov 29, 2024

Your PR has been submitted. Thanks for your contribution!
Please wait for the result of CI first. See the Paddle CI Manual for details.

@paddle-bot paddle-bot bot added the contributor External developers label Nov 29, 2024
@AndSonder AndSonder force-pushed the tensor-fusion-save-load branch from 2e2d6ce to e14cc9f on December 10, 2024 12:57
jeff41404
jeff41404 previously approved these changes Dec 11, 2024
Contributor

@jeff41404 jeff41404 left a comment


LGTM

winter-wang
winter-wang previously approved these changes Dec 11, 2024
Contributor

@winter-wang winter-wang left a comment


LGTM

@AndSonder AndSonder dismissed stale reviews from winter-wang and jeff41404 via 55bd4f4 December 11, 2024 04:43
Contributor

@winter-wang winter-wang left a comment


LGTM

Member

@SigureMo SigureMo left a comment


LGTMeow 🐾 for API annotation change

@winter-wang winter-wang merged commit 5c2ded5 into PaddlePaddle:develop Dec 12, 2024
27 of 28 checks passed
5 participants