This repository was archived by the owner on Aug 7, 2024. It is now read-only.

[wip] make Float8Linear amax init more FSDP+compile friendly #171

Closed
wants to merge 1 commit

Conversation

@vkuzo (Contributor) commented Dec 28, 2023

Summary:

We need to use functional collectives so that torch.compile can trace through the distributed code
(https://github.com/pytorch/pytorch/blob/main/torch/distributed/_functional_collectives.py).

Numerics are currently off; debugging.
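
For context, a minimal sketch of the intended change (the function and tensor names are illustrative, not the actual amax sync code in this repo): an in-place collective mutates its input, which torch.compile cannot trace cleanly, while a functional collective returns a new tensor.

```
import torch
import torch.distributed as dist
import torch.distributed._functional_collectives as funcol

def sync_amax_inplace(amax: torch.Tensor) -> torch.Tensor:
    # in-place collective: mutates `amax`, causes a graph break under torch.compile
    dist.all_reduce(amax, op=dist.ReduceOp.MAX)
    return amax

def sync_amax_functional(amax: torch.Tensor) -> torch.Tensor:
    # functional collective: returns a new tensor, which torch.compile can trace
    return funcol.all_reduce(amax, "max", group=dist.group.WORLD)
```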

Test Plan:

Reviewers:

Subscribers:

Tasks:

Tags:

@facebook-github-bot added the CLA Signed label on Dec 28, 2023
facebook-github-bot pushed a commit that referenced this pull request Jan 3, 2024
Summary:
This adds a couple of config options to unbreak autocast + compile + FSDP + Float8Linear. To enable these options, the user needs to do:

```
config.enable_amax_init = False
config.enable_pre_and_post_forward = False
```

The `enable_amax_init` config adds the option to disable amax initialization. Amax initialization is currently broken with this combination because:
1. FSDP is not full-graph friendly (regardless of compile)
2. the amax init function has a graph break in distributed code because it uses in-place distributed collectives. I did try to use functional collectives (#171), but that ran into numerical issues with compile, so for now we just work around it.
3. graph breaks in Float8Linear code are not supported because of the issue documented in #166
4. so, as a workaround for all of the above, we just skip amax init for now. We know from NVIDIA that this path is not needed for model convergence, and TE does not support it at all; it was nice for testing but is not necessary for training jobs.
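
For concreteness, a minimal sketch of where these flags would sit in a training setup relative to module swapping, FSDP wrapping, and torch.compile. The `float8_experimental` import paths and the swap helper below are assumptions about the repo layout at the time, not verified API; adjust to the actual code.

```
import torch
import torch.nn as nn
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

# NOTE: import paths / helper names below are assumptions, not verified API
from float8_experimental import config
from float8_experimental.float8_linear import Float8Linear
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear

# set the workaround flags before constructing/swapping the model
config.enable_amax_init = False
config.enable_pre_and_post_forward = False

# assumes torch.distributed is already initialized (e.g. via torchrun)
model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))
swap_linear_with_float8_linear(model, Float8Linear)  # nn.Linear -> Float8Linear
model = FSDP(model.cuda())  # wrap with FSDP first
model = torch.compile(model)  # then compile
```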

The second config option, `enable_pre_and_post_forward`, disables the pre-forward and post-forward logic. I don't have a unit test repro for now, but this does unbreak LLaMa 7B on 8 GPUs with FSDP + compile. Specifically, the thing that is broken in pre-forward/post-forward is assignment to module attributes. My hunch is that this graph breaks when autocast + FSDP are on, and graph breaks are not supported due to (3) above.
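
For illustration, a hypothetical sketch of the pattern described above: a module whose forward does pre/post bookkeeping by assigning module attributes, which is the kind of side effect that can graph break under torch.compile. This is not the actual Float8Linear implementation.

```
import torch
import torch.nn as nn

class LinearWithBookkeeping(nn.Linear):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.is_initialized = False
        self.forward_calls = 0

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # "pre-forward": attribute assignment on the module
        self.is_initialized = True
        y = super().forward(x)
        # "post-forward": more bookkeeping via attribute assignment
        self.forward_calls += 1
        return y

# assumes a CUDA device is available
m = torch.compile(LinearWithBookkeeping(16, 16).cuda())
with torch.autocast("cuda", dtype=torch.bfloat16):
    out = m(torch.randn(2, 16, device="cuda"))
```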

Pull Request resolved: #172

Test Plan:
```
// unit / integration tests
with-proxy test/test_everything.sh

// run the LLaMa 7b trainer on 8 GPUs with autocast + compile + FSDP + Float8Linear, no compile errors
```

Reviewed By: drisspg

Differential Revision: D52468625

Pulled By: vkuzo

fbshipit-source-id: be4fac927b8520602ed018e96d7a49056e9c6e06
@drisspg closed this Apr 3, 2024