[wip] enable Float8Tensor as subgraph boundary #166
Conversation
Since I think that this might be hard to solve, should we make an issue tracking this and …
Summary:

This adds a couple of config options to unbreak autocast + compile + FSDP + Float8Linear. To enable these options, the user needs to do:

```
config.enable_amax_init = False
config.enable_pre_and_post_forward = False
```

The `enable_amax_init` config adds the option to disable amax initialization. The reason this is currently broken is:
1. FSDP is not full-graph friendly (regardless of compile)
2. the amax init function has a graph break in distributed code because it uses inplace distributed collectives. I did try to use functional collectives (#171), but that ran into numerical issues with compile, so for now just working around it.
3. graph breaks in Float8Linear code are not supported because of the issue documented in #166
4. so, as a workaround for all of the above, we just skip amax init for now. We do know from NVIDIA that this path is not needed for model convergence, and TE does not support this at all. It was nice for testing but not necessary for training jobs.

The second config option disables pre-forward and post-forward. I don't have a repro in a unit test for now, but this does unbreak LLaMa 7B on 8 GPUs with FSDP + compile. Specifically, the thing which is broken in pre-forward/post-forward is assignment on module attributes. My hunch is that this graph breaks if autocast + FSDP are on, and graph breaks are not supported due to (3) above.

Pull Request resolved: #172

Test Plan:

```
// unit / integration tests
with-proxy test/test_everything.sh

// run the LLaMa 7b trainer on 8 GPUs with autocast + compile + FSDP + Float8Linear, no compile errors
```

Reviewed By: drisspg

Differential Revision: D52468625

Pulled By: vkuzo

fbshipit-source-id: be4fac927b8520602ed018e96d7a49056e9c6e06
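For readers following along, a minimal usage sketch of the two workaround flags described above. The flag names come from the commit message; the import path (`float8_experimental.config`) and the surrounding setup are assumptions, not a prescribed recipe.

```python
# Sketch only: import path and setup are assumptions; the two flags are the
# ones named in the commit message above.
import torch
import float8_experimental.config as config

# Skip amax initialization, avoiding the in-place distributed collectives
# that graph-break under FSDP + compile.
config.enable_amax_init = False

# Skip the pre-forward/post-forward hooks, whose module-attribute assignment
# graph-breaks when autocast + FSDP are on.
config.enable_pre_and_post_forward = False

# ...then build the model, swap in Float8Linear, wrap with FSDP, and compile
# as usual, e.g. model = torch.compile(FSDP(model)).
```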
Force-pushed from 94d5188 to 3d2a32e.
It might be worth treating these two tests separately: (1) a test with a graph break in the middle, and (2) a test with no graph breaks, but where we manually pass a `Float8Tensor` as a graph input. That way we can tell if the problem is specific to "Float8Tensor as graph input", or if it is some more subtle bug around graph breaks.
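A rough sketch of how those two cases could be split into separate tests. The `Float8Tensor.to_float8` constructor, the `to_original_precision()` call, and the backend choice are assumptions for illustration, not copies of this repo's tests.

```python
import torch
from float8_experimental.float8_tensor import Float8Tensor  # assumed import path

def test_graph_break_in_middle():
    # Case (1): a Float8Tensor is live across a forced graph break, so it is an
    # output of the first subgraph and an input of the second.
    def fn(x):
        x_fp8 = Float8Tensor.to_float8(x, torch.tensor(1.0), torch.float8_e4m3fn)  # assumed API
        torch._dynamo.graph_break()
        return x_fp8.to_original_precision() + 1.0  # assumed API

    x = torch.randn(16, 16)
    torch.testing.assert_close(torch.compile(fn, backend="eager")(x), fn(x))

def test_float8tensor_as_graph_input():
    # Case (2): no graph breaks at all, but a Float8Tensor is passed in
    # directly as an input of the compiled region.
    def fn(x_fp8):
        return x_fp8.to_original_precision() + 1.0  # assumed API

    x_fp8 = Float8Tensor.to_float8(
        torch.randn(16, 16), torch.tensor(1.0), torch.float8_e4m3fn
    )  # assumed API
    torch.testing.assert_close(torch.compile(fn, backend="eager")(x_fp8), fn(x_fp8))
```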
Force-pushed from 3d2a32e to 37ab944.
Opened pytorch/pytorch#117115 to track this in the PyTorch repo.
Summary:

In https://github.com/pytorch/pytorch/pull/114311/files, the signature expected by traceable subclasses changed. This PR updates `Float8Tensor` to the new spec.

Note: doesn't work yet, need to debug. This may be blocking composability of Float8Linear + FSDP + torch.compile.

Test Plan:

```
pytest test/test_compile.py -s --sw -k graph_break
// currently broken
// logs: https://gist.github.com/vkuzo/ba98a01a459fb9c966f167d8ecca1780
```

Reviewers:

Subscribers:

Tasks:

Tags:
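For context, a sketch of what the traceable-subclass hooks look like in general. The field names, the metadata layout, and the extra `outer_size`/`outer_stride` arguments are assumptions based on the current upstream protocol, not a copy of this repo's `Float8Tensor`.

```python
import torch

class WrapperSketch(torch.Tensor):
    # Minimal wrapper subclass illustrating the flatten/unflatten protocol that
    # torch.compile uses to trace through tensor subclasses. __torch_dispatch__
    # is omitted, so this only demonstrates construction and (un)flattening.

    @staticmethod
    def __new__(cls, data, scale, orig_dtype):
        return torch.Tensor._make_wrapper_subclass(
            cls, data.shape, dtype=orig_dtype, device=data.device
        )

    def __init__(self, data, scale, orig_dtype):
        self._data = data           # low-precision payload
        self._scale = scale         # per-tensor scale
        self._orig_dtype = orig_dtype

    def __tensor_flatten__(self):
        # Names of the inner tensor attributes, plus non-tensor context.
        return ["_data", "_scale"], (self._orig_dtype,)

    @staticmethod
    def __tensor_unflatten__(inner_tensors, flatten_spec, outer_size, outer_stride):
        # Rebuild the subclass from inner tensors and context; outer_size and
        # outer_stride describe the wrapper's shape/stride during tracing.
        (orig_dtype,) = flatten_spec
        return WrapperSketch(inner_tensors["_data"], inner_tensors["_scale"], orig_dtype)
```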
Force-pushed from 37ab944 to 03149ef.
@bdhirsh, good idea. I added testing for input and output and added the results to the summary. The "Float8Tensor as output" case is fishy: we see fake tensors.
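A sketch of the "Float8Tensor as output" case being described, where the symptom is that the inner tensors of the returned subclass come back as fake tensors. The constructor and backend choice are assumptions for illustration.

```python
import torch
from float8_experimental.float8_tensor import Float8Tensor  # assumed import path

def to_fp8(x):
    # The compiled region constructs and returns a Float8Tensor, so the
    # subclass crosses the graph boundary as an output.
    return Float8Tensor.to_float8(x, torch.tensor(1.0), torch.float8_e4m3fn)  # assumed API

y = torch.compile(to_fp8, backend="aot_eager")(torch.randn(16, 16))
# Inspect whether the inner data tensor is a real tensor or a FakeTensor
# leaking out of tracing.
print(type(y), type(y._data))
```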
Summary:
Explorations to see how we can enable graph breaks with Float8Tensor at the boundary.
This may be blocking composability of Float8Linear + FSDP + torch.compile.
Test Plan:
Reviewers:
Subscribers:
Tasks:
Tags: