This repository was archived by the owner on Aug 7, 2024. It is now read-only.

[wip] enable Float8Tensor as subgraph boundary #166

Closed
wants to merge 1 commit from the 20231219_graph_break branch

Conversation

vkuzo (Contributor) commented Dec 19, 2023

Summary:

Explorations of how we can enable graph breaks with Float8Tensor at the subgraph boundary.

This may be blocking composability of Float8Linear + FSDP + torch.compile.

Test Plan:

pytest test/test_compile.py -s --sw -k graph_break
// currently broken
// logs: https://gist.github.com/vkuzo/53e92dfd91879c0710975b8a657fd91d

pytest test/test_compile.py -s --sw -k test_float8_graph_input 2>&1 | with-proxy gh gist create
// works!
// logs: https://gist.github.com/vkuzo/7a4c7252010157aa49f48a24046ec014

pytest test/test_compile.py -s --sw -k test_float8_graph_output 2>&1 | with-proxy gh gist create
// broken, has fake tensors when it should have real tensors
// logs: https://gist.github.com/vkuzo/35230bb05902733a58c19a7e1f04da45
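
For context, a minimal sketch (not the actual test code) of the graph-break pattern the first command exercises; a plain tensor stands in for Float8Tensor so the sketch runs standalone.

```
# Minimal sketch, assuming a plain tensor as a stand-in for Float8Tensor;
# the real test passes a Float8Tensor across the subgraph boundary.
import torch

def fn(x):
    y = x + 1.0
    # force a graph break so `y` must cross a subgraph boundary
    torch._dynamo.graph_break()
    return y * 2.0

compiled = torch.compile(fn)
out = compiled(torch.randn(4, 4))
```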

Reviewers:

Subscribers:

Tasks:

Tags:

facebook-github-bot added the CLA Signed label (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Dec 19, 2023
vkuzo requested a review from bdhirsh on December 19, 2023 16:11
drisspg (Contributor) commented Dec 19, 2023

Since I think this might be hard to solve, should we open an issue to track it and xfail the test?

vkuzo added a commit that referenced this pull request Dec 31, 2023
Summary:

This adds a couple of config options to unbreak autocast + compile +
FSDP + Float8Linear. To enable these options, the user needs to do:

```
config.enable_amax_init = False
config.enable_pre_and_post_forward = False
```

The `enable_amax_init` config adds the option to disable amax initialization.
The reason this is currently broken is:
1. FSDP is not full-graph friendly (regardless of compile)
2. the amax init function has a graph break in distributed code because
   it uses inplace distributed collectives.  I did try to use functional
   collectives, but that ran into numerical issues with compile, so for
   now just working around it.
3. graph breaks in Float8Linear code are not supported because of the
   issue documented in
   #166
4. so, as a workaround for all of the above, we just skip amax init for
   now.  We do know from NVIDIA that this path is not needed for model
   convergence, and TE does not support this at all. It was nice for
   testing but not necessary for training jobs.

The second config option disables pre-forward and post-forward. I don't
have a repro in a unit test for now, but this does unbreak LLaMa 7B on 8
GPUs with FSDP + compile. Specifically, what is broken in
pre-forward/post-forward is assignment to module attributes. My hunch is
that this graph breaks when autocast + FSDP are on, and graph breaks are
not supported due to (3) above.
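
As an illustration, a hedged sketch of how these two toggles might be applied before converting and compiling a model; the `float8_experimental.config` module path, the `Float8Linear` / `swap_linear_with_float8_linear` imports, and the swap signature are assumptions about the library's API of this era, not verified against this exact revision.

```
# Hedged sketch, assuming the float8_experimental API of this era;
# import paths and the swap signature below are assumptions.
import torch
import float8_experimental.config as config
from float8_experimental.float8_linear import Float8Linear
from float8_experimental.float8_linear_utils import swap_linear_with_float8_linear

# the two toggles added by this commit
config.enable_amax_init = False              # skip amax init (graph-breaks under FSDP)
config.enable_pre_and_post_forward = False   # skip pre/post-forward attribute writes

model = torch.nn.Sequential(torch.nn.Linear(64, 64), torch.nn.ReLU())
swap_linear_with_float8_linear(model, Float8Linear)  # assumed signature
model = torch.compile(model)
```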

Test Plan:

```
// unit / integration tests
with-proxy test/test_everything.sh

// run the LLaMa 7b trainer on 8 GPUs with autocast + compile + FSDP + Float8Linear, no compile errors
```

Reviewers:

Subscribers:

Tasks:

Tags:
vkuzo added a commit that referenced this pull request Dec 31, 2023
vkuzo added a commit that referenced this pull request Jan 2, 2024
facebook-github-bot pushed a commit that referenced this pull request Jan 3, 2024
Summary:
This adds a couple of config options to unbreak autocast + compile + FSDP + Float8Linear. To enable these options, the user needs to do:

```
config.enable_amax_init = False
config.enable_pre_and_post_forward = False
```

The `enable_amax_init` config adds the option to disable amax initialization. The reason this is currently broken is:
1. FSDP is not full-graph friendly (regardless of compile)
2. the amax init function has a graph break in distributed code because it uses inplace distributed collectives.  I did try to use functional collectives (#171), but that ran into numerical issues with compile, so for now just working around it.
3. graph breaks in Float8Linear code are not supported because of the issue documented in #166
4. so, as a workaround for all of the above, we just skip amax init for now.  We do know from NVIDIA that this path is not needed for model convergence, and TE does not support this at all. It was nice for testing but not necessary for training jobs.

The second config option disables pre-forward and post-forward. I don't have a repro in a unit test for now, but this does unbreak LLaMa 7B on 8 GPUs with FSDP + compile. Specifically, what is broken in pre-forward/post-forward is assignment to module attributes. My hunch is that this graph breaks when autocast + FSDP are on, and graph breaks are not supported due to (3) above.
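
For illustration, a hedged sketch of the kind of module-attribute write during forward that this toggle sidesteps; the module and attribute below are hypothetical, not the real Float8Linear fields.

```
# Illustrative only: a toy module whose forward writes a module attribute,
# the pattern that `config.enable_pre_and_post_forward = False` avoids.
# The attribute name is hypothetical, not a real Float8Linear field.
import torch

class AttrWritingLinear(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = torch.nn.Linear(8, 8)
        self.last_input_shape = None  # hypothetical bookkeeping attribute

    def forward(self, x):
        self.last_input_shape = x.shape  # attribute assignment during forward
        return self.linear(x)

m = torch.compile(AttrWritingLinear())
y = m(torch.randn(2, 8))
```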

Pull Request resolved: #172

Test Plan:
```
// unit / integration tests
with-proxy test/test_everything.sh

// run the LLaMa 7b trainer on 8 GPUs with autocast + compile + FSDP + Float8Linear, no compile errors
```

Reviewed By: drisspg

Differential Revision: D52468625

Pulled By: vkuzo

fbshipit-source-id: be4fac927b8520602ed018e96d7a49056e9c6e06
vkuzo force-pushed the 20231219_graph_break branch from 94d5188 to 3d2a32e on January 10, 2024 15:21
bdhirsh (Contributor) commented Jan 10, 2024

It might be worth treating these two tests separately:

(1) a test with a graph break in the middle

(2) a test with no graph breaks, but where we manually pass a Float8Tensor as a user input (or return one as a user output)

That way we can tell if the problem is specific to "Float8Tensor as graph input", or if it is some more subtle bug around graph breaks.
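
A hedged sketch of the two test shapes being suggested, with a trivial tensor subclass standing in for Float8Tensor so the snippet is self-contained; the real tests run these under torch.compile with an actual Float8Tensor.

```
# Sketch only: ToyTensor stands in for Float8Tensor; the real tests compile
# these functions and feed an actual Float8Tensor.
import torch

class ToyTensor(torch.Tensor):
    pass

def subclass_as_graph_io(x):
    # case (2): no graph break; the subclass is a graph input and output
    return x + 1.0

def subclass_across_graph_break(x):
    # case (1): a graph break in the middle; the subclass crosses the boundary
    y = x + 1.0
    torch._dynamo.graph_break()
    return y * 2.0

t = torch.randn(4).as_subclass(ToyTensor)
print(type(subclass_as_graph_io(t)))           # eager sanity check
print(type(subclass_across_graph_break(t)))
```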

vkuzo (Contributor, Author) commented Jan 10, 2024

Opened pytorch/pytorch#117115 to track this in the PyTorch repo.

Summary:

In https://github.com/pytorch/pytorch/pull/114311/files, the signature
expected by traceable subclasses changed.

This PR updates `Float8Tensor` to the new spec.

Note: doesn't work yet, need to debug.

This may be blocking composability of Float8Linear + FSDP +
torch.compile.
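
For reference, a hedged sketch of the protocol shape the new spec appears to expect after pytorch/pytorch#114311, using a toy wrapper subclass rather than the real Float8Tensor; the exact argument names are assumptions.

```
# Toy wrapper subclass showing the (assumed) updated traceable-subclass
# protocol after pytorch/pytorch#114311; not the real Float8Tensor.
import torch

class ToyWrapperTensor(torch.Tensor):
    @staticmethod
    def __new__(cls, data):
        return torch.Tensor._make_wrapper_subclass(cls, data.shape, dtype=data.dtype)

    def __init__(self, data):
        self._data = data

    def __tensor_flatten__(self):
        # (names of inner tensor attributes, arbitrary non-tensor metadata)
        return ["_data"], None

    @staticmethod
    def __tensor_unflatten__(inner_tensors, metadata, outer_size, outer_stride):
        # assumed new spec: outer_size / outer_stride are passed in by the
        # tracer in addition to the inner tensors and metadata
        return ToyWrapperTensor(inner_tensors["_data"])
```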

Test Plan:

```
pytest test/test_compile.py -s --sw -k graph_break
// currently broken
// logs: https://gist.github.com/vkuzo/ba98a01a459fb9c966f167d8ecca1780
```

Reviewers:

Subscribers:

Tasks:

Tags:
vkuzo force-pushed the 20231219_graph_break branch from 37ab944 to 03149ef on January 10, 2024 16:04
vkuzo (Contributor, Author) commented Jan 10, 2024

> It might be worth treating these two tests separately:
>
> (1) a test with a graph break in the middle
>
> (2) a test with no graph breaks, but where we manually pass a Float8Tensor as a user input (or return one as a user output)
>
> That way we can tell if the problem is specific to "Float8Tensor as graph input", or if it is some more subtle bug around graph breaks.

@bdhirsh, good idea. I added testing for input and output and added the results to the summary. The "Float8Tensor as output" case is fishy; we see fake tensors.

vkuzo changed the title from "[doesn't work yet] enable Float8Tensor as subgraph boundary" to "[wip] enable Float8Tensor as subgraph boundary" on Jan 16, 2024
vkuzo mentioned this pull request Jan 16, 2024
drisspg closed this Jan 24, 2024
facebook-github-bot pushed a commit that referenced this pull request Jan 24, 2024
Summary:
This is a copy with some tweaks of: #166

Pull Request resolved: #196

Reviewed By: bdhirsh

Differential Revision: D53056829

Pulled By: drisspg

fbshipit-source-id: 1856d2e4bb1410161e326795014a56d0f91249e0