This repository was archived by the owner on Aug 7, 2024. It is now read-only.

Combine amax reduction calls #163

Closed
wants to merge 1 commit into from

Conversation

Contributor

@y-sq commented Dec 15, 2023

Combine the amax sync reductions (combine-reduction is now the default behavior, not an option).

  • Combine the reduction call for each amax tensor type (3 all_reduce calls in total). These could be further combined into one single call; see the sketch after this comment.
    • Verified that existing tests still pass, so existing benchmark code does not need to change:
      • pytest test/test_base.py
      • ./test/test_fsdp.sh
  • Tested the new option using small llama models with 8 FSDP groups. Time spent in sync_float8_amax_and_scale_history dropped from 29ms[1] to 10ms[2].

[1] Traces without combine reduction, https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.138932292910521.json.gz&bucket=acadia
[2] https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.202842416426594.json.gz&bucket=acadia
* Results from trace[2] were updated to the correct number.
** Need Meta internal access to open these traces.
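
For illustration, a minimal sketch of the per-type combined reduction described above, assuming each float8 layer exposes fp8_amax_x, fp8_amax_w, and fp8_amax_dL_dY buffers (the helper name and the copy-back loop are illustrative, not the exact code in this PR):

import torch
import torch.distributed as dist

def combined_amax_all_reduce(fp8_layers):
    # One all_reduce per amax tensor type instead of one per layer.
    for attr in ("fp8_amax_x", "fp8_amax_w", "fp8_amax_dL_dY"):
        # Stack the per-layer amax buffers into a single flat tensor.
        stacked = torch.cat([getattr(layer, attr).reshape(1) for layer in fp8_layers])
        dist.all_reduce(stacked, op=dist.ReduceOp.MAX)
        # Write the globally reduced amax values back into each layer's buffer.
        for i, layer in enumerate(fp8_layers):
            getattr(layer, attr).copy_(stacked[i])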

@facebook-github-bot added the CLA Signed label on Dec 15, 2023
@y-sq changed the base branch from bench-multi-gpu to main on December 18, 2023 at 22:58
@y-sq force-pushed the combine-reduction branch from 4c6badd to 174b08a on December 18, 2023 at 22:58
@y-sq marked this pull request as ready for review on December 18, 2023 at 22:59
@facebook-github-bot
Contributor

@y-sq has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

@y-sq changed the title from conbine-reduction to Combine amax reduction calls on Dec 18, 2023
fp8_layers = get_float8_layers(model, fp8_classes)

if dist.is_initialized():
# TODO: Testing if combine_reduction improves performance.
Contributor

Seems like it does, right? So we can remove this TODO. Should we make this the default behavior?

device="cuda",
requires_grad=False,
)
# print("fp8_amax_x_tensor, ", fp8_amax_x_tensor)
Contributor

nit: probs remove right?

def sync_float8_amax_and_scale_history(
model: torch.nn.Module, fp8_classes=None
Contributor

we should probs document this function and also what the new args do / how they should be used, since I assume you get all the fp8_layers once and then pass that in every iteration
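
For illustration, one possible shape of that documentation, assuming the fp8_layers argument added in this PR (the docstring wording below is a sketch, not the text that landed):

import torch

def sync_float8_amax_and_scale_history(
    model: torch.nn.Module, fp8_classes=None, fp8_layers=None
):
    """Sync the amax histories and scaling factors of all float8 layers in the model.

    Args:
        model: the model whose float8 layers should be synced.
        fp8_classes: optional override of the float8 layer classes to search for.
        fp8_layers: optional pre-computed list of float8 layers. If provided, the
            module traversal is skipped; callers can collect the layers once with
            get_float8_layers(model, fp8_classes) and reuse the list every iteration.
    """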

Contributor

@drisspg left a comment

Some small comments and I think you need to run ufmt format . but otherwise awesome speed ups!!

)
# print("fp8_amax_x_tensor, ", fp8_amax_x_tensor)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
Contributor

fp8_amax_w_tensor?

Contributor

ohh damn good catch

# print("fp8_amax_x_tensor, ", fp8_amax_x_tensor)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
Contributor

dL_dY?

Contributor Author

Thanks!!!
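
For clarity, after the fix pushed in response to these comments, the three reductions presumably look like this (the wrapper function below is illustrative; the tensor names come from the review thread):

import torch.distributed as dist

def _all_reduce_amax_tensors(fp8_amax_x_tensor, fp8_amax_w_tensor, fp8_amax_dL_dY_tensor):
    # One max-reduction per amax tensor type: activations, weights, and output grads.
    dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
    dist.all_reduce(fp8_amax_w_tensor, op=dist.ReduceOp.MAX)
    dist.all_reduce(fp8_amax_dL_dY_tensor, op=dist.ReduceOp.MAX)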

def sync_float8_amax_and_scale_history(
model: torch.nn.Module, fp8_classes=None
model: torch.nn.Module, fp8_classes=None, fp8_layers=None, combine_reduction=False
Contributor

can we measure time on single GPU, and if that's a net positive as well just delete the old code path? It would be great to keep things simple.

Contributor Author

@y-sq Dec 18, 2023

Did you mean the combine_reduction option or the fp8_layers option?

For combine_reduction, I think we can remove it and keep combine_reduction=True as default.

For fp8_layers, we have many existing tests and benchmarks (such as the single-GPU llama_7B benchmarks) that use the original call, sync_float8_amax_and_scale_history(model). So I kept it optional, and None is supported.
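
For illustration, a usage sketch of the two call patterns described above (model, dataloader, and optimizer are placeholders; the two function names come from the PR):

# Existing call path (single-GPU benchmarks): layers are rediscovered on every call.
sync_float8_amax_and_scale_history(model)

# FSDP call path: collect the float8 layers once, then reuse them each iteration.
fp8_layers = get_float8_layers(model, fp8_classes=None)
for batch in dataloader:
    loss = model(batch).sum()
    loss.backward()
    sync_float8_amax_and_scale_history(model, fp8_layers=fp8_layers)
    optimizer.step()
    optimizer.zero_grad()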

@y-sq force-pushed the combine-reduction branch 4 times, most recently from 160a899 to 435df2d, on December 19, 2023 at 00:21
Contributor Author

y-sq commented Dec 19, 2023

Updates:

  • Add comments and remove the unused debug comments
  • Remove the "combine_reduction" option; always combine the reductions instead
  • Fix the all_reduce calls to use the correct tensors, fp8_amax_w_tensor and fp8_amax_dL_dY_tensor

@facebook-github-bot
Contributor

@y-sq has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.

Contributor Author

y-sq commented Dec 19, 2023

I ran the format check on my devgpu server, which didn't give any errors:

$ ufmt check .
✨ 22 files already formatted ✨

However, the check on GitHub still failed.

Summary:
~~Add an option to combine the amax sync reduction~~ (Use combine-reduction as the default behavior)
- Combine the reduction call for each amax tensor type (3 all_reduce calls in total). These could be further combined into one single call; see the sketch after this summary.
  - Verified that existing tests still pass, so existing benchmark code does not need to change:
    - pytest test/test_base.py
    - ./test/test_fsdp.sh
- Tested the new option using small llama models with 8 FSDP groups. Time spent in sync_float8_amax_and_scale_history dropped from 29ms[1] to 3ms[2].

[1] Traces without combine reduction, https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.138932292910521.json.gz&bucket=acadia
[2] https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.202842416426594.json.gz&bucket=acadia
\* Trace[2] was updated after addressing the comments.
\*\* Need Meta internal access to open these traces.


Reviewed By: drisspg

Differential Revision: D52271595

Pulled By: y-sq
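
The summary above mentions further combining the three per-type reductions into one single call; a rough sketch of that idea (not what this PR landed; the function name is illustrative):

import torch
import torch.distributed as dist

def single_all_reduce_amax(fp8_amax_x_tensor, fp8_amax_w_tensor, fp8_amax_dL_dY_tensor):
    # Concatenate the three per-type amax tensors so only one all_reduce is issued.
    tensors = (fp8_amax_x_tensor, fp8_amax_w_tensor, fp8_amax_dL_dY_tensor)
    sizes = [t.numel() for t in tensors]
    combined = torch.cat(tensors)
    dist.all_reduce(combined, op=dist.ReduceOp.MAX)
    # Split the reduced result and copy it back into the original tensors.
    for t, reduced in zip(tensors, torch.split(combined, sizes)):
        t.copy_(reduced)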
@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D52271595

@facebook-github-bot
Contributor

@y-sq merged this pull request in b099049.

Labels: CLA Signed, fb-exported, Merged