Conversation
Force-pushed from 78b58f9 to afc9496
Force-pushed from 4c6badd to 174b08a
@y-sq has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
fp8_layers = get_float8_layers(model, fp8_classes)

if dist.is_initialized():
    # TODO: Testing if combine_reduction improves performance.
Seems like it does, right? So we can remove it. Should we make this the default behavior?
    device="cuda",
    requires_grad=False,
)
# print("fp8_amax_x_tensor, ", fp8_amax_x_tensor)
nit: probs remove right?
def sync_float8_amax_and_scale_history(
    model: torch.nn.Module, fp8_classes=None
we should probs document this function and also what the new args do / how they should be used, since I assume you get all the fp8_layers once and then pass that in every iteration
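A minimal usage sketch of the pattern described above, assuming the helpers from this diff (get_float8_layers, sync_float8_amax_and_scale_history); the import path, model, dataloader, and train_step are hypothetical placeholders, not code from this PR:

# Import path assumed; adjust to wherever these helpers live in your checkout.
from float8_experimental.float8_linear_utils import (
    get_float8_layers,
    sync_float8_amax_and_scale_history,
)

# Scan the model for fp8 layers once, up front.
fp8_layers = get_float8_layers(model, fp8_classes=None)

for batch in dataloader:
    train_step(model, batch)  # hypothetical forward/backward/optimizer step
    # Reuse the cached layer list on every iteration instead of re-scanning.
    sync_float8_amax_and_scale_history(model, fp8_layers=fp8_layers)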
Some small comments, and I think you need to run ufmt format . But otherwise, awesome speed ups!!
)
# print("fp8_amax_x_tensor, ", fp8_amax_x_tensor)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
fp8_amax_w_tensor?
ohh damn good catch
# print("fp8_amax_x_tensor, ", fp8_amax_x_tensor)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
dL_dY?
Thanks!!!
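For reference, a sketch of what the corrected block presumably looks like after the two copy-paste fixes above; the weight and output-gradient tensor names are assumed to follow the diff's naming convention:

import torch.distributed as dist

# One MAX all_reduce per stacked amax type after the fix.
dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)      # activation amaxes
dist.all_reduce(fp8_amax_w_tensor, op=dist.ReduceOp.MAX)      # weight amaxes
dist.all_reduce(fp8_amax_dL_dY_tensor, op=dist.ReduceOp.MAX)  # output-grad amaxes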
def sync_float8_amax_and_scale_history(
-    model: torch.nn.Module, fp8_classes=None
+    model: torch.nn.Module, fp8_classes=None, fp8_layers=None, combine_reduction=False
can we measure time on single GPU, and if that's a net positive as well just delete the old code path? It would be great to keep things simple.
Did you mean the combine_reduction option or the fp8_layers option?
For combine_reduction, I think we can remove it and keep combine_reduction=True as the default.
For fp8_layers, we have many existing tests and benchmarks (such as single-gpu llama_7B benchmarks) that use the original call (sync_float8_amax_and_scale_history(model)). So I kept it as optional, and it can support None.
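To make the combine-reduction behavior under discussion concrete, here is a self-contained sketch of the idea for the activation amax only; the function name is hypothetical, the fp8_amax_x buffer name is assumed from this diff's conventions, and the weight and output-grad amaxes would get the same treatment:

import torch
import torch.distributed as dist

def sync_amax_x_combined(fp8_layers):
    # Gather every layer's running activation amax into one 1-D tensor.
    fp8_amax_x_tensor = torch.cat(
        [layer.fp8_amax_x.reshape(1) for layer in fp8_layers]
    )
    # A single MAX all_reduce covers all layers, instead of one collective per layer.
    if dist.is_initialized():
        dist.all_reduce(fp8_amax_x_tensor, op=dist.ReduceOp.MAX)
    # Write the reduced values back into the per-layer buffers.
    for i, layer in enumerate(fp8_layers):
        layer.fp8_amax_x.copy_(fp8_amax_x_tensor[i])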
Force-pushed from 160a899 to 435df2d
Updates:
@y-sq has imported this pull request. If you are a Meta employee, you can view this diff on Phabricator.
I ran the format check on my devgpu server, which didn't give any errors. However, the check on GitHub still failed.
Summary:

~~Add an option to combine the amax sync reduction~~ (Use combine-reduction as the default behavior)

- Combine the reduction calls for each type of amax scaling factor (3 all_reduce calls in total). We can also further combine them into one single call.
- Verified that other tests still pass, so we don't need to change existing benchmark code.
  - pytest test/test_base.py
  - ./test/test_fsdp.sh
- Tested the new option using small llama models with 8 FSDP groups. The time taken by sync_float8_amax_and_scale_history reduced from 29ms [1] to 3ms [2].

[1] Traces without combine reduction: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.138932292910521.json.gz&bucket=acadia
[2] https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.202842416426594.json.gz&bucket=acadia

* Trace [2] was updated after addressing the comments.
** Need Meta internal access to open these traces.

Reviewed By: drisspg
Differential Revision: D52271595
Pulled By: y-sq
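The "further combine them into one single call" idea from the summary could look roughly like the sketch below; the function name is hypothetical, the inputs are assumed to be the three stacked per-type amax tensors, and an initialized process group is assumed:

import torch
import torch.distributed as dist

def all_reduce_amaxes_in_one_call(amax_x, amax_w, amax_dL_dY):
    # Concatenate the three stacked per-type amax tensors into a single buffer.
    sizes = [amax_x.numel(), amax_w.numel(), amax_dL_dY.numel()]
    combined = torch.cat([amax_x, amax_w, amax_dL_dY])

    # A single MAX all_reduce replaces the three per-type collectives.
    dist.all_reduce(combined, op=dist.ReduceOp.MAX)

    # Split the reduced buffer back into per-type views and return them.
    return torch.split(combined, sizes)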
Force-pushed from 435df2d to a9f12a0
This pull request was exported from Phabricator. Differential Revision: D52271595
Add an option to combine the amax sync reduction (use combine-reduction as the default behavior).

[1] Traces without combine reduction: https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.138932292910521.json.gz&bucket=acadia
[2] https://www.internalfb.com/intern/perfdoctor/trace_view?filepath=tree/traces/trace.202842416426594.json.gz&bucket=acadia

* Results from trace [2] were updated to the correct number.
** Need Meta internal access to open these traces.