Add custom decompositions for cross entropy loss for the nvfuser executor #2043
want to double check the forward implementation. kinda looks a bit strange to me.
for more information, see https://pre-commit.ci
a60395a to b045ccb
Co-authored-by: Masaki Kozuki <mkozuki@nvidia.com>
just wondering if you have some numbers of cross entropy loss to compare this with the existing ones e.g. https://github.com/Lightning-AI/lightning-thunder/blob/c6928015914fdbdd708fd8e87fbd9d9c1b4a40ef/thunder/executors/triton_crossentropy.py?
@crcrpar I did compare performance against torch.compile (which uses Triton, but is that the same as the link you sent?). This is a benchmark I was going to add:
@IvanYashchuk, @jjsjann123 would you like to review again?
@beverlylytle Would you like to take a look, too?
👍
Co-authored-by: beverlylytle <57254617+beverlylytle@users.noreply.github.com>
Co-authored-by: beverlylytle <57254617+beverlylytle@users.noreply.github.com>
@IvanYashchuk did you want to take a last look?
…utor (#2043)

This PR adds custom decompositions for Cross-Entropy Loss for the nvFuser executor.
Adding these custom decompositions improves performance and allows further optimization in nvFuser.

For cross-entropy loss forward:
1. We move the take_along_axis computation before the log softmax is computed. This allows us to reduce memory traffic for the inputs.

For cross-entropy loss backward:
1. We replace a scatter op with an iota and a where op, as we don't have support for scatter exposed in nvFuser.
2. We can get rid of a reduction that shows up when backward is computed as nll_loss backward followed by log softmax backward.
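
To make the forward and backward changes described in this PR concrete, here is a minimal NumPy sketch. It is not the code from this PR and does not use the Thunder or nvFuser APIs; the function names are hypothetical, and it assumes 2D logits of shape `(N, C)`, integer class targets, mean reduction, and no `ignore_index` or label smoothing. It only illustrates why gathering the target logit first, and replacing the scatter with an iota/where mask, give the same results as the straightforward decomposition.

```python
import numpy as np


def _softmax(logits):
    # Numerically stable softmax over the class axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)


def ce_forward_reference(logits, target):
    # Baseline: materialize the full (N, C) log softmax, then gather.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    picked = np.take_along_axis(log_probs, target[:, None], axis=1)
    return -picked.mean()


def ce_forward_gather_first(logits, target):
    # Gather the target logit first; only a per-row log-sum-exp of
    # shape (N, 1) is needed afterwards, not the full log softmax.
    picked = np.take_along_axis(logits, target[:, None], axis=1)
    row_max = logits.max(axis=1, keepdims=True)
    lse = row_max + np.log(np.exp(logits - row_max).sum(axis=1, keepdims=True))
    return (lse - picked).mean()


def ce_backward_scatter(logits, target, grad_out=1.0):
    # Baseline: nll_loss backward (a scatter) followed by
    # log softmax backward (which contains a row-wise reduction).
    n = logits.shape[0]
    grad_logp = np.zeros_like(logits)
    np.put_along_axis(grad_logp, target[:, None], -grad_out / n, axis=1)
    return grad_logp - _softmax(logits) * grad_logp.sum(axis=1, keepdims=True)


def ce_backward_iota_where(logits, target, grad_out=1.0):
    # Build the one-hot mask from iota + where instead of a scatter.
    # The row sum of the scattered gradient is known to be -grad_out / n,
    # so the reduction from log softmax backward disappears.
    n, c = logits.shape
    iota = np.arange(c)[None, :]
    one_hot = np.where(iota == target[:, None], 1.0, 0.0)
    return (_softmax(logits) - one_hot) * (grad_out / n)
```

Comparing each pair of functions with `np.allclose` on random inputs is a quick way to check that the rewritten forms match the baseline under the assumptions above.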
cc @tfogal