Micro optimization for softmax_forward_kernel5 #762
insop wants to merge 6 commits into karpathy:master
Conversation
- Micro-optimize softmax_forward5: use __shfl_xor_sync for warpReduceMax so that all threads return the max
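As a sketch of the technique described in the commit above (the exact helper in the repo may differ), `__shfl_xor_sync` performs a butterfly exchange, so after log2(32) = 5 steps every lane of the warp holds the warp-wide maximum; with `__shfl_down_sync` only lane 0 would end up with the result and a separate broadcast would be needed:

```cuda
// Warp-level max reduction via butterfly (XOR) shuffle.
// After the loop, *every* lane in the warp holds the maximum,
// so no extra __shfl_sync broadcast from lane 0 is required.
__device__ float warpReduceMax(float val) {
    for (int offset = 16; offset > 0; offset /= 2) {
        val = fmaxf(val, __shfl_xor_sync(0xFFFFFFFF, val, offset));
    }
    return val;
}
```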
@gordicaleksa, @ngc92, @ademeure, it would be great if you could take a look at this PR when you get a chance.
Could you give a bit more detail about these changes? From a quick look, it seems like you changed a block-wise reduction into just a warp-level reduction. Is that correct?
Hi @ngc92, when I profiled the kernel, I looked more closely at the last part and determined that organizing the memory write as 4 floats improves memory throughput due to better coalesced access.
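A minimal sketch of the coalesced-write idea mentioned above (hypothetical helper name; the actual kernel code differs): packing four adjacent floats into a `float4` turns four scalar 4-byte stores per thread into a single vectorized 16-byte store, which the memory subsystem can service with fewer transactions:

```cuda
// Hypothetical illustration: write 4 consecutive floats as one float4.
// Requires the destination address (out + 4*idx floats) to be
// 16-byte aligned, which holds when 'out' is 16-byte aligned.
__device__ void store4(float* out, int idx, float a, float b, float c, float d) {
    reinterpret_cast<float4*>(out)[idx] = make_float4(a, b, c, d);
}
```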
Hi @ngc92, thank you.
This branch includes a micro-optimization for softmax_forward_kernel5.

Summary

- update warpReduceMax in attention_forward.cu to use __shfl_xor_sync instead of __shfl_down_sync, to be consistent with the other kernels (reduce to all threads in a warp)
- micro optimization for softmax_forward_kernel5

Result from ncu ./profile_gpt2cu: compared to the original code, this optimization shows improvements (left: original code, right: modified code).

Tests done:
- ./profile_gpt2cu
- ./attention_forward 4
- ./attention_forward 5

Output from modified code
Output from the original code
Output from ./attention_forward