[NPU]: Add support for the cross_entropy operator. #1148
@Tcc0403 would you mind taking a look at this PR?
@Tcc0403 Perhaps you might have missed this PR 😁
The cross entropy kernel is quite different from the current implementation, which calculates gradients in the forward pass. I assume that approach is quite memory intensive on NPU, so it is separated into two kernels here because of the UB size constraint. I wonder whether the original method (gradient calculation in the forward pass) is achievable or not. If not, I'm totally fine with standalone fwd/bwd kernels. However, I would like to add a backward kernel for the GPU cross entropy as well.
In fact, the original method is feasible. However, it does almost all of the work in the forward pass, both the loss computation and the gradient computation, which makes forward performance extremely poor. My intention was to balance the work between the forward and backward passes to get the best overall performance. In short, both the original method and the current method are feasible; it depends on which one you would prefer to adopt.
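For reference, the two designs discussed above can be sketched in plain Python for a single row of logits. This is only an illustration of the math, not the kernel code in this PR; the function names (`cross_entropy_fwd_with_grad`, `cross_entropy_fwd_only`, `cross_entropy_bwd`) are hypothetical, and the split version assumes the forward pass saves the log-sum-exp for the backward pass.

```python
import math


def cross_entropy_fwd_with_grad(logits, target):
    """Original style: loss and gradient are both computed in the forward pass."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    loss = m + math.log(s) - logits[target]
    grad = [e / s for e in exps]  # softmax(logits)
    grad[target] -= 1.0           # softmax - one_hot(target)
    return loss, grad


def cross_entropy_fwd_only(logits, target):
    """Split style, forward: compute the loss and keep the log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target], lse


def cross_entropy_bwd(logits, target, lse, grad_output=1.0):
    """Split style, backward: rebuild softmax from the saved log-sum-exp."""
    grad = [grad_output * math.exp(x - lse) for x in logits]
    grad[target] -= grad_output
    return grad
```

Both variants produce the same loss and gradient; they differ only in where the gradient work is done, which is the trade-off between forward-pass cost and a separate backward kernel.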
Let's do the original method first, since the kernel is directly used in the fused linear cross entropy op. As long as the full (forward + backward) perf is acceptable, it should be fine.
I have completed the revision. Please check it.
Summary
To address the UB overflow issue in the benchmark, we have added an operator with an NPU-friendly implementation of cross_entropy.
Performance is currently 5-6 times that of the native GPU code and only slightly lower than that of Hugging Face. We plan to investigate the remaining gap in future work.
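As a rough illustration of how a cross entropy computation can stay within a fixed on-chip buffer, the sketch below accumulates the log-sum-exp over the vocabulary one block at a time, so only `block_size` values need to be resident at once. This is an assumption about the general technique (an online, blockwise log-sum-exp), not a description of the kernel in this PR; `blockwise_logsumexp` and `block_size` are hypothetical names.

```python
import math


def blockwise_logsumexp(logits, block_size):
    """Compute log(sum(exp(logits))) while touching one block at a time."""
    running_max = -math.inf  # largest logit seen so far
    running_sum = 0.0        # sum of exp(x - running_max) over blocks seen so far
    for start in range(0, len(logits), block_size):
        block = logits[start:start + block_size]
        new_max = max(running_max, max(block))
        # Rescale the previous partial sum to the new reference maximum,
        # then add this block's contribution.
        running_sum = running_sum * math.exp(running_max - new_max)
        running_sum += sum(math.exp(x - new_max) for x in block)
        running_max = new_max
    return running_max + math.log(running_sum)


def blockwise_cross_entropy(logits, target, block_size):
    """Loss for one row: logsumexp(logits) - logits[target]."""
    return blockwise_logsumexp(logits, block_size) - logits[target]
```

The result matches the single-pass formula for any block size; a smaller block size trades more loop iterations for a smaller working set.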
Testing Done
- `make test` to ensure correctness
- `make checkstyle` to ensure code style
- `make test-convergence` to ensure convergence