[Research][PyTorch 2.6] Save compiled triton kernel as device binary code #1792
Seems we can retrieve the native binary from the Level Zero module using zeModuleGetNativeBinary.
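A minimal sketch of that retrieval, assuming an already-built `ze_module_handle_t` is available; zeModuleGetNativeBinary follows the usual Level Zero two-call size-query pattern (error handling omitted for brevity):

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Dump the device-native binary of a built module so it can be cached on disk.
std::vector<uint8_t> dump_native_binary(ze_module_handle_t module) {
    size_t size = 0;
    // First call: query the size of the native binary.
    zeModuleGetNativeBinary(module, &size, nullptr);
    std::vector<uint8_t> binary(size);
    // Second call: copy the binary into the caller-provided buffer.
    zeModuleGetNativeBinary(module, &size, binary.data());
    return binary;  // cache alongside the kernel metadata
}
```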
Yes, assuming they work, L0 has APIs we should be able to leverage.
I wanted to look into this to see if it could be related to #1721, but the numbers don't quite match, so I am not optimistic about that. Still, this could be a nice win for us, since compilation can take 100-300 ms, especially when register spills force a recompile.
I think this may be the solution: we can retrieve the native binary from the Level Zero module using zeModuleGetNativeBinary. Then, to reconstruct the L0 module at deployment time, create the Level Zero module with ze_module_format_t set to ZE_MODULE_FORMAT_NATIVE in the module descriptor (see the sketch below).
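A minimal sketch of that reload path, assuming the context/device handles and the cached bytes are supplied by the caller; the only change versus the SPIR-V path is ZE_MODULE_FORMAT_NATIVE plus the cached native binary as input (error handling omitted):

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Recreate a module from a previously cached native binary, skipping the SPIR-V JIT.
ze_module_handle_t load_native_module(ze_context_handle_t ctx,
                                      ze_device_handle_t dev,
                                      const std::vector<uint8_t> &native_bin) {
    ze_module_desc_t desc = {};
    desc.stype = ZE_STRUCTURE_TYPE_MODULE_DESC;
    desc.format = ZE_MODULE_FORMAT_NATIVE;   // instead of ZE_MODULE_FORMAT_IL_SPIRV
    desc.inputSize = native_bin.size();
    desc.pInputModule = native_bin.data();
    desc.pBuildFlags = nullptr;              // no build flags needed for native code

    ze_module_handle_t module = nullptr;
    zeModuleCreate(ctx, dev, &desc, &module, nullptr);
    return module;
}
```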
I have a prototype working. The Level Zero APIs are the easy part - we have to make significant changes to our Triton compilation flow to fit this into Triton's architecture. Fortunately, I think I can adjust the compilation flow while preserving the existing Triton layering. I will clean up my prototype and post it as a draft PR for review tomorrow.
There is a plan to enable AOT Inductor for Intel GPU in PyTorch 2.6. While working on the design, the PyTorch team realized that the Triton kernel is currently saved as SPIR-V (IR), whereas CUDA saves cubin (device binary code), which will affect E2E performance.
The PyTorch team is asking whether Triton can save the compiled kernel as device binary code and load it with the L0 runtime.