
[Research][PyTorch 2.6] Save compiled triton kernel as device binary code #1792

Open
vlad-penkin opened this issue Aug 7, 2024 · 6 comments · Fixed by #2148 · May be fixed by #2350

@vlad-penkin
Contributor

There is a plan to enable AOT Inductor for Intel GPU in PyTorch 2.6. While working on the design, the PyTorch team realized that the Triton kernel is currently saved as SPIR-V (IR), whereas for CUDA it is saved as a cubin (device binary), which will affect E2E performance:

The current implementation requires loading the SPIR-V at deployment time and then compiling it into device binary code through IGC. This adds compilation time compared to CUDA when the kernel is run in the deployment environment.

The PyTorch team is asking whether Triton can save the compiled kernel as device binary code and load it with the Level Zero (L0) runtime.

@etaf

etaf commented Aug 8, 2024

It seems we can retrieve the native binary from a Level Zero module using zeModuleGetNativeBinary.
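
For reference, a minimal sketch of the usual two-call pattern for that API, assuming `module` is a valid `ze_module_handle_t` built from the kernel's SPIR-V (the helper name and the omission of error checking are for illustration only):

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Query the size of the native (device) binary, then copy it out.
std::vector<uint8_t> getNativeBinary(ze_module_handle_t module) {
  size_t size = 0;
  zeModuleGetNativeBinary(module, &size, nullptr);        // 1st call: size only
  std::vector<uint8_t> binary(size);
  zeModuleGetNativeBinary(module, &size, binary.data());  // 2nd call: copy data
  return binary;
}
```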

@alexbaden
Contributor

Yes, L0 has APIs we should be able to leverage, assuming they work.
The difference is in the way the compiler/driver works. For NVIDIA, they can call ptxas to assemble the PTX into a cubin and then pass the cubin to their runtime. For us, we actually compile the SPIR-V to machine code during the driver stage. So I need to either lift the compilation of SPIR-V to native binary out of the driver and into the compiler, or find a way to get the paths to the driver without breaking Triton layering.

@alexbaden
Contributor

I wanted to look into this to see if it could be related to #1721, but the numbers don't quite match, so I am not optimistic. Still, this could be a nice win for us, as compilation can take 100-300 ms, especially if there are register spills and we recompile.

@etaf

etaf commented Aug 8, 2024

I think this may be the solution (see the sketch after these steps):

1. We can retrieve the native binary from the Level Zero module using zeModuleGetNativeBinary.
   Here is an example: https://github.com/oneapi-src/oneDNN/blob/2e7b691217ff17497aebd7e565fa1701f8a42396/src/gpu/intel/sycl/utils.cpp#L211

2. Then, to reconstruct the L0 module in deployment, create the Level Zero module by setting ze_module_format_t to ZE_MODULE_FORMAT_NATIVE in zeModuleCreate.
   Here is an example: https://github.com/oneapi-src/oneDNN/blob/2e7b691217ff17497aebd7e565fa1701f8a42396/src/gpu/intel/sycl/l0/utils.cpp#L184
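
A minimal sketch of the loading side, modeled on the oneDNN example above; `context` and `device` are assumed to be valid Level Zero handles, the function name is illustrative, and checking of the `ze_result_t` return values is omitted:

```cpp
#include <level_zero/ze_api.h>
#include <cstdint>
#include <vector>

// Recreate a Level Zero module from a previously saved native binary,
// skipping the SPIR-V -> device code compilation at deployment time.
ze_module_handle_t loadNativeModule(ze_context_handle_t context,
                                    ze_device_handle_t device,
                                    const std::vector<uint8_t> &binary) {
  ze_module_desc_t desc = {};
  desc.stype = ZE_STRUCTURE_TYPE_MODULE_DESC;
  desc.format = ZE_MODULE_FORMAT_NATIVE;  // input is native ISA, not SPIR-V IL
  desc.inputSize = binary.size();
  desc.pInputModule = binary.data();

  ze_module_handle_t module = nullptr;
  zeModuleCreate(context, device, &desc, &module, /*phBuildLog=*/nullptr);
  return module;
}
```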

@alexbaden
Contributor

I have a prototype working. The Level Zero APIs are the easy part; we have to make significant changes to our Triton compilation flow to fit this into Triton's architecture. Fortunately, I think I can adjust the compilation flow while preserving the existing Triton layering. I will clean up my prototype and post it as a draft PR for review tomorrow.

@vlad-penkin
Contributor Author
