
[Torch Inductor] IPEX uses a conservative configuration for Triton on Intel GPU #759

Open
chengjunlu opened this issue Mar 27, 2024 · 3 comments

Comments

@chengjunlu
Contributor

IPEX uses a conservative num_warps for Triton on Intel GPU. Unlike NVIDIA, Intel GPU supports num_warps up to 64.

As a result, Torch Inductor may choose a sub-optimal Triton kernel.

[2024-03-26 09:25:34,485] torch._inductor.triton_heuristics: [DEBUG] Benchmark all input configs get:
[2024-03-26 09:25:34,485] torch._inductor.triton_heuristics: [DEBUG] XBLOCK: 1, num_warps: 2, num_ctas: 1, num_stages: 1: 5.263040, nreg 0, nspill 0, #shared-mem 0
[2024-03-26 09:25:34,485] torch._inductor.triton_heuristics: [DEBUG] XBLOCK: 8, num_warps: 4, num_ctas: 1, num_stages: 1: 0.997920, nreg 0, nspill 0, #shared-mem 0
[2024-03-26 09:25:34,485] torch._inductor.triton_heuristics: [DEBUG] XBLOCK: 32, num_warps: 8, num_ctas: 1, num_stages: 1: 0.689840, nreg 0, nspill 0, #shared-mem 0
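For illustration, here is a minimal Triton autotune sketch of what a wider search space could look like. The copy kernel and the config list are hypothetical, not Inductor's actual ones; num_warps=64 is only expected to be valid on hardware that supports it, such as Intel GPU:

```python
import triton
import triton.language as tl

# Hypothetical search space extended past num_warps=8. On Intel GPU,
# num_warps up to 64 could be tried; Inductor's pointwise heuristics
# stop at 8 as shown in the log above.
@triton.autotune(
    configs=[
        triton.Config({"XBLOCK": 32}, num_warps=8, num_stages=1),
        triton.Config({"XBLOCK": 64}, num_warps=16, num_stages=1),
        triton.Config({"XBLOCK": 128}, num_warps=32, num_stages=1),
        triton.Config({"XBLOCK": 256}, num_warps=64, num_stages=1),
    ],
    key=["n_elements"],
)
@triton.jit
def copy_kernel(in_ptr, out_ptr, n_elements, XBLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offsets = pid * XBLOCK + tl.arange(0, XBLOCK)
    mask = offsets < n_elements
    x = tl.load(in_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x, mask=mask)
```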
@riverliuintel

@EikanWang Any comments about this ticket?

@vlad-penkin added the enhancement (New feature or request) label on Apr 17, 2024
@EikanWang
Contributor

We have verified different configurations from the E2E performance perspective, such as enlarging num_warps, but it did not impact E2E performance significantly. In the future, we will utilize Inductor autotune to find a better configuration.

What's the impact on Triton?
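For context, a minimal sketch of requesting Inductor autotuning from user code. torch.compile's "max-autotune" mode is standard PyTorch API; the toy model and the xpu device (registered by IPEX) are placeholders:

```python
import torch
import intel_extension_for_pytorch  # noqa: F401  # registers the "xpu" device

# Placeholder model; "max-autotune" makes Inductor benchmark a wider set of
# Triton configs instead of relying on its default heuristics.
model = torch.nn.Linear(1024, 1024).to("xpu")
compiled = torch.compile(model, mode="max-autotune")
out = compiled(torch.randn(64, 1024, device="xpu"))
```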

@chengjunlu
Contributor Author

chengjunlu commented Apr 25, 2024

The configuration information above is dumped from the Inductor autotune log.
From the log, num_warps only goes up to 8, and 8 gives the best performance among the tried values: 2, 4, 8.
The point is simply that larger num_warps values don't appear to be tried on Intel GPU.

That is fine as long as the performance is good with a small num_warps.
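For reference, a sketch of the knobs that could widen that search and reproduce such a log. These torch._inductor.config flags and the torch._logging call exist in recent PyTorch; whether they change the config space tried on Intel GPU is an assumption:

```python
import logging
import torch
import torch._inductor.config as inductor_config

# Benchmark a larger pointwise config space instead of the default heuristic pick.
inductor_config.max_autotune_pointwise = True
# Let coordinate-descent tuning explore neighboring (XBLOCK, num_warps) values.
inductor_config.coordinate_descent_tuning = True

# Surface per-config benchmark lines like the ones quoted above.
torch._logging.set_logs(inductor=logging.DEBUG)
```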
