Change hash to not require ptxas #2476

Merged — 3 commits merged into triton-lang:main on Oct 17, 2023

Conversation

@manbearian (Collaborator)

I noticed that Triton is using the ptxas version as part of the version hash even for non-CUDA targets. This is an attempt at fixing that. Moving the version calculation to the back end makes sense to me from an architectural standpoint, so that's my approach here. I'm not as confident in the implementation, so please let me know if you have any feedback.
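A minimal sketch of the direction described above, under the assumption of a per-backend version-key hook; `device_backend.get_version_key()`, `fn.cache_key`, and the key layout are illustrative names, not Triton's actual API:

```python
import hashlib

# Sketch: each backend supplies its own version key, so only the CUDA path
# ever needs ptxas. Only get_cuda_version_key mirrors the existing code.
def make_hash(fn, target, env_vars, device_backend=None, **kwargs):
    if device_backend is None:
        version_key = get_cuda_version_key()             # existing ptxas-derived key
    else:
        version_key = device_backend.get_version_key()   # assumed backend hook
    key = f"{version_key}-{fn.cache_key}-{sorted(env_vars.items())}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```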

@manbearian manbearian requested a review from ptillet as a code owner October 10, 2023 00:00
@manbearian manbearian requested a review from EikanWang October 10, 2023 15:37
@ptillet (Collaborator) commented Oct 13, 2023

I believe the regression is an actual one, and not just noise. Not sure where it comes from though.

@manbearian (Collaborator, Author)

> I believe the regression is an actual one, and not just noise. Not sure where it comes from though.

Yes, it does appear that way. I was sitting on this hoping other PRs might hit similar issues, but nothing has. I'll investigate when I have a chance; this PR isn't exactly high priority.

@manbearian (Collaborator, Author) commented Oct 13, 2023

@ptillet I looked at the failure log some more now that I know it's specific to my change, and I admit I'm even more confused. The error looks like it's indicating that with my changes the A100 results got faster?!

Looking at test_performance.py, the first value is the current result and the second value is the reference result:

test_performance.py:

triton.testing.assert_close(cur_gpu_util, ref_gpu_util, atol=0.02, rtol=0.01)

Results:

=========================== short test summary info ============================
FAILED test_performance.py::test_matmul[1024-1024-1024-float16] - AssertionError:  0.2981213927268982 is not close to 0.35499998927116394 (atol=0.02, rtol=0.01)
FAILED test_performance.py::test_matmul[2048-2048-2048-float16] - AssertionError:  0.5184210538864136 is not close to 0.652999997138977 (atol=0.02, rtol=0.01)
FAILED test_performance.py::test_matmul[64-4096-4096-float16] - AssertionError:  0.13460323214530945 is not close to 0.17000000178813934 (atol=0.02, rtol=0.01)
FAILED test_performance.py::test_matmul[4096-64-4096-float16] - AssertionError:  0.12149130553007126 is not close to 0.1599999964237213 (atol=0.02, rtol=0.01)
FAILED test_performance.py::test_matmul[8192-64-8192-float16] - AssertionError:  0.14076729118824005 is not close to 0.2720000147819519 (atol=0.02, rtol=0.01)

Any more thoughts? I'm going to see if I can debug whether something really did change with the caching, but that seems so unlikely...
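For reference, assuming `assert_close` follows the usual allclose-style tolerance rule (pass iff |cur - ref| <= atol + rtol * |ref|), the first failing case is well outside the tolerance:

```python
# Assumed allclose-style check; values taken from the first failure above.
cur, ref, atol, rtol = 0.2981213927268982, 0.35499998927116394, 0.02, 0.01
tol = atol + rtol * abs(ref)      # ~0.0236
print(abs(cur - ref) <= tol)      # False: |diff| ~0.0569 exceeds 0.0236
```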

@ThomasRaoux (Collaborator)

> The error looks like it's indicating that with my changes the A100 results got faster?!? [...]

Those numbers are utilization numbers (speed / speed of light), so higher is better, which means this points to a regression. Is it possible that there is extra runtime overhead with your changes? A simple CPU profile would usually make it obvious.
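A rough sketch of what such a utilization figure represents; the peak-throughput constant and the exact formula are assumptions for illustration, not necessarily what the test uses:

```python
def matmul_utilization(m, n, k, seconds, peak_tflops=312.0):
    # A matmul does 2*m*n*k FLOPs; peak_tflops is an assumed A100 FP16 peak.
    achieved_tflops = 2 * m * n * k / seconds / 1e12
    return achieved_tflops / peak_tflops  # ratio of achieved speed to "speed of light"
```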

@manbearian (Collaborator, Author)

> Those numbers are utilization numbers (speed / speed of light), so higher is better, which means this points to a regression. Is it possible that there is extra runtime overhead with your changes? A simple CPU profile would usually make it obvious.

Okay, thanks for clearing that up. There shouldn't be any changes here for NVIDIA at all, and only non-impactful changes for non-NVIDIA targets. I plan on stepping through the compiler on Monday to figure out what I've done wrong.

-def make_hash(fn, target, env_vars, **kwargs):
+def make_hash(fn, target, env_vars, device_backend, **kwargs):
+    if device_backend is None:
+        version_key = get_cuda_version_key()
Review comment from a collaborator on the lines above:
I'm wondering if the problem comes from here; this will hash the Triton library every time, which is not cheap.
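If that is the cause, one possible mitigation is to compute the library hash only once per process; a hedged sketch with hypothetical names:

```python
import functools
import hashlib

@functools.lru_cache(maxsize=None)
def library_version_key(lib_path):
    # Hash the backend/library file a single time rather than on every make_hash call.
    with open(lib_path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()
```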

@manbearian (Collaborator, Author) commented Oct 14, 2023 via email

@manbearian manbearian force-pushed the dev/ianb/hash-change branch from b69407e to 3dacee5 Compare October 16, 2023 16:14
@manbearian manbearian force-pushed the dev/ianb/hash-change branch from 36289f6 to 5d0ef2d Compare October 17, 2023 14:57
@ptillet ptillet merged commit 768fc1f into triton-lang:main Oct 17, 2023
pingzhuu pushed a commit to siliconflow/triton that referenced this pull request Apr 2, 2024