Change hash to not require ptxas #2476
Conversation
I believe the regression is an actual one, and not just noise. Not sure where it comes from though.
Yes; it does appear that way... I was sitting on this hoping other PRs might hit similar issues, but nothing has. I'll investigate when I have a chance; this PR isn't exactly high priority.
@ptillet. I looked at the failure log some more now that I know it's specific to my change, and I admit, I'm even more confused. The error looks like it's indicating that with my changes the A100 results got faster?!? Looking at
triton.testing.assert_close(cur_gpu_util, ref_gpu_util, atol=0.02, rtol=0.01) Results:
Any more thoughts? I'm going to see if I can debug to see if something really did change with the caching, but that seems so unlikely...
Those numbers are utilization numbers (speed / speed of light), so higher is better, which means this points to a regression. Is it possible that there is extra runtime overhead with your changes? A simple CPU profile would usually make it obvious.
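As a hedged illustration of that suggestion (the kernel, grid, and arguments below are placeholders, not taken from this PR), one could wrap repeated launches of an already-compiled kernel in cProfile and check whether hashing or cache-key helpers show up near the top:

```python
import cProfile
import pstats

def run_many(launch, iters=1000):
    # Launch the already-compiled kernel repeatedly so that per-launch
    # overhead (e.g. cache-key hashing) dominates the profile.
    for _ in range(iters):
        launch()

profiler = cProfile.Profile()
profiler.enable()
run_many(lambda: my_kernel[grid](x_ptr, y_ptr, out_ptr, n_elements))  # placeholder launch
profiler.disable()

# Sort by cumulative time; version-key or hashing helpers ranking high here
# would point to extra runtime overhead on the launch path.
pstats.Stats(profiler).sort_stats("cumulative").print_stats(20)
```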
Okay, thanks for clearing that up. There shouldn't be any changes here for Nvidia at all, and only a non-impactful change for non-Nvidia targets. I plan on stepping through the compiler on Monday to figure out what I've done wrong.
-def make_hash(fn, target, env_vars, **kwargs):
+def make_hash(fn, target, env_vars, device_backend, **kwargs):
+    if device_backend is None:
+        version_key = get_cuda_version_key()
I'm wondering if the problem comes from here; this will hash the Triton library every time, which is not cheap.
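One way to keep that cost down, sketched here under the assumption that the expensive part is re-hashing the installed library files on every call (the helper below is illustrative, not Triton's actual API), is to memoize the version key so the hash is computed once per process:

```python
import functools
import hashlib
from pathlib import Path

@functools.lru_cache(maxsize=None)
def cached_version_key(package_root: str) -> str:
    # Hypothetical helper: hash the library sources once, then reuse the
    # digest for every subsequent make_hash() call in this process.
    digest = hashlib.sha256()
    for path in sorted(Path(package_root).rglob("*.py")):
        digest.update(path.read_bytes())
    return digest.hexdigest()
```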
Thank you for the quick look. I hadn't considered the cost of the hash. I will review and make sure my changes are neutral with respect to time in addition to correctness.
Force-pushed from b69407e to 3dacee5
Force-pushed from 36289f6 to 5d0ef2d
Force-pushed from 5d0ef2d to 4a3b9e6
I noticed that Triton is using the `ptxas` version as part of the version hash even for non-CUDA targets. This is an attempt to fix that. Moving the version calculation to the back-end makes sense to me from an architectural standpoint, so that's my approach here. I'm not as confident in the implementation, so please let me know if folks have any feedback.
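As a rough sketch of the shape of that change, grounded in the diff shown above (the `get_version_key` backend hook and the key construction are assumptions, not the final implementation):

```python
import hashlib

def get_cuda_version_key():
    # Stand-in for Triton's real helper, which folds the ptxas version
    # into the key; hard-coded here only to keep the sketch self-contained.
    return "cuda-ptxas-version-key"

def make_hash(fn, target, env_vars, device_backend=None, **kwargs):
    if device_backend is None:
        # Default (CUDA) path: keep using the ptxas-based version key.
        version_key = get_cuda_version_key()
    else:
        # Third-party targets provide their own key, so ptxas is never required.
        version_key = device_backend.get_version_key()  # assumed backend hook
    key = f"{version_key}-{fn}-{target}-{sorted(env_vars.items())}-{sorted(kwargs.items())}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()
```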