wrong output of fp32 Matmul #1190

I compare the outputs of https://github.com/openai/triton/blob/main/python/triton/ops/matmul.py and torch.matmul. fp16 works fine, while for fp32 there is a large difference.

Comments
What inputs did you use?

And is it fp32 or tf32?

I use the default initialization of torch.rand, so I think it is fp32.
Well, if you are using a GPU that supports tensor cores, the default should be TF32. Let me be more specific about the "inputs": can you please at least share the M, N, K, and dtype information? It would be even better if you could attach a simple example like https://github.com/openai/triton/blob/main/python/test/unit/operators/test_matmul.py. I do notice that we haven't tested fp32 with tensor cores disabled, so maybe there is indeed something wrong.
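Something along these lines would be enough; the shapes, the tolerances, and the `triton.ops.matmul` entry point below are illustrative assumptions rather than details from the report:

```python
import torch
import triton.ops  # assumes the kernel from python/triton/ops/matmul.py is exposed here

# Illustrative sizes; the actual M, N, K from the report are not known.
M, N, K = 1024, 1024, 1024
dtype = torch.float32

torch.manual_seed(0)
a = torch.randn((M, K), device="cuda", dtype=dtype)
b = torch.randn((K, N), device="cuda", dtype=dtype)

c_triton = triton.ops.matmul(a, b)  # Triton kernel under test
c_torch = torch.matmul(a, b)        # cuBLAS reference

print(dtype, (c_triton - c_torch).abs().max().item())
print(torch.allclose(c_triton, c_torch, rtol=1e-2, atol=1e-3))
```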
More reference: https://pytorch.org/docs/stable/notes/cuda.html
"_matmul.apply" is directly from https://github.com/openai/triton/blob/main/python/triton/ops/matmul.py I have tried both torch.backends.cuda.matmul.allow_tf32 = False and True
The outcome
|
I don't see obvious errors in this case. Note that the matrices you are testing are large, so the accumulated differences are expected to be somewhat larger than you might think. BTW, fp32 should be more accurate, but we haven't optimized its performance yet.
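To make "accumulated differences" concrete, here is a small sketch that measures a plain float32 matmul against a float64 reference as K grows; the sizes and seed are arbitrary, chosen only for illustration:

```python
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # make sure the true fp32 path is used
torch.manual_seed(0)

for K in (64, 256, 1024, 4096):
    a = torch.randn(256, K, device="cuda", dtype=torch.float64)
    b = torch.randn(K, 256, device="cuda", dtype=torch.float64)
    ref = a @ b                                # float64 reference
    approx = (a.float() @ b.float()).double()  # float32 accumulation
    # The max error typically grows with K because more terms are accumulated.
    print(K, (approx - ref).abs().max().item())
```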
To verify, please print …
I use an NVIDIA A40.

```python
import numpy as np
a = torch.rand(1024, 1024).cuda().float()
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = False
```

Outcomes:
So it passed.
@sustcsonglin oh sorry, please remember to add … just to ensure you get deterministic results; it is expected that results differ between each run otherwise.
```python
import torch
import numpy as np
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cuda.matmul.allow_tf32 = False
```

The first case passes torch.allclose(c, c_ref) while the second fails.
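Spelled out in full, the comparison would look roughly like the sketch below; the fixed seed, the second operand `b`, and the `c`/`c_ref` names are assumptions filled in for illustration:

```python
import torch
import triton.ops  # assumed entry point for the kernel in python/triton/ops/matmul.py

torch.manual_seed(0)  # assumed: fixed seed so both runs see the same inputs
a = torch.rand(1024, 1024).cuda().float()
b = torch.rand(1024, 1024).cuda().float()

for allow_tf32 in (True, False):
    torch.backends.cuda.matmul.allow_tf32 = allow_tf32
    c = triton.ops.matmul(a, b)  # Triton kernel under test
    c_ref = torch.matmul(a, b)   # cuBLAS reference
    print(allow_tf32, torch.allclose(c, c_ref), (c - c_ref).abs().max().item())
```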
In that case, you can set …
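If the cut-off suggestion concerned the comparison tolerance (the next reply mentions rtol=0.01), it would be along these lines, with the values chosen only as an example:

```python
# Hypothetical: compare with a looser tolerance than torch.allclose's defaults.
torch.allclose(c, c_ref, rtol=1e-2, atol=1e-3)
```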
Thanks for your patience. The outputs are almost identical when using fp16, and I do not know the reason for such differences when using fp32. Also, I am not sure whether rtol=0.01 is too large.
We use different accumulation logic than cuBLAS. It may be worth investigating the precision, but fp32 is not our focus for now since it is much slower than tf32.
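For illustration, accumulation order alone already changes fp32 results, independent of Triton or cuBLAS; a small sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
y = rng.standard_normal(4096).astype(np.float32)

# The same dot product accumulated in three different orders, all valid in fp32.
sequential = np.float32(0.0)
for xi, yi in zip(x, y):
    sequential += xi * yi

blocked = np.float32(sum(np.dot(x[i:i + 128], y[i:i + 128]) for i in range(0, 4096, 128)))

library = np.dot(x, y)  # BLAS-backed, with its own accumulation order

reference = np.dot(x.astype(np.float64), y.astype(np.float64))
print(sequential, blocked, library)
print(abs(sequential - reference), abs(blocked - reference), abs(library - reference))
```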
Thanks