
FP8 splitgemm user defined triton kernel #263

Merged: 8 commits from msaroufim/fp8splitk into main, May 23, 2024
Conversation

@msaroufim (Member) commented May 22, 2024

$ python test_fp8.py 
..
----------------------------------------------------------------------
Ran 2 tests in 8.207s

This is borrowing the kernel from https://github.com/pytorch-labs/applied-ai/blob/main/kernels/triton/inference/fp8/splitk_gemm_fp8.py
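For context on the technique: split-K partitions the GEMM's reduction (K) dimension across programs so that small-M shapes still fill the GPU; each program computes a partial product over its K-chunk and the partials are reduced, typically via an atomic add into the output. A PyTorch-level sketch of the decomposition (an illustration only, not the Triton kernel itself):

```python
import torch

def splitk_matmul_reference(a: torch.Tensor, b: torch.Tensor, split_k: int = 4) -> torch.Tensor:
    """Reference for the split-K idea: partition the reduction dimension into
    split_k chunks, compute partial products in fp32, and sum the partials."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % split_k == 0
    chunk = k // split_k
    acc = torch.zeros(m, n, device=a.device, dtype=torch.float32)
    for i in range(split_k):
        # each chunk's partial product is accumulated in higher precision,
        # mirroring the fp32 accumulator a Triton GEMM kernel would use
        ai = a[:, i * chunk : (i + 1) * chunk].to(torch.float32)
        bi = b[i * chunk : (i + 1) * chunk, :].to(torch.float32)
        acc += ai @ bi
    return acc
```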

  1. Added a test case for numerics
  2. Added a test for user-defined Triton functions to guarantee this does not graph break (a rough sketch of both checks follows this list)
  3. I had to reduce block sizes, otherwise I was running into OOMs on H100
  4. There's still a bunch we could do with an autotuner
  5. We should also baseline against how much of this we can codegen via torch.compile
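
A rough sketch of the two checks in items 1 and 2; the names and import path are assumptions, not the PR's actual test code (the real tests live in test_fp8.py):

```python
import unittest
import torch
from torchao.prototype.splitk import gemm_split_k  # assumed import path

class TestFP8SplitK(unittest.TestCase):
    def test_numerics(self):
        # Compare against an fp32 matmul baseline. fp8 e4m3 carries roughly
        # 2 decimal digits, so only a loose relative-error bound makes sense.
        m, k, n = 16, 4096, 4096
        a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16)
        b = torch.randn(k, n, device="cuda", dtype=torch.bfloat16)
        out = gemm_split_k(a.to(torch.float8_e4m3fn), b.to(torch.float8_e4m3fn))
        ref = a.float() @ b.float()
        rel_err = (out.float() - ref).norm() / ref.norm()
        self.assertLess(rel_err.item(), 0.1)

    def test_no_graph_break(self):
        # fullgraph=True makes torch.compile raise on any graph break,
        # so simply running the compiled kernel is the assertion.
        compiled = torch.compile(gemm_split_k, fullgraph=True)
        a = torch.randn(16, 4096, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
        b = torch.randn(4096, 4096, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
        compiled(a, b)

if __name__ == "__main__":
    unittest.main()
```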

We had an issue with testing this in our CI, but it works fine locally and in PyTorch CI: pytorch/pytorch#126982

pytorch-bot (bot) commented May 22, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/263

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 113ecde with merge base 5e28109:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label May 22, 2024
@msaroufim (Member, Author) commented

This is passing locally on nightlies but failing in CI; I'll check what's up. I suspect it's some caching issue with compile.

@msaroufim msaroufim requested a review from oulgen May 23, 2024 15:47
@msaroufim msaroufim merged commit cb8b651 into main May 23, 2024
13 checks passed
@msaroufim msaroufim deleted the msaroufim/fp8splitk branch May 23, 2024 20:56
@vkuzo (Contributor) commented May 24, 2024

This is interesting; I'm wondering if we have benchmarks / accuracy data on how this compares to the cuBLAS float8 gemm?

@msaroufim (Member, Author) commented

I don't believe the announcement baselined against cuBLAS, but perhaps @AdnanHoque can shed some more detail.

@AdnanHoque commented

Hey! Yes, we did compare to cuBLAS and saw gains in the small-M (M <= 16) regime.

Here were the speedups (all NCU timings) with the original tile sizes, on H100-SXM:

[Screenshot, 2024-05-24: table of NCU-measured speedups vs cuBLAS on H100-SXM]
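For anyone trying to reproduce a comparison like this, here is a rough sketch using triton.testing.do_bench rather than NCU. The gemm_split_k import path is an assumption, and torch._scaled_mm (the cuBLAS float8 path) has changed signature across PyTorch versions, so check your release:

```python
import torch
import triton
from torchao.prototype.splitk import gemm_split_k  # assumed import path

def bench_fp8(m=16, k=8192, n=8192):
    a = torch.randn(m, k, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn)
    # cuBLAS float8 requires the second operand in column-major layout
    b = torch.randn(n, k, device="cuda", dtype=torch.bfloat16).to(torch.float8_e4m3fn).t()
    one = torch.tensor(1.0, device="cuda")
    ms_splitk = triton.testing.do_bench(lambda: gemm_split_k(a, b))
    # note: on older PyTorch releases _scaled_mm returns an (out, amax) tuple
    ms_cublas = triton.testing.do_bench(
        lambda: torch._scaled_mm(a, b, scale_a=one, scale_b=one, out_dtype=torch.bfloat16)
    )
    print(f"M={m}: split-K {ms_splitk:.3f} ms, cuBLAS {ms_cublas:.3f} ms")

if __name__ == "__main__":
    for m in (1, 2, 4, 8, 16):
        bench_fp8(m=m)
```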

dbyoung18 pushed a commit to dbyoung18/ao that referenced this pull request Jul 31, 2024
* FP8 splitgemm user defined triton kernel

* yolo

* Trigger CI

* yolo

* yolo

* yolo

* yolo

* Update test_fp8.py