Add hl.wait & AllGather Matmul example (ptx impl). #189

Draft: wants to merge 1 commit into main

joydddd (Contributor) commented Jun 16, 2025

Stacked PRs:


Add hl.wait & AllGather Matmul example (ptx impl).

joydddd added a commit that referenced this pull request Jun 16, 2025
stack-info: PR: #189, branch: joydddd/stack/5
joydddd mentioned this pull request Jun 16, 2025
facebook-github-bot added the "CLA Signed" label Jun 16, 2025
joydddd changed the title from "Add & . (ptx impl)." to "[WIP] Add hl.signal & hl.wait (ptx impl)." Jun 16, 2025
joydddd changed the base branch from joydddd/stack/4 to main June 17, 2025 21:04
joydddd added a commit that referenced this pull request Jun 17, 2025
stack-info: PR: #189, branch: joydddd/stack/5
joydddd changed the title from "[WIP] Add hl.signal & hl.wait (ptx impl)." to "Add & . (ptx impl)." Jun 17, 2025
joydddd changed the title from "Add & . (ptx impl)." to "Add tl.wait (ptx impl)" Jun 20, 2025
joydddd changed the title from "Add tl.wait (ptx impl)" to "Add hl.wait (ptx impl)" Jun 20, 2025
joydddd (Contributor, Author) commented Jun 24, 2025

All Gather Matmul Performance, 8xH100

examples/all_gather_matmul.py

| shape | dtype | nccl | torch_symm_mem | triton | helion | Speedup over nccl | Best Backend |
|---|---|---|---|---|---|---|---|
| (256, 6656, 4096) | torch.bfloat16 | 240.576 | 509.696 | 274.336 | 272.864 | 1.000 | nccl |
| (256, 6656, 8192) | torch.bfloat16 | 481.600 | 545.568 | 560.544 | 527.552 | 1.000 | nccl |
| (256, 6656, 16384) | torch.bfloat16 | 933.056 | 1309.664 | 1256.640 | 1197.248 | 1.000 | nccl |
| (256, 6656, 32768) | torch.bfloat16 | 1852.320 | 2292.416 | 2669.696 | 4431.936 | 1.000 | nccl |
| (512, 6656, 4096) | torch.bfloat16 | 1772.864 | 1603.072 | 1532.064 | 1721.760 | 1.157 | triton |
| (512, 6656, 8192) | torch.bfloat16 | 2296.832 | 1596.864 | 2039.968 | 1053.984 | 2.179 | helion |
| (512, 6656, 16384) | torch.bfloat16 | 2951.552 | 5618.752 | 4682.240 | 2415.680 | 1.222 | helion |
| (512, 6656, 32768) | torch.bfloat16 | 13161.760 | 6228.160 | 10848.480 | 12085.920 | 2.113 | torch_symm_mem |
| (1024, 6656, 4096) | torch.bfloat16 | 3556.832 | 2825.600 | 2733.440 | 3177.760 | 1.301 | triton |
| (1024, 6656, 8192) | torch.bfloat16 | 6641.632 | 3352.672 | 4088.736 | 3881.216 | 1.981 | torch_symm_mem |
| (1024, 6656, 16384) | torch.bfloat16 | 4735.712 | 10557.312 | 12627.840 | 14168.864 | 1.000 | nccl |
| (1024, 6656, 32768) | torch.bfloat16 | 25345.888 | 25859.425 | 27778.080 | 34090.591 | 1.000 | nccl |
| (2048, 6656, 4096) | torch.bfloat16 | 6714.240 | 5583.840 | 5324.352 | 6093.664 | 1.261 | triton |
| (2048, 6656, 8192) | torch.bfloat16 | 12866.528 | 10761.472 | 12755.168 | 16312.288 | 1.196 | torch_symm_mem |
| (2048, 6656, 16384) | torch.bfloat16 | 25339.424 | 28606.527 | 30244.703 | 37253.922 | 1.000 | nccl |
| (2048, 6656, 32768) | torch.bfloat16 | 54389.023 | 62389.664 | 48514.751 | 63082.783 | 1.121 | triton |
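
For readers of the table, the last two columns appear to be derived from the four timing columns: the best backend is the one with the smallest value, and the speedup is the nccl value divided by that minimum. The sketch below is only a reading of the numbers, not the benchmark script from the PR. It assumes the values are latencies (lower is better; the unit is not stated in the comment), and the helper name `best_backend` is hypothetical.

```python
def best_backend(timings, baseline="nccl"):
    """Return (fastest backend, baseline time / fastest time).

    Assumes `timings` maps backend name -> latency, lower is better.
    On a tie, the backend listed first wins.
    """
    best = min(timings, key=timings.get)
    speedup = timings[baseline] / timings[best]
    return best, round(speedup, 3)


# Row (512, 6656, 8192) from the table above.
row = {
    "nccl": 2296.832,
    "torch_symm_mem": 1596.864,
    "triton": 2039.968,
    "helion": 1053.984,
}
print(best_backend(row))  # ('helion', 2.179)
```

Under this reading, a speedup of 1.000 means that nccl itself was the fastest backend for that shape, not that all backends tied.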

stack-info: PR: #189, branch: joydddd/stack/5
joydddd changed the title from "Add hl.wait (ptx impl)" to "Add hl.wait & AllGather Matmul example (ptx impl)." Jun 24, 2025
Labels: CLA Signed (this label is managed by the Meta Open Source bot)
2 participants