
promote blocksparse from prototype, make it faster #1734


Merged
merged 40 commits into main
Feb 19, 2025

Conversation

jcaip
Contributor

@jcaip jcaip commented Feb 19, 2025

This PR promotes block sparsity out of prototype in torchao.

Chiefly, it ports the triton addmm blocksparse kernels over from core and makes several performance improvements to them.
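
For context, the promoted user-facing flow looks roughly like the sketch below. This is illustrative only; the helper name block_sparse_weight and its exact import path are assumptions here, not necessarily the final torchao API.

import torch
from torchao.sparsity import sparsify_, block_sparse_weight  # names assumed for illustration

# Replace eligible Linear weights with block-sparse (BSR) tensors backed by
# the ported triton addmm blocksparse kernels; 64x64 blocks as benchmarked below.
model = torch.nn.Sequential(torch.nn.Linear(8192, 8192)).half().cuda()
sparsify_(model, block_sparse_weight(blocksize=64))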

All of the numbers reported below are for an H100, with blocksize=64 and sparsity_level=0.9. The dense baseline is 134 tok/s.
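
For reference, blocksize=64 with sparsity_level=0.9 means roughly 90% of the 64x64 blocks in each weight are zero. A minimal sketch of building such a weight and storing it in PyTorch's BSR layout (shapes are illustrative):

import torch

M, K, blocksize, sparsity_level = 8192, 8192, 64, 0.9

# Keep ~10% of the 64x64 blocks, zero the rest, then store only the non-zero blocks in BSR form.
block_mask = torch.rand(M // blocksize, K // blocksize) > sparsity_level
mask = block_mask.repeat_interleave(blocksize, 0).repeat_interleave(blocksize, 1)
w = torch.randn(M, K, dtype=torch.bfloat16) * mask
w_bsr = w.to_sparse_bsr(blocksize=(blocksize, blocksize))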

  1. Adds padding support to the triton kernel for dense matrices with a dimension < 16, like those we run into during decoding (see the padding sketch after this list). (214 -> 218 tok/s)
  2. Changes the default num_stages parameter from 1 to 4. This has a large effect on performance; the default kernel autotuning either does not touch this parameter or deems it unimportant for some reason. (218 -> 263 tok/s)
  3. Adds an env var, BSR_AUTOTUNE, that users can set if they want to run kernel autotuning on top of the default parameters (see the autotuning sketch below). (263 -> 266 tok/s) This seems to matter more for compute-bound bs=n workloads, where I see a reduction from 0.3855 s to 0.3745 s on bs=8192 prefill (roughly 3%).
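
For (1), the padding approach is conceptually: zero-pad the short dense operand up to the kernel's minimum size, run the matmul, and slice the padding back off. A runnable sketch (a plain dense matmul stands in for the BSR kernel, and the minimum dimension of 16 is taken from the description above):

import torch
import torch.nn.functional as F

MIN_DIM = 16  # minimum dense dimension handled by the kernel, per the description above

def pad_then_matmul(x: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    # During decoding the activation often has just 1 row; pad it up to MIN_DIM rows.
    rows = x.shape[0]
    if rows < MIN_DIM:
        x = F.pad(x, (0, 0, 0, MIN_DIM - rows))  # (last dim: 0, 0; row dim: 0, MIN_DIM - rows)
    out = x @ w.t()          # stand-in for the block-sparse addmm
    return out[:rows]        # drop the padded rows from the result

x = torch.randn(1, 4096)      # e.g. a single decode token
w = torch.randn(4096, 4096)
y = pad_then_matmul(x, w)     # y.shape == (1, 4096)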

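For (2) and (3), the shape of the change looks roughly like the following: the default launch config pins num_stages=4, and setting the BSR_AUTOTUNE env var (e.g. BSR_AUTOTUNE=1) opts into autotuning over a wider set of configs. The tile sizes and search space below are illustrative, not the kernel's actual parameters.

import os
import triton

# Default config used when autotuning is off; num_stages=4 is the important part.
_DEFAULT = triton.Config({"BLOCK_M": 64, "BLOCK_N": 64}, num_stages=4, num_warps=4)

# Extra configs to search only when the user opts in via BSR_AUTOTUNE.
_EXTRA = [
    triton.Config({"BLOCK_M": 64, "BLOCK_N": n}, num_stages=s, num_warps=w)
    for n in (64, 128)
    for s in (2, 3, 4, 5)
    for w in (4, 8)
]

_CONFIGS = [_DEFAULT] + (_EXTRA if os.getenv("BSR_AUTOTUNE") else [])
# These would then be handed to @triton.autotune(configs=_CONFIGS, key=["M", "N", "K"])
# on the bsr addmm kernel.
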
So in total we are seeing a 1.985x speedup over the dense baseline (134 -> 266 tok/s) 🚀

I've also updated the documentation so it no longer references prototype. I'm planning on updating the diagram in a subsequent PR.

Testing

I added a new test case for the padded inputs and moved the test file out of prototype.

python test/sparsity/test_sparse_api.py

Benchmarking

export CHECKPOINT_PATH=../../../checkpoints # path to checkpoints folder
export MODEL_REPO=meta-llama/Meta-Llama-3.1-8B

python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --prefill_size 8192 --profile baseline_prefill
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --prefill_size 8192 --sparsity bsr --profile bsr_prefill
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --profile baseline
python generate.py --checkpoint_path $CHECKPOINT_PATH/$MODEL_REPO/model.pth --compile --compile_prefill --write_result benchmark_results.txt --sparsity bsr --profile bsr


pytorch-bot bot commented Feb 19, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/1734

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Pending

As of commit dd500a4 with merge base 79ac44e:

NEW FAILURE - The following job has failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot facebook-github-bot added the CLA Signed label Feb 19, 2025
@jcaip jcaip marked this pull request as ready for review February 19, 2025 02:48
@jcaip jcaip added the sparsity, topic: bc-breaking, and performance labels Feb 19, 2025
@jcaip jcaip changed the title Jcaip/blocksparse updates promote blocksparse from prototype, make it faster Feb 19, 2025
@vkuzo
Contributor

vkuzo commented Feb 19, 2025

might be good to consider getting the changes from #1690 in here since you are making a major API change; it will save you a migration in the future.

@jcaip
Contributor Author

jcaip commented Feb 19, 2025

Ah yes, that's a good idea. I'll open a subsequent PR and update all of the sparsity APIs.

@jcaip jcaip merged commit ceceea5 into main Feb 19, 2025
16 of 17 checks passed