[Misc] Use torch.compile for basic custom ops #7110
base: main
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI, as it is required to merge (or just use auto-merge).
Hi @WoosukKwon, this PR looks good! I have verified it on the CPU backend and it worked well with and without multiprocessing.

To make this PR work on the CPU backend, we should add `triton >= 3.0.0` to `requirements-cpu.txt` to avoid an import error. It looks like a bug in torch 2.4.
vllm/model_executor/custom_op.py (Outdated)

```python
        """
        if not self._is_compiled and not envs.VLLM_TEST_DYNAMO_GRAPH_CAPTURE:
            self.forward_static = torch.compile(  # type: ignore
                self.forward_static,
```
Setting `dynamic=True` explicitly can reduce recompilations caused by the dynamic batch size. Maybe CUDA behaves similarly.
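A minimal sketch of the idea (the op and shapes below are illustrative, not the vLLM code): with `dynamic=True`, Dynamo treats the batch dimension as symbolic, so varying token counts reuse the same compiled graph instead of triggering recompilation.

```python
import torch
import torch.nn.functional as F

def silu_and_mul(x: torch.Tensor) -> torch.Tensor:
    # Split the last dimension in half and apply SiLU-gated multiplication.
    d = x.shape[-1] // 2
    return F.silu(x[..., :d]) * x[..., d:]

# dynamic=True marks input shapes as symbolic, so changing the batch size
# (e.g., 8 -> 16 tokens) reuses the same compiled graph instead of recompiling.
compiled = torch.compile(silu_and_mul, dynamic=True)

out_small = compiled(torch.randn(8, 256))
out_large = compiled(torch.randn(16, 256))  # no recompilation expected
```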
vllm/model_executor/custom_op.py (Outdated)

```python
            self.forward_static = torch.compile(  # type: ignore
                self.forward_static,
                options={
                    "fx_graph_cache": True,
```
"fx_graph_cache": True, |
This option causes lock contention when using multiprocessing.
Maybe you can set a per-process fx_graph_cache by setting the env var `TORCHINDUCTOR_CACHE_DIR`. See: https://pytorch.org/tutorials/recipes/torch_compile_caching_tutorial.html
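A hedged sketch of that suggestion (the helper name and cache location are assumptions, not vLLM code): each worker process points Inductor at its own cache directory before the first `torch.compile` call, so concurrent workers do not contend on one on-disk FX graph cache.

```python
import os
import tempfile

import torch

def _set_per_process_inductor_cache() -> None:
    # Give each process its own Inductor cache directory, keyed by PID,
    # so multiprocessing workers do not fight over the same cache lock.
    cache_dir = os.path.join(tempfile.gettempdir(), f"inductor_cache_{os.getpid()}")
    os.environ["TORCHINDUCTOR_CACHE_DIR"] = cache_dir

_set_per_process_inductor_cache()

# Any subsequent torch.compile call in this process writes to the per-process cache.
compiled_relu = torch.compile(torch.nn.functional.relu)
print(compiled_relu(torch.randn(4)))
```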
Removed. I think we can explore caching in a future PR.
```python
    @staticmethod
    def forward_static(x: torch.Tensor) -> torch.Tensor:
```
Does this completely eliminate the custom silu_and_mul kernel? If so, should it be removed from csrc?
Ditto for the rest of the custom activation ops.
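For context, the native PyTorch path that torch.compile would fuse looks roughly like this (a sketch, not a verbatim copy of vLLM's SiluAndMul):

```python
import torch
import torch.nn.functional as F

class SiluAndMul(torch.nn.Module):
    """Gated SiLU activation: split the last dim in half, apply SiLU to the
    first half, and multiply it with the second half."""

    @staticmethod
    def forward_static(x: torch.Tensor) -> torch.Tensor:
        d = x.shape[-1] // 2
        return F.silu(x[..., :d]) * x[..., d:]

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # torch.compile can fuse the slice, SiLU, and multiply into one kernel,
        # which is why the hand-written CUDA kernel may become redundant.
        return self.forward_static(x)
```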
That's a good question. I think we can delete most of them, while leaving some (e.g., in `csrc/legacy`) for potential future use?
```python
    ) -> Tuple[torch.Tensor, torch.Tensor]:
        # forward_native() is too complex to be optimized by torch.compile.
        # Fall back to the custom C++ kernel.
        return self.forward_cuda(positions, query, key, offsets)
```
Isn't it weird that forward_cpu calls into forward_cuda? 🤔
Maybe rename it, if it dispatches to the C++ CPU kernel as well?
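To illustrate the naming concern (a simplified sketch; the real dispatch in vLLM has more backends, and `forward_cpp_kernel` is a hypothetical name, not an actual method): `forward_cpu` falling through to `forward_cuda` works when the underlying C++ op is registered for both devices, but a more neutral name would read better.

```python
import torch

class CustomOp(torch.nn.Module):
    """Illustrative base class: dispatch to a per-backend implementation."""

    def forward(self, *args, **kwargs):
        if torch.cuda.is_available():
            return self.forward_cuda(*args, **kwargs)
        return self.forward_cpu(*args, **kwargs)

    def forward_cuda(self, *args, **kwargs):
        raise NotImplementedError

    def forward_cpu(self, *args, **kwargs):
        # Assume the bound C++ op handles both CPU and CUDA tensors, so reuse
        # the "CUDA" path. A name like forward_cpp_kernel would make that
        # intent clearer, which is the renaming suggestion above.
        return self.forward_cuda(*args, **kwargs)
```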
@WoosukKwon can we expect this PR to be merged soon? Is there anything we can do to help it get merged?
This PR introduces torch.compile for the following basic custom ops: activations and RMSNorm. The main goals are:
- Optimize ops such as GemmaRMSNorm with torch.compile. This leads to a 20% throughput improvement for Gemma2-27B on 1xH100.
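Sketch of the overall pattern, under the assumption that it mirrors the diff above (simplified, not the exact merged code): a shape-agnostic `forward_static` is compiled lazily on first call.

```python
import torch

class CompiledCustomOp(torch.nn.Module):
    """Wrap a shape-agnostic static forward in torch.compile on first use."""

    def __init__(self) -> None:
        super().__init__()
        self._is_compiled = False

    @staticmethod
    def forward_static(x: torch.Tensor) -> torch.Tensor:
        # Example op: RMSNorm-style normalization (weights omitted for brevity).
        return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + 1e-6)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self._is_compiled:
            # Compile lazily so importing the module stays cheap; the compiled
            # function shadows the staticmethod on this instance.
            self.forward_static = torch.compile(self.forward_static)  # type: ignore
            self._is_compiled = True
        return self.forward_static(x)
```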