[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 #3389
Conversation
vllm/model_executor/layers/fused_moe/configs/E=8,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json
Thanks for the refactorings! While you are touching this code, one thing that would be wonderful to do is to keep track of the timings for the best configuration for each batch size, e.g. by writing them to a separate file. That would allow you to decide whether a new configuration is better than the old one. Also note that running the script as-is will likely not produce optimal results in some settings, since there is a bunch of parameter pruning going on at the moment (e.g. for the batch size). Sometimes it is important to look at the values found and then expand the search space if it runs into the boundaries :)
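A minimal sketch of what recording those timings could look like (the helper names and file layout here are hypothetical, not something this PR implements):

```python
import json

def save_tuning_results(best_configs: dict, best_times_us: dict, path: str) -> None:
    """Persist, per batch size, the winning config and its measured latency (us),
    so a later tuning run can check whether its new config actually improves."""
    record = {
        str(bs): {"config": best_configs[bs], "time_us": best_times_us[bs]}
        for bs in best_configs
    }
    with open(path, "w") as f:
        json.dump(record, f, indent=2)

def is_improvement(new_times_us: dict, old_path: str) -> dict:
    """Compare freshly measured latencies against a previously saved file."""
    with open(old_path) as f:
        old = json.load(f)
    return {
        bs: new_times_us[bs] < old[str(bs)]["time_us"]
        for bs in new_times_us
        if str(bs) in old
    }
```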
@pcmoritz This manual kernel tuning is kind of temporary. Going forward, we plan to use triton.autotune.
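For context, this is roughly what the triton.autotune route looks like (a sketch with a placeholder kernel body and example config values, not code from this PR):

```python
import triton
import triton.language as tl

@triton.autotune(
    configs=[
        triton.Config({"BLOCK_SIZE_M": 64, "BLOCK_SIZE_N": 64, "BLOCK_SIZE_K": 32,
                       "GROUP_SIZE_M": 8}, num_warps=4, num_stages=4),
        triton.Config({"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32,
                       "GROUP_SIZE_M": 16}, num_warps=8, num_stages=4),
    ],
    key=["M", "N", "K"],  # re-benchmark the candidates whenever these change
)
@triton.jit
def _dummy_kernel(out_ptr, M, N, K,
                  BLOCK_SIZE_M: tl.constexpr, BLOCK_SIZE_N: tl.constexpr,
                  BLOCK_SIZE_K: tl.constexpr, GROUP_SIZE_M: tl.constexpr):
    # Placeholder body: a real MoE kernel would use the block sizes above.
    pid = tl.program_id(axis=0)
    tl.store(out_ptr + pid, pid)
```

Triton benchmarks every candidate config the first time it sees a new key (M, N, K) and caches the winner in memory for the rest of the process, which is why an un-cached first run is slow.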
"2048": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 4}, | ||
"3072": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 1, "num_warps": 8, "num_stages": 4}, | ||
"4096": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4} | ||
"1": { |
The configuration here is quite different from the one we have right now. Could you compare the old and new ones by benchmarking the end-to-end performance (e.g., using benchmark_throughput.py on Mixtral)?
Also, it'd be nice if you could benchmark the other configs as well, if not all of them.
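As a side note on how these per-batch-size JSON files might be consumed at run time, here is a sketch of the lookup (a hypothetical helper; the actual logic in fused_moe.py may differ, e.g. in how it falls back for untuned batch sizes):

```python
import json

def load_moe_config(path: str, batch_size: int) -> dict:
    """The tuned JSON maps batch sizes (as strings) to kernel configs;
    for an untuned batch size, fall back to the closest tuned one."""
    with open(path) as f:
        configs = {int(k): v for k, v in json.load(f).items()}
    if batch_size in configs:
        return configs[batch_size]
    closest = min(configs, key=lambda bs: abs(bs - batch_size))
    return configs[closest]

# e.g. load_moe_config(
#     "E=8,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json", batch_size=3000)
```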
iirc, LLaMA models don't use MoE. Do you mean Mixtral models?
Yep, I mean Mixtral, not LLaMA.
Do you have some common setting for me to run the throughput test? Otherwise I'm blindly running python benchmarks/benchmark_throughput.py --input-len 100 --output-len 100. Not sure if 100 input tokens and 100 output tokens is the case people care about most.
I will use python benchmarks/benchmark_throughput.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 from #2293 (comment).
In my experience triton.autotune is far too slow to be useful (unless the configs have already been run / are cached) :)
Command and H100 GPU results: (benchmark output elided)
A100 GPU: TODO (I don't have an 8*A100 GPU machine at hand right now).
@WoosukKwon the benchmarking results are quite promising!
Will definitely try to cache tuned configs!
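Judging by the file paths in this PR, the cached configs are keyed by expert count, the tuned weight dimension, and device name; here is a sketch of deriving such a filename (hypothetical helper, not necessarily the PR's exact code):

```python
import torch

def moe_config_filename(e: int, n: int) -> str:
    """Build a config filename matching the pattern of the files added here,
    e.g. E=8,N=1792,device_name=NVIDIA_A100-SXM4-80GB.json
    (E = number of experts, N = the tuned weight dimension)."""
    device_name = torch.cuda.get_device_name().replace(" ", "_")
    return f"E={e},N={n},device_name={device_name}.json"
```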
@youkaichao Awesome! Could you 1) update the PR with the current main and 2) fix the lint error?
@WoosukKwon lint is good now 👌
LGTM! Thanks for the PR! Excited about the performance improvement!
cc @WoosukKwon