[Kernel] change benchmark script so that result can be directly used; tune moe kernel in A100/H100 with tp=2,4,8 #3389
Thanks for the refactorings! While you are touching this code, one thing that would be wonderful is to keep track of the timings of the best configuration for each batch size, e.g. by writing them to a separate file. That would let you decide whether a new configuration is better than the old one. Also note that running the script as-is will likely not produce optimal results in some settings, since a bunch of parameter pruning is going on at the moment (e.g. for the batch size). Sometimes it is important to look at the values found and then expand the search space if it runs into the boundaries :)
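For illustration, the timing bookkeeping suggested above could look roughly like the sketch below. It is not code from this PR: the function names `update_best_timings` and `params_at_boundary`, the JSON layout, and the `time_us` field are all hypothetical, and it assumes the tuning script can report one timing per (batch size, config) pair.

```python
import json
from pathlib import Path


def update_best_timings(path, batch_size, config, time_us):
    """Store (config, time_us) for batch_size if it beats the recorded timing.

    Hypothetical helper: returns True when the stored entry was replaced.
    """
    path = Path(path)
    best = json.loads(path.read_text()) if path.exists() else {}
    key = str(batch_size)  # JSON object keys are always strings
    previous = best.get(key)
    if previous is not None and previous["time_us"] <= time_us:
        return False  # the old configuration is at least as fast, keep it
    best[key] = {"config": config, "time_us": time_us}
    path.write_text(json.dumps(best, indent=2, sort_keys=True))
    return True


def params_at_boundary(config, search_space):
    """List parameters whose chosen value is the smallest or largest candidate.

    A non-empty result hints that the search space should be widened.
    """
    return sorted(
        name
        for name, candidates in search_space.items()
        if len(candidates) > 1
        and config[name] in (min(candidates), max(candidates))
    )
```

With a file like this next to the tuned configs, a re-run only overwrites an entry when the new timing is strictly lower, and any parameter reported by `params_at_boundary` is a candidate for a wider search range.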
@pcmoritz This manual kernel tuning is kind of temporary. Going forward, we plan to use
"2048": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 128, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 16, "num_warps": 4, "num_stages": 4},
"3072": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 1, "num_warps": 8, "num_stages": 4},
"4096": {"BLOCK_SIZE_M": 128, "BLOCK_SIZE_N": 256, "BLOCK_SIZE_K": 32, "GROUP_SIZE_M": 16, "num_warps": 8, "num_stages": 4}
"1": {
The configuration here is quite different from the one we have right now. Could you compare the old and new ones by benchmarking the end-to-end performance (e.g., using benchmark_throughput.py on Mixtral)?
Also, it'd be nice if you can benchmark other configs as well, if not all.
iirc, LLaMA models don't use MoE. Do you mean mixtral models?
Yep I mean mixtral, not llama
Do you have a common setting for me to run the throughput test with? Otherwise I'm blindly running python benchmarks/benchmark_throughput.py --input-len 100 --output-len 100. I'm not sure whether 100 input tokens and 100 output tokens is the case people care about most.
I will use python benchmarks/benchmark_throughput.py --model=mistralai/Mixtral-8x7B-Instruct-v0.1 --input-len 1000 --output-len 50 from #2293 (comment).
In my experience triton.autotune is far too slow to be useful (unless the configs have already been run / are cached) :)
command:
H100 GPU:
A100 GPU: TODO (don't have 8*A100 GPUs at hand now)
@WoosukKwon benchmarking results are quite promising!
Will definitely try to cache tuned configs!
@youkaichao Awesome! Could you 1) update the PR with the current main and 2) fix the lint error by running
@WoosukKwon lint is good now 👌
… tune moe kernel in A100/H100 with tp=2,4,8 (vllm-project#3389)
cc @WoosukKwon