Hi, I have a question about the dispatch warp limit in the elastic dispatch path.
In csrc/kernels/elastic/dispatch.hpp, num_dispatch_warps is limited by:
math::ceil_div(512, num_sms)
so the total number of dispatch/sender warps (the per-SM count times num_sms) is kept around 512.
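
To make the arithmetic concrete, here is a small sketch of how I read the limit. It is not the code from dispatch.hpp: ceil_div below is my own stand-in for math::ceil_div (assuming plain ceiling division), and warps_per_sm_cap and total_warp_cap are hypothetical helper names. It also assumes the bound is applied per SM and that every SM runs at the cap.

```cpp
// Stand-in for math::ceil_div; assumes ordinary ceiling division on positive ints.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Per-SM cap on dispatch warps, as quoted above (hypothetical helper name).
constexpr int warps_per_sm_cap(int num_sms) { return ceil_div(512, num_sms); }

// Upper bound on dispatch warps across the whole grid,
// assuming every SM launches the maximum allowed (hypothetical helper name).
constexpr int total_warp_cap(int num_sms) {
    return warps_per_sm_cap(num_sms) * num_sms;
}
```

With these definitions, total_warp_cap(64) is exactly 512, while total_warp_cap(24) and total_warp_cap(132) are both 528 and total_warp_cap(100) is 600, so the total stays within roughly one extra warp per SM of 512 rather than at 512 exactly.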
I tried removing this limit and observed that performance can degrade when using more SMs. For example, in a single-node NVLink setup (Ranks: 1 x 8), increasing num_sms caused dispatch bandwidth to drop instead of improve.
Do you know the main reason for this 512-warp limit? Is it due to a specific bottleneck, such as TMA/proxy engine pressure, NVLink/L2 contention, atomic counter contention, grid synchronization overhead, or something else? Or was 512 chosen mainly as an empirical tuning value based on benchmarking?
Any insight would be appreciated. Thanks!