[Question] Why is elastic dispatch limited to ~512 warps?

Hi, I have a question about the dispatch warp limit in the elastic dispatch path.

In `csrc/kernels/elastic/dispatch.hpp`, `num_dispatch_warps` is limited by:

    math::ceil_div(512, num_sms)

so the total number of dispatch (sender) warps across all SMs stays at roughly 512.
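To make the arithmetic concrete, here is a minimal sketch of what that expression implies, assuming `num_dispatch_warps` is a per-SM count and that `math::ceil_div` is ordinary round-up integer division. The names `ceil_div`, `kWarpBudget`, `warps_per_sm_cap`, and `total_warps_at_cap` are mine for illustration, not identifiers from the repository, and the real code may apply other clamps that are not shown here.

```cpp
// Hypothetical stand-in for the repository's `math::ceil_div`; assumed to be
// round-up integer division for positive operands.
constexpr int ceil_div(int a, int b) { return (a + b - 1) / b; }

// Hypothetical name for the constant 512 in the quoted expression.
constexpr int kWarpBudget = 512;

// Per-SM cap implied by `math::ceil_div(512, num_sms)`.
constexpr int warps_per_sm_cap(int num_sms) {
    return ceil_div(kWarpBudget, num_sms);
}

// Grid-wide total if every SM runs exactly at its cap.
constexpr int total_warps_at_cap(int num_sms) {
    return warps_per_sm_cap(num_sms) * num_sms;
}

// Because of the round-up, the total is at least 512 and below 512 + num_sms.
static_assert(warps_per_sm_cap(24) == 22 && total_warps_at_cap(24) == 528, "24 SMs");
static_assert(warps_per_sm_cap(64) == 8 && total_warps_at_cap(64) == 512, "64 SMs");
static_assert(warps_per_sm_cap(132) == 4 && total_warps_at_cap(132) == 528, "132 SMs");
```

Under these assumptions, adding SMs does not add sender warps: it only spreads roughly the same 512 warps more thinly (22 per SM at 24 SMs, 4 per SM at 132 SMs). Removing the cap is what lets the total grow with `num_sms`.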

I tried removing this limit and observed that performance can degrade when using more SMs. For example, in a single-node NVLink setup (`Ranks: 1 x 8`), increasing `num_sms` caused dispatch bandwidth to drop instead of improve.

Do you know the main reason for this 512-warp limit? Is it due to a specific bottleneck, such as TMA/proxy engine pressure, NVLink/L2 contention, atomic counter contention, grid synchronization overhead, or something else? Or was 512 chosen mainly as an empirical tuning value based on benchmarking?

Any insight would be appreciated. Thanks!

[Question] Why is elastic dispatch limited to ~512 warps? #636

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

[Question] Why is elastic dispatch limited to ~512 warps? #636

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions