🚀 The feature, motivation and pitch
With expert parallelism, `moe_align_block_size` initially treats all experts as valid and aligns all tokens accordingly. Just before returning, it marks the `expert_ids` entries that do not belong to the current GPU rank as -1 so that the MoE matmuls can skip those blocks.
This is sub-optimal in both memory and performance. The proposal is to recognize/apply `expert_map` before or inside `moe_align_block_size`, so that we allocate less memory and do less work.
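As a rough illustration of the direction (not vLLM's actual kernel), here is a minimal PyTorch sketch. The helper name `align_local_experts` and the tensor shapes are assumptions for the example: it applies `expert_map` to the routed `topk_ids` up front, so only experts owned by the current rank are counted and block-padded, instead of aligning everything and masking with -1 afterwards.

```python
import torch

def align_local_experts(topk_ids: torch.Tensor,
                        expert_map: torch.Tensor,
                        block_size: int):
    """Count and block-pad tokens for local experts only.

    topk_ids:   (num_tokens, top_k) global expert ids from routing.
    expert_map: (num_global_experts,) global -> local expert id,
                -1 for experts not hosted on this rank.
    """
    # Translate global expert ids to this rank's local ids up front.
    local_ids = expert_map[topk_ids]

    # Drop assignments that belong to other ranks instead of carrying
    # them through alignment and masking them with -1 afterwards.
    local_ids = local_ids[local_ids >= 0]

    num_local_experts = int((expert_map >= 0).sum())

    # Tokens routed to each local expert.
    counts = torch.bincount(local_ids, minlength=num_local_experts)

    # Pad each expert's token count up to a multiple of block_size;
    # this is the only padding the MoE matmuls on this rank need.
    padded = (counts + block_size - 1) // block_size * block_size
    return counts, padded


if __name__ == "__main__":
    torch.manual_seed(0)
    num_tokens, top_k = 16, 2
    num_global_experts, block_size = 8, 4
    topk_ids = torch.randint(0, num_global_experts, (num_tokens, top_k))

    # Hypothetical mapping: this rank owns global experts 0-3.
    expert_map = torch.full((num_global_experts,), -1, dtype=torch.long)
    expert_map[:4] = torch.arange(4)

    counts, padded = align_local_experts(topk_ids, expert_map, block_size)
    print("tokens per local expert:", counts.tolist())
    print("padded (block-aligned): ", padded.tolist())
```

The point of the sketch is only sizing: once non-local assignments are filtered out, the sorted/padded buffers only need to cover the local experts' tokens, rather than all experts followed by post-hoc masking.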
Alternatives
No response
Additional context
Related bugfix PR - #19515
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.