🚀 The feature, motivation and pitch
With expert parallelism, `moe_align_block_size` initially treats all experts as valid and aligns all tokens accordingly. Just before returning, it marks the `expert_ids` entries that do not belong to the current GPU rank as -1 so that the MoE matmuls can skip those blocks.
This is sub-optimal in both memory and performance. The proposal is to recognize/apply `expert_map` before or inside `moe_align_block_size`, so that we allocate less memory and do less work.
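As a rough illustration of the direction (not vLLM's actual kernel), here is a minimal PyTorch sketch. The helper name `align_local_experts` and the tensor shapes are assumptions for the example: it applies `expert_map` to the routed `topk_ids` up front, so only experts owned by the current rank are counted and block-padded, instead of aligning everything and masking with -1 afterwards.

```python
import torch

def align_local_experts(topk_ids: torch.Tensor,
                        expert_map: torch.Tensor,
                        block_size: int):
    """Count and block-pad tokens for local experts only.

    topk_ids:   (num_tokens, top_k) global expert ids from routing.
    expert_map: (num_global_experts,) global -> local expert id,
                -1 for experts not hosted on this rank.
    """
    # Translate global expert ids to this rank's local ids up front.
    local_ids = expert_map[topk_ids]

    # Drop assignments that belong to other ranks instead of carrying
    # them through alignment and masking them with -1 afterwards.
    local_ids = local_ids[local_ids >= 0]

    num_local_experts = int((expert_map >= 0).sum())

    # Tokens routed to each local expert.
    counts = torch.bincount(local_ids, minlength=num_local_experts)

    # Pad each expert's token count up to a multiple of block_size;
    # this is the only padding the MoE matmuls on this rank need.
    padded = (counts + block_size - 1) // block_size * block_size
    return counts, padded


if __name__ == "__main__":
    torch.manual_seed(0)
    num_tokens, top_k = 16, 2
    num_global_experts, block_size = 8, 4
    topk_ids = torch.randint(0, num_global_experts, (num_tokens, top_k))

    # Hypothetical mapping: this rank owns global experts 0-3.
    expert_map = torch.full((num_global_experts,), -1, dtype=torch.long)
    expert_map[:4] = torch.arange(4)

    counts, padded = align_local_experts(topk_ids, expert_map, block_size)
    print("tokens per local expert:", counts.tolist())
    print("padded (block-aligned): ", padded.tolist())
```

The point of the sketch is only sizing: once non-local assignments are filtered out, the sorted/padded buffers only need to cover the local experts' tokens, rather than all experts followed by post-hoc masking.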
Alternatives
No response
Additional context
Related bugfix PR - #19515
Before submitting a new issue...
- Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.