
[RFC]: Data Parallel Attention and Expert Parallel MoEs #16037


Description

@tlrmchlsmth

Motivation.

This RFC tracks both DP and EP together, since there is a strong intersection between the two features.

A key motivation for DP attention is reduced memory footprint in MLA models like DeepSeek V2, V3, and R1. If Tensor Parallelism is used in an MLA model, we duplicate the KV cache across GPUs, wasting memory. This reduces our batch size for throughput and high-QPS serving use cases, which kills our performance. To solve this, we need to parallelize attention across GPUs differently: in this RFC, DP attention means request-level parallelism, where requests are distributed across DP ranks and each rank holds the KV cache only for its own requests.
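To make the memory argument concrete, here is a rough, illustrative calculation (not taken from the design doc). The layer count, latent width, and per-GPU KV budget are assumptions loosely based on DeepSeek-V3:

```python
# Back-of-the-envelope KV-cache accounting for an MLA model under TP vs. DP attention.
# Model numbers are assumptions loosely based on DeepSeek-V3 (61 layers,
# 512-d compressed KV latent + 64-d decoupled RoPE key, bf16 cache).
NUM_LAYERS = 61
LATENT_DIM = 512 + 64      # per token, per layer
BYTES_PER_ELEM = 2         # bf16

def kv_bytes_per_token() -> int:
    return NUM_LAYERS * LATENT_DIM * BYTES_PER_ELEM

def cluster_kv_capacity_tokens(kv_budget_per_gpu: int, num_gpus: int,
                               dp_attention: bool) -> int:
    """Total number of *unique* tokens the cluster can cache.

    With TP attention the MLA latent is not sharded across heads, so every rank
    holds a full copy of the cache and aggregate capacity equals one GPU's.
    With DP attention each rank caches only its own requests, so capacity
    scales with the number of GPUs.
    """
    per_gpu = kv_budget_per_gpu // kv_bytes_per_token()
    return per_gpu * num_gpus if dp_attention else per_gpu

budget = 40 * 1024**3  # assume ~40 GiB per GPU left over for KV cache
print(cluster_kv_capacity_tokens(budget, 8, dp_attention=False))  # ~611K tokens
print(cluster_kv_capacity_tokens(budget, 8, dp_attention=True))   # ~4.9M tokens
```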

EP is Expert Parallelism for MoEs: experts are distributed across EP ranks, and each token is dispatched to the GPUs holding the experts it is routed to. This is in contrast to Tensor Parallelism, which shards every expert across all GPUs. The main advantage is lower data movement, especially at large GPU counts, since each token only needs to be sent to a subset of EP ranks. EP also lets the underlying grouped GEMM kernels operate on tensors with a better aspect ratio.
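As a sketch of the dispatch idea (illustrative only, not vLLM's actual routing code), assume experts are assigned to EP ranks in contiguous blocks; each token is then sent only to the ranks owning its top-k experts, whereas under TP every rank would process every token on a shard of every expert:

```python
import torch

NUM_EXPERTS = 16
NUM_EP_RANKS = 4
TOP_K = 2
EXPERTS_PER_RANK = NUM_EXPERTS // NUM_EP_RANKS  # contiguous block assignment (assumed)

def dispatch_plan(router_logits: torch.Tensor) -> list[list[int]]:
    """For each token, the set of EP ranks that own its top-k experts.

    Under EP a token is sent to at most top_k ranks; under TP every rank
    would see every token.
    """
    topk_experts = torch.topk(router_logits, TOP_K, dim=-1).indices  # [num_tokens, top_k]
    owner_ranks = topk_experts // EXPERTS_PER_RANK                   # expert id -> owning rank
    return [sorted(set(row.tolist())) for row in owner_ranks]

# Example: 4 tokens routed over 16 experts, top-2 gating.
logits = torch.randn(4, NUM_EXPERTS)
print(dispatch_plan(logits))  # e.g. [[0, 3], [1], [0, 2], [2, 3]]
```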

Proposed Change.

DP attention design doc.

Roadmap.

CC List.

@comaniac @njhill @bnellnm

Any Other Things.

No response

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
