
Conversation

Collaborator

@shuningjin shuningjin commented Jul 25, 2025

Description

Goal

For mixture-of-experts models, we may use expert parallelism (EP). In the attention layer, EP currently acts as FSDP. Building on the previous context parallelism (CP) work, this PR introduces the option of using EP as CP for attention. This is a joint effort with @RissyRan.

FIXES: b/418396648

Main code changes

  • attentions.py

    • MHA, MLA, tpu_flash_attention
    • changed the logical axis annotations
  • base.yml

    • changed logical_axis_rules
    • added a flag, expert_shard_attention_option (fsdp or context), to customize the expert sharding behavior in attention (see the sketch after this list)
  • unit test: tests.attention_test

    • AttentionTest.test_tpu_flash_attention_cp_and_ep and MLATest.test_tpu_flash_attention_cp_and_ep (extended from the CP tests)
    • run CP/EP with tpu_flash_attention, with or without context_parallel_load_balance
    • compare logits against dot_product attention without sharding
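
The flag essentially toggles which role the expert mesh axis plays inside attention. A minimal Python sketch of that idea is below; the rule tuples and axis names are illustrative assumptions, not the actual MaxText logical_axis_rules.

```python
# Illustrative sketch only: how a flag like expert_shard_attention_option could
# select the mesh axes used inside attention. Rule and axis names are assumed.

def attention_rules(expert_shard_attention_option: str):
  """Return hypothetical (logical_axis, mesh_axes) rules for attention activations."""
  if expert_shard_attention_option == "context":
    # EP acts as CP: the expert mesh axis helps shard the sequence dimension.
    return [("activation_length", ("context", "expert"))]
  if expert_shard_attention_option == "fsdp":
    # Default behavior: EP acts as FSDP, so the expert axis shards the batch dimension.
    return [("activation_batch", ("data", "fsdp", "expert"))]
  raise ValueError(f"unknown expert_shard_attention_option: {expert_shard_attention_option!r}")
```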

Use case

  • training, MoE with MHA / MLA attention
  • context_parallel_load_balance={true, false}, tpu_flash_attention
  • Example (ep_as_cp): ici_expert_parallelism=4, ici_context_parallelism=1, expert_shard_attention_option=context; shard context by 4 in attention, shard expert by 4 for MoE
  • Example (ep_as_cp + native cp): ici_expert_parallelism=2, ici_context_parallelism=2, expert_shard_attention_option=context; shard context by 4 in attention, shard expert by 2 and context by 2 for MoE (the sketch below checks this arithmetic)
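
The arithmetic in the two examples can be sanity-checked with a tiny sketch. The helper name is hypothetical; it only encodes the rule that EP joins CP inside attention when expert_shard_attention_option=context.

```python
def attention_context_shards(ici_expert_parallelism: int,
                             ici_context_parallelism: int,
                             expert_shard_attention_option: str) -> int:
  """Effective sharding degree of the sequence (context) dim inside attention."""
  if expert_shard_attention_option == "context":
    # EP joins CP for attention, so the context dim is split by both.
    return ici_expert_parallelism * ici_context_parallelism
  # fsdp option: only native CP shards the context dim inside attention.
  return ici_context_parallelism

assert attention_context_shards(4, 1, "context") == 4  # ep_as_cp example
assert attention_context_shards(2, 2, "context") == 4  # ep_as_cp + native cp example
```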

Tests

Tested on v5p-8

Verify sharding shape

  • end-to-end pretraining: reduced version of Mixtral-8x7b (for MHA) and reduced DeepSeek3-671b (for MLA)
  • tpu_flash_kernel, context_parallel_load_balance=True; parallelism: {ici_expert_parallelism=2, ici_context_parallelism=2, expert_shard_attention_option=context} and {ici_expert_parallelism=2, ici_context_parallelism=2, expert_shard_attention_option=fsdp} (a generic sharding-inspection sketch follows this list)
  • See test details in b/418396648#comment7
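
Outside the PR's own harness, one generic way to spot-check sharding shapes in JAX is to build a small mesh with expert and context axes and inspect the per-device shards. The sketch below assumes the ep=2, cp=2 case and illustrative tensor shapes; it is not the MaxText test code.

```python
import jax
import jax.numpy as jnp
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

# Toy 2x2 mesh with expert and context axes (uses the first 4 devices).
devices = mesh_utils.create_device_mesh((2, 2), devices=jax.devices()[:4])
mesh = Mesh(devices, axis_names=("expert", "context"))

# A [batch, seq, heads, head_dim] activation with the sequence dimension split
# over both axes, mimicking expert_shard_attention_option=context with ep=2, cp=2.
x = jax.device_put(
    jnp.zeros((4, 1024, 8, 128)),
    NamedSharding(mesh, P(None, ("expert", "context"), None, None)),
)

print(x.sharding)                          # the NamedSharding above
print(x.addressable_shards[0].data.shape)  # per-device shard: (4, 256, 8, 128)
```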

Verify attention output logits against dot product

  • use the added unit test
  • tpu_flash_kernel, attention {MHA, MLA}, context_parallel_load_balance={True, False}; parallelism: {ici_expert_parallelism=4, expert_shard_attention_option=context} and {ici_expert_parallelism=2, expert_shard_attention_option=context, ici_context_parallelism=2} (a generic comparison sketch follows)
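
The actual check lives in tests.attention_test; the sketch below only illustrates the general pattern of comparing an unsharded dot-product reference against the same computation run with the sequence dimension sharded over expert and context axes. Shapes, mesh layout, and tolerance are assumptions.

```python
import jax
import jax.numpy as jnp
import numpy as np
from jax.experimental import mesh_utils
from jax.sharding import Mesh, NamedSharding, PartitionSpec as P

def dot_product_attention(q, k, v):
  # Plain softmax(QK^T / sqrt(d)) V reference; no sharding assumptions.
  d = q.shape[-1]
  logits = jnp.einsum("bqhd,bkhd->bhqk", q, k) / jnp.sqrt(d)
  return jnp.einsum("bhqk,bkhd->bqhd", jax.nn.softmax(logits, axis=-1), v)

mesh = Mesh(mesh_utils.create_device_mesh((2, 2), devices=jax.devices()[:4]),
            axis_names=("expert", "context"))
kq, kk, kv = jax.random.split(jax.random.PRNGKey(0), 3)
shape = (2, 512, 8, 128)  # [batch, seq, heads, head_dim]
q, k, v = (jax.random.normal(key, shape) for key in (kq, kk, kv))

reference = dot_product_attention(q, k, v)  # unsharded baseline

# Same math with the sequence dimension sharded over expert+context; the
# compiler inserts the collectives, so the output should match the baseline.
seq_sharding = NamedSharding(mesh, P(None, ("expert", "context"), None, None))
out = jax.jit(dot_product_attention)(
    *(jax.device_put(t, seq_sharding) for t in (q, k, v)))

diff = np.max(np.abs(np.asarray(out) - np.asarray(reference)))
print(diff)           # small numerical difference from differing reduction order
assert diff < 1e-3
```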

Checklist

Before submitting this PR, please make sure (put X in square brackets):

  • I have performed a self-review of my code.
  • I have necessary comments in my code, particularly in hard-to-understand areas.
  • I have run end-to-end tests and provided workload links above if applicable.
  • I have made or will make corresponding changes to the doc if needed.

Collaborator

@RissyRan RissyRan left a comment


Thanks Shuning! Great work!

Collaborator

@gobbleturk gobbleturk left a comment


Have you considered an approach like conditionally modifying the rules (instead of creating new ones)? This is the approach used for pipeline parallelism:

def modify_activation_embed_and_logits_batch(logical_axis_rules):

There are pros and cons of both (both pretty ugly IMO), but at least when modifying rules there are:

  1. fewer rules
  2. no if statements to decide which rules to use
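
A rough sketch of what conditionally modifying the rules could look like (the helper name and the "length"-matching heuristic are hypothetical, not the pipeline-parallelism code referenced above):

```python
def maybe_shard_context_with_expert(logical_axis_rules, expert_shard_attention_option):
  """Hypothetical: append the expert mesh axis to sequence-length rules in place,
  instead of defining a separate set of rules for the EP-as-CP case."""
  if expert_shard_attention_option != "context":
    return logical_axis_rules
  new_rules = []
  for logical_axis, mesh_axes in logical_axis_rules:
    axes = tuple(mesh_axes) if isinstance(mesh_axes, (list, tuple)) else (mesh_axes,)
    if "length" in logical_axis and "expert" not in axes:
      axes = axes + ("expert",)
    new_rules.append((logical_axis, axes))
  return new_rules
```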

Collaborator

@richjames0 richjames0 left a comment


Really impressive that you understood this and got it working!

@shuningjin shuningjin requested a review from NuojCheng as a code owner August 5, 2025 22:01
@shuningjin shuningjin force-pushed the shuningjin-ep branch 5 times, most recently from a9f3ea9 to 40f8ad2 on August 7, 2025 21:16
Collaborator Author

shuningjin commented Aug 8, 2025

Resolved merge conflict with the nnx migration for the attention layer.

Re-tested on a local v5p-8; diff (before vs. after the nnx migration):

  • Mini Mixtral: 1.1, 1.2. Mini DeepSeek3: 2.1, 2.2
  • Sharding shapes are still correct, and training loss is close.
  • TFLOP/s/device is better now. In both cases, the FLOPs calculation includes the recent change. Possible reasons: different docker image (jax 0.6.2 vs. 0.7.0), linen vs. nnx.

@copybara-service copybara-service bot merged commit b98c47a into main Aug 8, 2025
18 checks passed
@copybara-service copybara-service bot deleted the shuningjin-ep branch August 8, 2025 05:25
@shuningjin shuningjin mentioned this pull request Aug 29, 2025
@gobbleturk gobbleturk mentioned this pull request Sep 8, 2025
