
[feat]support ep_fsdp#71

Draft
kevssim wants to merge 3 commits into modelscope:main from kevssim:ep_fsdp

Conversation


@kevssim kevssim commented Feb 24, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

support ep_fsdp.

Experiment results

Paste your experiment result here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello @kevssim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements support for a combined Expert Parallelism and Fully Sharded Data Parallel (EP+FSDP) strategy, specifically targeting Mixture-of-Experts (MoE) models. The changes enable more efficient distributed training by allowing experts to be sharded across a dedicated FSDP mesh while maintaining existing expert parallelism. This involves updates to expert sharding, integration into the FSDP wrapping mechanism, and improvements to gradient clipping for complex distributed tensor configurations.

Highlights

  • Introduced EP+FSDP Support for MoE Models: Enabled a new distributed training strategy that combines Expert Parallelism (EP) with Fully Sharded Data Parallel (FSDP) for Mixture-of-Experts (MoE) models.
  • Enhanced Expert Sharding Logic: Modified the expert_parallel.py module to conditionally handle expert execution and sharding based on whether EP+FSDP is enabled, including a new batch processing function for this mode.
  • Integrated EP+FSDP into FSDP Strategy: Updated the native_fsdp.py module to detect and apply EP+FSDP, including building a dedicated ep_fsdp_mesh and selectively sharding expert blocks using FSDP.
  • Improved Gradient Clipping for Distributed Tensors: Refined the grad_clip.py utility to correctly handle gradient clipping in scenarios involving mixed DTensor meshes, ensuring proper reduction across different device meshes.
  • Extended DeviceMesh Functionality: Added new methods to DeviceMesh in platform.py to facilitate the detection of implicit EP+FSDP mode and to retrieve ranks for specific dimensions, crucial for constructing the ep_fsdp_mesh.
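To make the mesh layout concrete, here is a small illustrative sketch (hypothetical helper, not code from this PR) of which ranks a dedicated EP+FSDP mesh would group together, assuming ranks are laid out row-major over (dp, ep):

```python
# Hypothetical sketch: with expert parallelism splitting the world into
# `ep_world_size` groups, the dedicated EP+FSDP mesh shards each expert's
# parameters across the ranks that hold the same expert shard, i.e. every
# ep_world_size-th rank in a row-major (dp, ep) layout.

def ep_fsdp_groups(world_size: int, ep_world_size: int) -> list[list[int]]:
    """Return the rank groups over which expert parameters would be
    FSDP-sharded, one group per EP position."""
    if world_size % ep_world_size != 0:
        raise ValueError('world_size must be divisible by ep_world_size')
    fsdp_size = world_size // ep_world_size
    # rank = dp_index * ep_world_size + ep_index
    return [
        [dp * ep_world_size + ep for dp in range(fsdp_size)]
        for ep in range(ep_world_size)
    ]
```

For example, with 8 ranks and `ep_world_size=2`, experts at EP position 0 would be sharded across ranks `[0, 2, 4, 6]` and those at position 1 across `[1, 3, 5, 7]`. The actual mesh construction in `_build_ep_fsdp_mesh` may differ.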

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • src/twinkle/metric/train_metric.py
    • Updated f-string formatting for learning rate and total time elapse.
  • src/twinkle/model/transformers/moe/expert_parallel.py
    • Added ep_fsdp_enabled flag and passed it to shard_experts.
    • Introduced a check for nn.ModuleList experts not supported with EP+EP_FSDP.
    • Stored _ep_fsdp_enabled on the block.
    • Implemented conditional expert execution in forward to use _run_experts_ep_fsdp_batch for EP+EP_FSDP.
    • Modified _run_expert to directly call experts.forward when EP+EP_FSDP is enabled.
    • Added _run_experts_ep_fsdp_batch function for batch processing of experts in EP+EP_FSDP mode.
  • src/twinkle/model/transformers/strategy/native_fsdp.py
    • Added _is_ep_fsdp_mode_enabled to check for EP+FSDP.
    • Introduced _build_ep_fsdp_mesh to create a dedicated mesh for EP+FSDP.
    • Added _ensure_ep_fsdp_supported to validate expert types for EP+FSDP.
    • Implemented _maybe_shard_ep_expert_blocks to apply FSDP sharding to expert blocks on the ep_fsdp_mesh.
    • Integrated EP+FSDP setup into the wrap_model method.
  • src/twinkle/utils/grad_clip.py
    • Modified normalize_and_clip_grad_norm to detect and handle has_mixed_dtensor_mesh.
    • Adjusted _local_grad to set reduce_group to None for mixed DTensor meshes, forcing world reduction over local shards.
  • src/twinkle/utils/platform.py
    • Added get_ranks_for_dims method to retrieve ranks for specified dimensions.
    • Implemented is_implicit_ep_fsdp_enabled to check if implicit EP+FSDP mode is active based on world sizes.
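The `get_ranks_for_dims` idea can be illustrated with a standalone sketch (hypothetical signature and row-major layout assumption; not the PR's implementation):

```python
import itertools

def ranks_for_dims(mesh_shape: dict[str, int], coords: dict[str, int],
                   free_dims: list[str]) -> list[int]:
    """Enumerate the global ranks reached by varying `free_dims` while
    holding the remaining mesh coordinates fixed at `coords`, assuming
    a row-major rank layout over the ordered mesh dimensions."""
    names = list(mesh_shape)
    strides, stride = {}, 1
    for name in reversed(names):  # row-major: last dimension varies fastest
        strides[name] = stride
        stride *= mesh_shape[name]
    ranks = []
    for combo in itertools.product(*(range(mesh_shape[d]) for d in free_dims)):
        pos = dict(coords)
        pos.update(zip(free_dims, combo))
        ranks.append(sum(pos[n] * strides[n] for n in names))
    return ranks
```

On a `{"dp": 2, "ep": 2}` mesh, varying `dp` at `ep=1` yields ranks `[1, 3]`: exactly the group over which an expert shard's parameters would be FSDP-sharded.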


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for EP+FSDP, a combination of expert parallelism and fully sharded data parallelism. The changes are spread across model parallelism logic, strategy definitions, and utilities for device mesh and gradient clipping. The implementation provides a new method for sharding and executing experts under this combined parallelism scheme. My review highlights two main concerns: a potential dead code path in the expert execution logic which could lead to confusion, and a restrictive assumption in a device mesh utility function that might limit its use with more complex distributed configurations.

Comment on lines +211 to +226
def is_implicit_ep_fsdp_enabled(self) -> bool:
    ep_world_size = self.ep_world_size or 1
    dp_world_size = self.dp_world_size or 1
    if ep_world_size <= 1 or dp_world_size <= 1:
        return False

    world_size = self.world_size or 1
    if world_size % ep_world_size != 0:
        raise ValueError(f'world_size ({world_size}) must be divisible by ep_world_size ({ep_world_size}) '
                         'to infer implicit EP_FSDP from dp.')
    expected_dp_size = world_size // ep_world_size
    if dp_world_size != expected_dp_size:
        raise ValueError(f'Implicit EP_FSDP requires dp_world_size == world_size // ep_world_size, '
                         f'but got dp_world_size={dp_world_size}, world_size={world_size}, '
                         f'ep_world_size={ep_world_size}.')
    return True

Severity: medium

The logic in is_implicit_ep_fsdp_enabled seems to assume that the device mesh consists only of dp and ep dimensions. The check dp_world_size != world_size // ep_world_size will likely fail for more complex meshes that also include other parallelism dimensions like tensor parallelism (tp) or pipeline parallelism (pp), as it would incorrectly require tp_size * pp_size == 1. This could be a potential limitation or bug. Please consider making the logic more robust to handle arbitrary mesh dimensions or clearly documenting this limitation.

