
[feat]support ep_fsdp#71

Draft
kevssim wants to merge 3 commits into modelscope:main from kevssim:ep_fsdp

Conversation


@kevssim kevssim commented Feb 24, 2026

PR type

  • Bug Fix
  • New Feature
  • Document Updates
  • More Models or Datasets Support

PR information

support ep_fsdp.

Experiment results

Paste your experiment result here (if needed).

@gemini-code-assist
Contributor

Summary of Changes

Hello @kevssim, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request implements support for a combined Expert Parallelism and Fully Sharded Data Parallel (EP+FSDP) strategy, specifically targeting Mixture-of-Experts (MoE) models. The changes enable more efficient distributed training by allowing experts to be sharded across a dedicated FSDP mesh while maintaining existing expert parallelism. This involves updates to expert sharding, integration into the FSDP wrapping mechanism, and improvements to gradient clipping for complex distributed tensor configurations.

Highlights

  • Introduced EP+FSDP Support for MoE Models: Enabled a new distributed training strategy that combines Expert Parallelism (EP) with Fully Sharded Data Parallel (FSDP) for Mixture-of-Experts (MoE) models.
  • Enhanced Expert Sharding Logic: Modified the expert_parallel.py module to conditionally handle expert execution and sharding based on whether EP+FSDP is enabled, including a new batch processing function for this mode.
  • Integrated EP+FSDP into FSDP Strategy: Updated the native_fsdp.py module to detect and apply EP+FSDP, including building a dedicated ep_fsdp_mesh and selectively sharding expert blocks using FSDP.
  • Improved Gradient Clipping for Distributed Tensors: Refined the grad_clip.py utility to correctly handle gradient clipping in scenarios involving mixed DTensor meshes, ensuring proper reduction across different device meshes.
  • Extended DeviceMesh Functionality: Added new methods to DeviceMesh in platform.py to facilitate the detection of implicit EP+FSDP mode and to retrieve ranks for specific dimensions, crucial for constructing the ep_fsdp_mesh.
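To make the mesh layout concrete, here is a small illustrative sketch (hypothetical helper, not code from this PR) of which ranks a dedicated EP+FSDP mesh would group together, assuming ranks are laid out row-major over (dp, ep):

```python
# Hypothetical sketch: with expert parallelism splitting the world into
# `ep_world_size` groups, the dedicated EP+FSDP mesh shards each expert's
# parameters across the ranks that hold the same expert shard, i.e. every
# ep_world_size-th rank in a row-major (dp, ep) layout.

def ep_fsdp_groups(world_size: int, ep_world_size: int) -> list[list[int]]:
    """Return the rank groups over which expert parameters would be
    FSDP-sharded, one group per EP position."""
    if world_size % ep_world_size != 0:
        raise ValueError('world_size must be divisible by ep_world_size')
    fsdp_size = world_size // ep_world_size
    # rank = dp_index * ep_world_size + ep_index
    return [
        [dp * ep_world_size + ep for dp in range(fsdp_size)]
        for ep in range(ep_world_size)
    ]
```

For example, with 8 ranks and `ep_world_size=2`, experts at EP position 0 would be sharded across ranks `[0, 2, 4, 6]` and those at position 1 across `[1, 3, 5, 7]`. The actual mesh construction in `_build_ep_fsdp_mesh` may differ.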

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
  • src/twinkle/metric/train_metric.py
    • Updated f-string formatting for learning rate and total time elapse.
  • src/twinkle/model/transformers/moe/expert_parallel.py
    • Added ep_fsdp_enabled flag and passed it to shard_experts.
    • Introduced a check for nn.ModuleList experts not supported with EP+EP_FSDP.
    • Stored _ep_fsdp_enabled on the block.
    • Implemented conditional expert execution in forward to use _run_experts_ep_fsdp_batch for EP+EP_FSDP.
    • Modified _run_expert to directly call experts.forward when EP+EP_FSDP is enabled.
    • Added _run_experts_ep_fsdp_batch function for batch processing of experts in EP+EP_FSDP mode.
  • src/twinkle/model/transformers/strategy/native_fsdp.py
    • Added _is_ep_fsdp_mode_enabled to check for EP+FSDP.
    • Introduced _build_ep_fsdp_mesh to create a dedicated mesh for EP+FSDP.
    • Added _ensure_ep_fsdp_supported to validate expert types for EP+FSDP.
    • Implemented _maybe_shard_ep_expert_blocks to apply FSDP sharding to expert blocks on the ep_fsdp_mesh.
    • Integrated EP+FSDP setup into the wrap_model method.
  • src/twinkle/utils/grad_clip.py
    • Modified normalize_and_clip_grad_norm to detect and handle has_mixed_dtensor_mesh.
    • Adjusted _local_grad to set reduce_group to None for mixed DTensor meshes, forcing world reduction over local shards.
  • src/twinkle/utils/platform.py
    • Added get_ranks_for_dims method to retrieve ranks for specified dimensions.
    • Implemented is_implicit_ep_fsdp_enabled to check if implicit EP+FSDP mode is active based on world sizes.
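The `get_ranks_for_dims` idea can be illustrated with a standalone sketch (hypothetical signature and row-major layout assumption; not the PR's implementation):

```python
import itertools

def ranks_for_dims(mesh_shape: dict[str, int], coords: dict[str, int],
                   free_dims: list[str]) -> list[int]:
    """Enumerate the global ranks reached by varying `free_dims` while
    holding the remaining mesh coordinates fixed at `coords`, assuming
    a row-major rank layout over the ordered mesh dimensions."""
    names = list(mesh_shape)
    strides, stride = {}, 1
    for name in reversed(names):  # row-major: last dimension varies fastest
        strides[name] = stride
        stride *= mesh_shape[name]
    ranks = []
    for combo in itertools.product(*(range(mesh_shape[d]) for d in free_dims)):
        pos = dict(coords)
        pos.update(zip(free_dims, combo))
        ranks.append(sum(pos[n] * strides[n] for n in names))
    return ranks
```

On a `{"dp": 2, "ep": 2}` mesh, varying `dp` at `ep=1` yields ranks `[1, 3]`: exactly the group over which an expert shard's parameters would be FSDP-sharded.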


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for EP+FSDP, a combination of expert parallelism and fully sharded data parallelism. The changes are spread across model parallelism logic, strategy definitions, and utilities for device mesh and gradient clipping. The implementation provides a new method for sharding and executing experts under this combined parallelism scheme. My review highlights two main concerns: a potential dead code path in the expert execution logic which could lead to confusion, and a restrictive assumption in a device mesh utility function that might limit its use with more complex distributed configurations.

Comment on lines +211 to +226
def is_implicit_ep_fsdp_enabled(self) -> bool:
    ep_world_size = self.ep_world_size or 1
    dp_world_size = self.dp_world_size or 1
    if ep_world_size <= 1 or dp_world_size <= 1:
        return False

    world_size = self.world_size or 1
    if world_size % ep_world_size != 0:
        raise ValueError(f'world_size ({world_size}) must be divisible by ep_world_size ({ep_world_size}) '
                         'to infer implicit EP_FSDP from dp.')
    expected_dp_size = world_size // ep_world_size
    if dp_world_size != expected_dp_size:
        raise ValueError(f'Implicit EP_FSDP requires dp_world_size == world_size // ep_world_size, '
                         f'but got dp_world_size={dp_world_size}, world_size={world_size}, '
                         f'ep_world_size={ep_world_size}.')
    return True

Severity: medium

The logic in is_implicit_ep_fsdp_enabled seems to assume that the device mesh consists only of dp and ep dimensions. The check dp_world_size != world_size // ep_world_size will likely fail for more complex meshes that also include other parallelism dimensions like tensor parallelism (tp) or pipeline parallelism (pp), as it would incorrectly require tp_size * pp_size == 1. This could be a potential limitation or bug. Please consider making the logic more robust to handle arbitrary mesh dimensions or clearly documenting this limitation.

