@ympcMark
Motivation

The previous work that this PR builds on implemented fault tolerance only at the EP level, not at the DP level. This PR adds DP-level fault tolerance by maintaining worker information through communication among the Scheduler, TokenizerManager, and DataParallelController.

Modifications

io_struct.py: Define the scheduler status information.

scheduler.py: Send the status information to TokenizerManager.

tokenizer_manager.py: Add a handler that accepts this status information and forwards it to DataParallelController.

data_parallel_controller.py: Add a handler that accepts this status information and uses it to maintain worker information.
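
The flow described above can be illustrated with a minimal sketch. It is not the code in this PR. The class names (`ToySchedulerStatus`, `ToyScheduler`, `ToyTokenizerManager`, `ToyDataParallelController`), the fields `dp_rank` and `healthy`, and the method names are all hypothetical stand-ins, and direct method calls are used in place of whatever inter-process communication the real components use.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ToySchedulerStatus:
    """Hypothetical status message; the real definition in io_struct.py may differ."""
    dp_rank: int   # assumed field: which DP worker is reporting
    healthy: bool  # assumed field: whether that worker is usable


class ToyDataParallelController:
    """Keeps a per-rank health table built from received status messages."""

    def __init__(self, dp_size: int) -> None:
        # Assume every worker starts out healthy.
        self.worker_healthy: Dict[int, bool] = {r: True for r in range(dp_size)}

    def handle_scheduler_status(self, status: ToySchedulerStatus) -> None:
        # Record the latest reported state for the reporting rank.
        self.worker_healthy[status.dp_rank] = status.healthy

    def active_ranks(self) -> List[int]:
        # Ranks that are currently eligible to receive requests.
        return [r for r, ok in sorted(self.worker_healthy.items()) if ok]


class ToyTokenizerManager:
    """Relays status messages from schedulers to the controller."""

    def __init__(self, controller: ToyDataParallelController) -> None:
        self.controller = controller

    def handle_scheduler_status(self, status: ToySchedulerStatus) -> None:
        # Forward unchanged; the real handler may do more than relay.
        self.controller.handle_scheduler_status(status)


class ToyScheduler:
    """Reports its own status upstream."""

    def __init__(self, dp_rank: int, tokenizer_manager: ToyTokenizerManager) -> None:
        self.dp_rank = dp_rank
        self.tokenizer_manager = tokenizer_manager

    def report_status(self, healthy: bool) -> None:
        status = ToySchedulerStatus(dp_rank=self.dp_rank, healthy=healthy)
        self.tokenizer_manager.handle_scheduler_status(status)
```

In this sketch, a scheduler that calls `report_status(healthy=False)` causes its rank to drop out of `active_ranks()`, and a later `report_status(healthy=True)` restores it.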

Accuracy Tests

Added a new unit test: test/srt/ep/test_mooncake_ep_small.py

Benchmarking and Profiling

Checklist

@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 2 times, most recently from 65a0b1f to 6f302c3 Compare October 23, 2025 02:35
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 4 times, most recently from 2a08f54 to 09c4da0 Compare October 23, 2025 04:06
@ympcMark ympcMark marked this pull request as ready for review October 23, 2025 04:25
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from 09c4da0 to 567ada0 Compare October 23, 2025 07:16
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 2 times, most recently from c13f1fc to e7f23db Compare October 23, 2025 10:08

_TP: Optional[GroupCoordinator] = None
_TP_ACTIVE_RANKS: Optional[torch.Tensor] = None
_TP_ACTIVE_RANKS_CPU: Optional[torch.Tensor] = None
Collaborator

I'm wondering whether it would be more reasonable to rename the elastic ep module to elastic and group related concepts such as _TP_ACTIVE_RANKS and _TP_ACTIVE_RANKS_CPU together there as much as possible.

Collaborator

That would make it less intrusive to the original logic.

Contributor

Sounds reasonable

@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from d03f5e2 to 9959c80 Compare November 5, 2025 03:33
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 2 times, most recently from de2465d to 2f068be Compare November 5, 2025 11:48
@UNIDY2002 UNIDY2002 requested a review from Fridge003 as a code owner November 7, 2025 07:58
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch 9 times, most recently from 0792453 to 2f068be Compare November 7, 2025 09:49
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from 2f068be to ad44587 Compare December 12, 2025 07:15
UNIDY2002 and others added 4 commits December 15, 2025 14:54
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
@UNIDY2002 UNIDY2002 force-pushed the mooncake-pr-eplb-scheduler branch from ad44587 to 4e13ddd Compare December 15, 2025 06:55
4 participants