[3/N] Achieve fault tolerance at the DP level #11657
base: main
_TP: Optional[GroupCoordinator] = None
_TP_ACTIVE_RANKS: Optional[torch.Tensor] = None
_TP_ACTIVE_RANKS_CPU: Optional[torch.Tensor] = None
I'm wondering whether it would be more reasonable to rename the elastic EP module to elastic and, as much as possible, group together concepts like _TP_ACTIVE_RANKS and _TP_ACTIVE_RANKS_CPU?
Make it less intrusive to the original logic
Sounds reasonable
Co-authored-by: Hank Han <hanhan7630@outlook.com>
Co-authored-by: UNIDY2002 <unidy2002@outlook.com>
Motivation
The work preceding this PR implemented fault tolerance only at the EP level, not at the DP level. This PR adds DP-level fault tolerance by maintaining worker information through communication between the Scheduler, TokenizerManager, and DataParallelController.
Modifications
- io_struct.py: Define the scheduler status information.
- scheduler.py: Send status information to TokenizerManager.
- tokenizer_manager.py: Add a handler that accepts this status information and forwards it to DataParallelController.
- data_parallel_controller.py: Add a handler that accepts this status information and maintains worker information.
Accuracy Tests
Added a new unit test: test/srt/ep/test_mooncake_ep_small.py
Benchmarking and Profiling
Checklist
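The status flow described under Modifications (Scheduler to TokenizerManager to DataParallelController) can be illustrated with a minimal, standalone sketch. This is not the code from the PR: every class, field, and method name below (SchedulerStatus, SchedulerSketch, TokenizerSketch, ControllerSketch, report, handle_status, active_ranks) is hypothetical, and direct function calls stand in for whatever inter-process transport the real components use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class SchedulerStatus:
    """Hypothetical status message; the real definition lives in io_struct.py."""
    dp_rank: int
    healthy: bool


class ControllerSketch:
    """Stand-in for DataParallelController: keeps per-worker health information."""

    def __init__(self, dp_size: int) -> None:
        # Assumption: every DP worker is considered healthy at startup.
        self.worker_healthy: Dict[int, bool] = {r: True for r in range(dp_size)}

    def handle_status(self, status: SchedulerStatus) -> None:
        # Record the latest status reported for this DP rank.
        self.worker_healthy[status.dp_rank] = status.healthy

    def active_ranks(self) -> List[int]:
        # Ranks that are currently usable for dispatch.
        return sorted(r for r, ok in self.worker_healthy.items() if ok)


class TokenizerSketch:
    """Stand-in for TokenizerManager: relays status messages to the controller."""

    def __init__(self, forward: Callable[[SchedulerStatus], None]) -> None:
        self._forward = forward

    def handle_status(self, status: SchedulerStatus) -> None:
        self._forward(status)


class SchedulerSketch:
    """Stand-in for a Scheduler: reports its own status upstream."""

    def __init__(self, dp_rank: int, send: Callable[[SchedulerStatus], None]) -> None:
        self.dp_rank = dp_rank
        self._send = send

    def report(self, healthy: bool) -> None:
        self._send(SchedulerStatus(dp_rank=self.dp_rank, healthy=healthy))


# Wire the three components together and simulate one worker failing.
controller = ControllerSketch(dp_size=4)
tokenizer = TokenizerSketch(forward=controller.handle_status)
schedulers = [SchedulerSketch(r, tokenizer.handle_status) for r in range(4)]

schedulers[2].report(healthy=False)
print(controller.active_ranks())  # [0, 1, 3]
```

Keeping the health table in the controller means dispatch decisions can skip a failed DP rank without the schedulers needing to know about each other; whether the PR makes the same choice in detail should be checked against the actual diff.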