reduce cpu host overhead when using moe #5578
@ranzhejiang Thank you for your contribution! I have a few questions about your changes. Can you clarify them?
e9e32f4 to d860d2c
Hi @tohtana, I have clarified the modifications you mentioned and retested this PR with Megatron-DeepSpeed on a GPU platform (8xA800). It runs well, and the loss remains consistent with the original method. Could you please review it again? Thanks!
686f511 to 23ec4a1
23ec4a1 to 1cb0efd
#5881 also adopts this plan to reduce CPU time.
The `.to('cpu')` operation is not necessary for exp_counts, and it causes a device-to-host synchronization, which hurts performance.
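To illustrate why a per-layer device-to-host copy costs host time, here is a toy model in plain Python. It is not DeepSpeed or PyTorch code: a single worker thread stands in for an in-order device stream, `handle.result()` stands in for the blocking `.to('cpu')` copy, and `gate_kernel`, `host_launch_time`, `KERNEL_SECONDS`, and `NUM_LAYERS` are hypothetical names and values chosen only for this sketch.

```python
import time
from concurrent.futures import ThreadPoolExecutor

KERNEL_SECONDS = 0.02  # hypothetical duration of one gating kernel
NUM_LAYERS = 5         # hypothetical number of MoE layers


def gate_kernel(layer):
    """Stand-in for the device-side gating work that produces exp_counts."""
    time.sleep(KERNEL_SECONDS)
    return [layer, layer + 1]  # fake per-expert token counts


def host_launch_time(copy_to_cpu):
    """Return (seconds the host spent queuing all layers, all exp_counts)."""
    # One worker thread plays the role of a single in-order device stream.
    with ThreadPoolExecutor(max_workers=1) as device:
        pending = []
        start = time.perf_counter()
        for layer in range(NUM_LAYERS):
            handle = device.submit(gate_kernel, layer)  # asynchronous launch
            if copy_to_cpu:
                handle.result()  # like .to('cpu'): the host waits for the device
            pending.append(handle)
        elapsed = time.perf_counter() - start
        # Read the values once, after all the work has been queued.
        counts = [h.result() for h in pending]
    return elapsed, counts


synced_time, synced_counts = host_launch_time(copy_to_cpu=True)
async_time, async_counts = host_launch_time(copy_to_cpu=False)
print(f"host time with per-layer copy: {synced_time:.3f}s")
print(f"host time without copy:        {async_time:.3f}s")
```

In this model both variants produce the same counts, but the host loop with the per-layer copy cannot run ahead of the device, so it takes at least as long as all the kernels combined. Without the copy, the host queues every layer almost immediately and reads the results only when they are actually needed.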