Description
Describe the bug
As the title says, when overlap_comm and contiguous_gradients are enabled together, grad_norm becomes NaN. On the latest master code, which includes pr:#7171, it is a constant float value instead, so that PR does not seem to have fixed the root cause of the NaN.
Training works fine with 'overlap_comm: false' and 'contiguous_gradients: true', or with 'overlap_comm: true' and 'contiguous_gradients: false'. This suggests a bug in the contiguous_gradients path, maybe a memory copy conflict?
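To make the comparison concrete, here is a minimal sketch that builds all four combinations of the two flags as DeepSpeed config dicts. The keys `overlap_comm` and `contiguous_gradients` under `zero_optimization` are the ones discussed above; the ZeRO stage, the micro batch size, and the helper name `make_ds_config` are assumptions for illustration, since this report does not state them.

```python
from itertools import product


def make_ds_config(overlap_comm, contiguous_gradients, zero_stage=2):
    """Build a minimal DeepSpeed config dict (hypothetical helper).

    zero_stage=2 and the micro batch size of 1 are assumptions,
    not values taken from the report.
    """
    return {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": zero_stage,
            "overlap_comm": overlap_comm,
            "contiguous_gradients": contiguous_gradients,
        },
    }


# One config per (overlap_comm, contiguous_gradients) combination.
configs = {
    (oc, cg): make_ds_config(oc, cg)
    for oc, cg in product([False, True], repeat=2)
}

# The combination reported as producing a NaN or constant grad_norm.
failing = configs[(True, True)]
```

Running the same model and data once per entry in `configs` and comparing the logged grad_norm should isolate whether only the `(True, True)` case misbehaves.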
To Reproduce
Steps to reproduce the behavior:
- On my private dataset, I can always reproduce it.
- I think we should discuss it in this issue, most likely with some code review to find the bug, because the failure is strongly connected to the combination of 'overlap_comm' and 'contiguous_gradients'.
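Since the private dataset cannot be shared, one way to compare runs is to classify the logged grad_norm values for the two symptoms described above (NaN, or a value that never changes). This is only a sketch: the function name `classify_grad_norms` and the label strings are hypothetical, and it assumes the grad_norm values have already been collected into a plain list of floats.

```python
import math


def classify_grad_norms(norms):
    """Label a sequence of per-step grad_norm values (hypothetical helper).

    Returns "non-finite" if any value is NaN or inf, "constant" if more
    than one value was logged and all are identical, otherwise "ok".
    """
    if any(not math.isfinite(n) for n in norms):
        return "non-finite"
    if len(norms) > 1 and all(n == norms[0] for n in norms):
        return "constant"
    return "ok"
```

For example, `classify_grad_norms([0.5, 0.5, 0.5])` flags the constant-value symptom seen on master, while a list containing `float("nan")` flags the original one.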
Expected behavior
A clear and concise description of what you expected to happen.
ds_report output
Please run ds_report to give us details about your setup.
Screenshots
If applicable, add screenshots to help explain your problem.
System info (please complete the following information):
- OS: [e.g. Ubuntu 18.04]
- GPU count and types [e.g. two machines with x8 A100s each]
- Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
- Python version
- Any other relevant info about your setup
Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?
Docker context
Are you using a specific docker image that you can share?
Additional context
Add any other context about the problem here.