[BUG] grad_norm is NaN when using 'overlap_comm: true' with 'contiguous_gradients: true' #7188

@whlook

Description

Describe the bug
As titled: when overlap_comm and contiguous_gradients are enabled together, grad_norm becomes NaN (or a constant float value on the latest master code with PR #7171 applied; that PR seems not to fix the root cause of the NaN).
However, it works fine with 'overlap_comm: false' and 'contiguous_gradients: true', OR with 'overlap_comm: true' and 'contiguous_gradients: false'. There seems to be a bug behind contiguous_gradients, maybe a memory-copy conflict?
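For reference, a minimal ZeRO config sketch showing the combination that triggers the NaN. The surrounding values (stage, batch size, precision) are placeholders for illustration, not taken from my actual run; only the two flags in question matter:

```json
{
  "train_micro_batch_size_per_gpu": 1,
  "zero_optimization": {
    "stage": 2,
    "overlap_comm": true,
    "contiguous_gradients": true
  },
  "bf16": {
    "enabled": true
  }
}
```

Flipping either overlap_comm or contiguous_gradients to false in this config makes grad_norm finite again.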

To Reproduce
Steps to reproduce the behavior:

  1. On my private dataset, I can always reproduce it.
  2. I think we can discuss it in this issue; a code review is probably needed to find the bug, because it is strongly tied to the interaction between 'overlap_comm' and 'contiguous_gradients'.

Expected behavior
A clear and concise description of what you expected to happen.

ds_report output
Please run ds_report to give us details about your setup.

Screenshots
If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: [e.g. Ubuntu 18.04]
  • GPU count and types [e.g. two machines with x8 A100s each]
  • Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
  • Python version
  • Any other relevant info about your setup

Launcher context
Are you launching your experiment with the deepspeed launcher, MPI, or something else?

Docker context
Are you using a specific docker image that you can share?

Additional context
Add any other context about the problem here.

Labels: bug (Something isn't working), training
