[BUG] grad_norm and loss are NaN with deepspeed==0.13.5 but OK with deepspeed==0.10.2 #5242

**Describe the bug**
When fine-tuning my model with deepspeed==0.13.5 and the Hugging Face Trainer, loss and grad_norm become NaN at step 2.
![image](https://github.com/microsoft/DeepSpeed/assets/29994840/cf2d7a6b-91df-43d6-9706-aa82c2dbf074)

However, either of the two workarounds below avoids the problem:
1. Downgrade to deepspeed==0.10.2.
2. Add this to my DeepSpeed config (which slows down training):
```
"comms_logger": {
  "enabled": true,
  "verbose": false,
  "prof_all": true,
  "debug": false
}
```
Why does this happen? Is there a bug I am not aware of?
Any clues on how to solve this?
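
One experiment that might help narrow this down (a guess, not a confirmed cause): enabling `comms_logger` with `prof_all` appears to make communication ops blocking so they can be timed, which would explain the slowdown and suggests the NaN may be timing-related. If that assumption holds, turning off `overlap_comm` in the ZeRO config would be a cheaper way to test it. The sketch below derives such a variant config; the helper name `without_overlap_comm` is hypothetical, and the dictionary is a trimmed copy of the `zero_optimization` block from my config.

```python
import copy
import json

# Trimmed copy of the "zero_optimization" block from ds_2_config.json.
base_config = {
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,
        "contiguous_gradients": True,
    }
}


def without_overlap_comm(config):
    """Return a deep copy of the config with ZeRO overlap_comm turned off.

    Hypothetical helper for an A/B test; the input config is left unchanged.
    """
    variant = copy.deepcopy(config)
    variant.setdefault("zero_optimization", {})["overlap_comm"] = False
    return variant


variant_config = without_overlap_comm(base_config)
print(json.dumps(variant_config, indent=4))
```

Running the same script once with the original config and once with the variant would show whether the NaN depends on overlapped communication.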

**To Reproduce**
Steps to reproduce the behavior:
1. My run script:
```
deepspeed \
    pretraining.py \
    --model_type auto \
    --model_name_or_path /app/nfs_share_dir/3/llm_model/Baichuan2-7B-Base \
    --train_file_dir /app/nfs_share_dir/1/archive/v2/token-baichuan/tmp \
    --validation_file_dir /app/nfs_share_dir/1/archive/v2/token-baichuan/tmp \
    --lazy_mode True \
    --per_device_train_batch_size 8 \
    --per_device_eval_batch_size 8 \
    --do_train \
    --do_eval \
    --seed 3 \
    --warmup_ratio 0.01 \
    --num_train_epochs 1 \
    --learning_rate 2e-5 \
    --lr_scheduler_type cosine \
    --weight_decay 1e-4 \
    --logging_strategy steps \
    --logging_steps 1 \
    --save_steps 1000 \
    --save_strategy steps \
    --save_total_limit 10 \
    --gradient_accumulation_steps 1 \
    --block_size 4096 \
    --torch_compile True \
    --output_dir outputs_qwen \
    --overwrite_output_dir \
    --ddp_timeout 30000 \
    --logging_first_step True \
    --log_on_each_node 0 \
    --torch_dtype bfloat16 \
    --report_to tensorboard \
    --ddp_find_unused_parameters False \
    --gradient_checkpointing True \
    --deepspeed ./config/ds_2_config.json \
    --bf16 \
    --bf16_full_eval
```
2. My DeepSpeed config (`./config/ds_2_config.json`):
```
{
    "optimizer": {
        "type": "AdamW",
        "params": {
            "lr": "auto",
            "weight_decay": "auto",
            "torch_adam": true,
            "adam_w_mode": true
        }
    },
    "bf16": {
        "enabled": true
    },
    "zero_optimization": {
        "stage": 2,
        "allgather_partitions": true,
        "allgather_bucket_size": 5e8,
        "reduce_scatter": true,
        "reduce_bucket_size": "auto",
        "overlap_comm": true,
        "contiguous_gradients": true
    },
    "flops_profiler": {
        "enabled": true,
        "profile_step": 10,
        "module_depth": -1,
        "top_modules": 1,
        "detailed": true
    },
    "tensorboard": {
        "enabled": true
    },
    "gradient_accumulation_steps": "auto",
    "gradient_clipping": "auto",
    "steps_per_print": 1000,
    "train_batch_size": "auto",
    "train_micro_batch_size_per_gpu": "auto",
    "wall_clock_breakdown": false
}
```

**Expected behavior**
The loss should be neither 0 nor NaN, and grad_norm should not be NaN.
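
The condition above can be sketched as a small check over logged metrics. This is a minimal sketch that assumes each log entry is a dict with `step`, `loss`, and `grad_norm` keys (similar in shape to what the Trainer logs); the function name `first_bad_step` and the sample values are made up for illustration and are not the numbers from my run.

```python
import math


def first_bad_step(log_history):
    """Return the step of the first entry with a bad loss or grad_norm.

    A loss is bad if it is 0, NaN, or infinite; a grad_norm is bad if it is
    NaN or infinite. Returns None when every entry looks healthy.
    """
    for entry in log_history:
        loss = entry.get("loss")
        grad_norm = entry.get("grad_norm")
        loss_bad = loss is not None and (loss == 0 or not math.isfinite(loss))
        grad_bad = grad_norm is not None and not math.isfinite(grad_norm)
        if loss_bad or grad_bad:
            return entry.get("step")
    return None


# Made-up values for illustration only.
sample_log = [
    {"step": 1, "loss": 2.5, "grad_norm": 1.0},
    {"step": 2, "loss": float("nan"), "grad_norm": float("nan")},
]
print(first_bad_step(sample_log))
```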

**ds_report output**
```
[2024-03-08 17:53:06,490] [INFO] [real_accelerator.py:191:get_accelerator] Setting ds_accelerator to cuda (auto detect)
--------------------------------------------------
DeepSpeed C++/CUDA extension op report
--------------------------------------------------
NOTE: Ops not installed will be just-in-time (JIT) compiled at
      runtime if needed. Op compatibility means that your system
      meet the required dependencies to JIT install the op.
--------------------------------------------------
JIT compiled ops requires ninja
ninja .................. [OKAY]
--------------------------------------------------
op name ................ installed .. compatible
--------------------------------------------------
 [WARNING]  async_io requires the dev libaio .so object and headers but these were not found.
 [WARNING]  async_io: please install the libaio-dev package with apt
 [WARNING]  If libaio is already installed (perhaps from source), try setting the CFLAGS and LDFLAGS environment variables to where it can be found.
async_io ............... [NO] ....... [NO]
fused_adam ............. [NO] ....... [OKAY]
cpu_adam ............... [NO] ....... [OKAY]
cpu_adagrad ............ [NO] ....... [OKAY]
cpu_lion ............... [NO] ....... [OKAY]
 [WARNING]  Please specify the CUTLASS repo directory as environment variable $CUTLASS_PATH
evoformer_attn ......... [NO] ....... [NO]
fused_lamb ............. [NO] ....... [OKAY]
fused_lion ............. [NO] ....... [OKAY]
inference_core_ops ..... [NO] ....... [OKAY]
cutlass_ops ............ [NO] ....... [OKAY]
transformer_inference .. [NO] ....... [OKAY]
quantizer .............. [NO] ....... [OKAY]
ragged_device_ops ...... [NO] ....... [OKAY]
ragged_ops ............. [NO] ....... [OKAY]
random_ltd ............. [NO] ....... [OKAY]
 [WARNING]  sparse_attn requires a torch version >= 1.5 and < 2.0 but detected 2.1
 [WARNING]  using untested triton version (2.1.0), only 1.0.0 is known to be compatible
sparse_attn ............ [NO] ....... [NO]
spatial_inference ...... [NO] ....... [OKAY]
transformer ............ [NO] ....... [OKAY]
stochastic_transformer . [NO] ....... [OKAY]
--------------------------------------------------
DeepSpeed general environment info:
torch install path ............... ['/opt/conda/lib/python3.10/site-packages/torch']
torch version .................... 2.1.2
deepspeed install path ........... ['/opt/conda/lib/python3.10/site-packages/deepspeed']
deepspeed info ................... 0.13.5, unknown, unknown
torch cuda version ............... 12.1
torch hip version ................ None
nvcc version ..................... 12.1
deepspeed wheel compiled w. ...... torch 2.1, cuda 12.1
shared memory (/dev/shm) size .... 503.72 GB
```

**Screenshots**
![image](https://github.com/microsoft/DeepSpeed/assets/29994840/cf2d7a6b-91df-43d6-9706-aa82c2dbf074)

**System info (please complete the following information):**
 - OS: Linux version 3.10.0-1127.19.1.el7.x86_64 (mockbuild@kbuilder.bsys.centos.org) (gcc version 4.8.5 20150623
 - 1 machine with 8x A800 GPUs
 - Interconnects (if applicable) [e.g., two machines connected with 100 Gbps IB]
 - Python version: 3.10.13
 - transformers==4.38.0

