🐛 Describe the bug
When training Qwen3-8B with Liger Kernel + ZeRO-3, an error occurred during backpropagation:

```
The size of tensor a (0) must match the size of tensor b (4096) at non-singleton dimension 1
```
After debugging, I found that changing the `stage3_param_persistence_threshold` parameter in the ZeRO-3 config from `auto` to `1e10` resolves the problem. Alternatively, switching from ZeRO-3 to ZeRO-2 also works.
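For reference, a minimal sketch of the config change that worked around the issue (the surrounding fields are illustrative, not my full DeepSpeed config):

```json
{
  "zero_optimization": {
    "stage": 3,
    "stage3_param_persistence_threshold": 1e10
  }
}
```

Setting the threshold this high effectively keeps all parameters persistent instead of re-partitioning them after each forward pass, which is presumably why the error no longer triggers.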
I would like to ask why this happens. I don't seem to hit this problem when training Gemma3. Can anyone explain where the compatibility issue between Liger Kernel and DeepSpeed lies when training Qwen3?
Looking forward to your answer, thanks!
Reproduce
No response
Versions
Operating System: Linux-5.15.0-126-generic-x86_64-with-glibc2.39
Python version: 3.12.12
Liger Kernel version: 0.5.10
PyTorch version: 2.7.1+cu126
CUDA version: 12.6
HIP(ROCm) version: Not available
Triton version: 3.3.1
Transformers version: 4.51.3
XPU version: XPU Not Available