Description
Describe the bug
I adapted my training process from the Hugging Face trainer.py, so most of my trainer is similar to theirs. My model consists of a language model and an external module that learns additional parameters using information from the main model. I put the main and external parameters in separate parameter groups in the optimizer. While testing the code on larger models with DeepSpeed, I hit an assertion error after 50 training steps and 1 evaluation round. The error concerns the embedding matrix, which is still in "INFLIGHT" status when training resumes after evaluation, while all other parameters are "AVAILABLE".
AssertionError: {'id': 786, 'status': 'INFLIGHT', 'numel': 38603520, 'ds_numel': 38603520, 'shape':
(50265, 768), 'ds_shape': (50265, 768), 'requires_grad': False, 'grad_shape': None, 'persist': False,
'active_sub_modules': {3}}
The program runs fine if I simply train a language model without an external module.
To Reproduce
I can provide the code if necessary, but hope that I can get some help on how to further debug this issue :)
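As one possible starting point for debugging, the sketch below lists every parameter whose ZeRO status matches a given name, so it can be called right after evaluation and before the next training step to see which module owns the stuck parameter. It assumes the parameters carry the `ds_status`, `ds_id`, and `ds_shape` attributes that appear in the assertion message above; the helper name `params_with_status` is hypothetical and not part of DeepSpeed.

```python
def params_with_status(model, status_name="INFLIGHT"):
    """Return (name, ds_id, ds_shape) for parameters whose ZeRO status matches.

    Assumption: ZeRO-managed parameters expose `ds_status` (an enum-like
    value with a `.name`), `ds_id`, and `ds_shape`. Parameters without
    `ds_status` are skipped.
    """
    hits = []
    for name, param in model.named_parameters():
        status = getattr(param, "ds_status", None)
        if status is None:
            continue  # not managed by ZeRO, nothing to report
        # Compare by name so the helper does not need to import DeepSpeed.
        if getattr(status, "name", str(status)) == status_name:
            ds_id = getattr(param, "ds_id", None)
            ds_shape = tuple(getattr(param, "ds_shape", ()))
            hits.append((name, ds_id, ds_shape))
    return hits
```

Calling `params_with_status(model)` once at the end of the evaluation loop and once before the first resumed training step would show whether the embedding is already "INFLIGHT" when evaluation finishes or only becomes so afterwards.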
Expected behavior
I expected training to continue without errors after evaluation.
System info (please complete the following information):
- OS: Linux 4.18.0-425.13.1.el8_7.x86_64
- GPU count and types: 1 machine x2 A100s
- Interconnects (if applicable)
- Python version: python 3.8
Launcher context
I am using deepspeed to launch my program.