[BUG] INFLIGHT parameters after evaluation #3068

@xiamengzhou

Describe the bug
I adapted my training loop from the Hugging Face trainer.py, so most of my trainer matches theirs. My model consists of a language model plus an external module that learns additional parameters using information from the main model. I put the main and external parameters in separate parameter groups in the optimizer. While testing the code on larger models with DeepSpeed, I hit an assertion error after 50 training steps and 1 evaluation round. The error concerns the embedding matrix, which remains in "INFLIGHT" status when training resumes, while all other parameters are "AVAILABLE".

AssertionError: {'id': 786, 'status': 'INFLIGHT', 'numel': 38603520, 'ds_numel': 38603520, 'shape':
(50265, 768), 'ds_shape': (50265, 768), 'requires_grad': False, 'grad_shape': None, 'persist': False,
'active_sub_modules': {3}}

The program runs fine if I simply train a language model without an external module.

To Reproduce
I can provide the code if necessary, but I hope to get some help on how to debug this issue further :)
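
As a starting point for debugging, here is a minimal sketch of a check that could be run right after the evaluation loop and before the next training step, to list which parameters are still marked "INFLIGHT". It assumes that ZeRO stage 3 attaches `ds_status` and `ds_id` attributes to each parameter (the error summary above prints 'status' and 'id', which suggests so). The helper name `params_with_status` is hypothetical and not part of DeepSpeed.

```python
def params_with_status(named_params, status_name="INFLIGHT"):
    """Return (name, ds_id) for every parameter whose ZeRO status matches.

    `named_params` is any iterable of (name, param) pairs, for example
    model.named_parameters(). Assumption: ZeRO-3 parameters carry a
    `ds_status` enum attribute and a `ds_id` attribute.
    """
    hits = []
    for name, param in named_params:
        status = getattr(param, "ds_status", None)
        if status is None:
            # Not a ZeRO-3 managed parameter; nothing to report.
            continue
        # Compare by enum member name so this helper needs no DeepSpeed import.
        if getattr(status, "name", str(status)) == status_name:
            hits.append((name, getattr(param, "ds_id", None)))
    return hits


# Hypothetical usage inside the training script, after evaluation:
#   for name, ds_id in params_with_status(model.named_parameters()):
#       print(f"still INFLIGHT after eval: {name} (ds_id={ds_id})")
```

If only the embedding matrix shows up, that would at least confirm the state is left over from the evaluation pass rather than produced by the first training step after it.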

Expected behavior
I expected training to run fine after evaluation.

Screenshots
If applicable, add screenshots to help explain your problem.

System info (please complete the following information):

  • OS: Linux 4.18.0-425.13.1.el8_7.x86_64
  • GPU count and types: 1 machine x2 A100s
  • Interconnects (if applicable)
  • Python version: python 3.8

Launcher context
I am using deepspeed to launch my program.

Labels: bug, training