
"backward pass is invalid for module in evaluation mode" with deepspeed stage 3 #19467

Open
olegsinavski opened this issue Feb 13, 2024 · 6 comments
Labels: question (Further information is requested)

Comments

@olegsinavski

Bug description

Hello,
After the 2.2 Lightning upgrade, and only with DeepSpeed stage 3, we experience a crash: `backward pass is invalid for module in evaluation mode`. It is most likely caused by the recent changes in train/eval mode switching.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Use the DeepSpeed stage 3 strategy.

Error messages and logs

  File "/pip-ai-experimental_torch/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/pip-ai-experimental_torch/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/pip-core_deepspeed/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 170, in backward
    ctx.pre_backward_function(ctx.module)
  File "/pip-core_deepspeed/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/pip-core_deepspeed/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 447, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/pip-ai-experimental_torch/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/pip-core_deepspeed/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 524, in pre_sub_module_backward_function
    assert sub_module.training, "backward pass is invalid for module in evaluation mode"
AssertionError: backward pass is invalid for module in evaluation mode
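The assertion in the last frame is a one-line check inside DeepSpeed's ZeRO stage 3 offload code. A minimal sketch of what trips it, paraphrased in plain PyTorch (no DeepSpeed required):

```python
import torch.nn as nn

# A sub-module that ended up in eval mode, e.g. a model loaded that way.
sub_module = nn.Linear(2, 2)
sub_module.eval()

# Paraphrase of the check DeepSpeed runs before each sub-module's backward:
try:
    assert sub_module.training, "backward pass is invalid for module in evaluation mode"
    failed = False
except AssertionError as exc:
    failed = True
    message = str(exc)
```

Here `failed` is `True` and `message` matches the error in the traceback above, because the module's `training` flag is `False`.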

Environment

torch 2.1.2

More info

No response

@olegsinavski added the `bug` and `needs triage` labels on Feb 13, 2024
@awaelchli
Contributor

@olegsinavski This is unavoidable, unfortunately. You must have a module somewhere in your LightningModule that's in eval mode, and DeepSpeed makes the absurd decision of not letting you run the backward pass in that case (even though train/eval mode has nothing to do with gradient computation).

Maybe you have an `HFModel.from_pretrained()` call or similar that defaults to eval mode.

@awaelchli added the `question` label and removed the `bug` and `needs triage` labels on Feb 13, 2024
@olegsinavski
Author

olegsinavski commented Feb 13, 2024

Yes, I do actually have `HFModel.from_pretrained()`! Could you please elaborate on why it's unavoidable? From the perspective of a Lightning user, it does look like a behavioral regression in 2.2: on 2.1 it works fine, but on 2.2 stage 3 doesn't work (stage 2 still works). Since it worked on Lightning 2.1, there must have been something that made it work?

You're right: I just checked that `.from_pretrained()` produces a model with `model.training == False` for both stage 2 and stage 3. Why would it work for Lightning with stage 2, then?

Thanks a lot though; the workaround is to call `self.model.train()` after `HFModel.from_pretrained()`.

P.S. I use `configure_model` as opposed to the constructor.
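The workaround described above can be sketched with plain PyTorch; `TinyLM` is a hypothetical stand-in for a HuggingFace model, whose `from_pretrained()` hands back a module in eval mode:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in for a HuggingFace model:
    from_pretrained() returns it in eval mode by default."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.eval()  # mimic the HuggingFace default

model = TinyLM()
assert not model.training  # as loaded: eval mode, stage 3 backward would crash

model.train()  # the workaround: switch back right after loading
assert model.training
```

In a LightningModule, the `model.train()` call would go right after the `from_pretrained()` call inside `configure_model` (or the constructor).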

@awaelchli
Contributor

awaelchli commented Feb 13, 2024

> Could you please elaborate why it's unavoidable?
> Since it worked on 2.1 Lightning, there was something that made it work?

HuggingFace models loaded with `.from_pretrained()` are in eval mode by default, because they are expected to be used for inference. To train, you need to call `.train()` on them. This is special to HuggingFace: by default in PyTorch, any `nn.Module` you create is in training mode.

In Lightning 2.1 and earlier, the Trainer called `model.train()` at the beginning of `.fit()`, which is why your HuggingFace model was silently converted to training mode. While in your case this might have been fine (or even desired), this automatic conversion had one big problem: if you had frozen layers in eval mode that you didn't want to train, the Trainer would silently convert them to train mode anyway, and there was no way for the user to know that. Overriding that behavior required implementing a special hook in Lightning, which was practically undiscoverable by users.

In 2.2, we removed this silent behavior so users don't fall into this trap. Now, if you train a model in Lightning 2.2+ with part of it frozen (e.g. a feature extractor that you don't train), the training mode you set will be respected. I consider this an important fix; it has personally bothered me for a long time, especially when users compared results trained with Lightning vs. raw PyTorch and saw a discrepancy in losses/accuracies due to this silent conversion.
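The frozen-layer trap described above can be illustrated with plain PyTorch; `Classifier` and its submodule names are hypothetical:

```python
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen feature extractor: no gradients, deterministic eval behaviour.
        self.backbone = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5))
        self.backbone.requires_grad_(False)
        self.backbone.eval()
        self.head = nn.Linear(8, 2)  # the part we actually train

model = Classifier()
assert not model.backbone.training  # Lightning 2.2+ respects this during fit()
assert model.head.training

model.train()  # what Trainer <= 2.1 did silently at the start of fit()
assert model.backbone.training  # the frozen backbone got flipped too
```

The final `model.train()` call shows the old silent conversion: it recursively flips every submodule, including the backbone the user deliberately put in eval mode.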

I think one thing we could do, for visibility, is include a column in the `ModelSummary` (the table printed by default at the beginning of `fit`) that shows each module's training mode. This could help users sanity-check this better.

@olegsinavski
Author

Ok, thank you for the explanation! It totally makes sense! (Although I personally prefer `requires_grad=False` for the frozen bits.)
One last question: does it mean that with stage 2 the model is also in eval mode, and it's just that stage 2 doesn't have that check, so it doesn't crash?

I guess the reason stage 2 still works for me is that I don't have any dropout/batch-norm layers that change behavior depending on the training flag... So everyone needs to call `.train()` after `.from_pretrained()`.

If I understand correctly, some users might have silent bugs after the 2.2 upgrade if they used `.from_pretrained()` with dropout and batch norm. But I guess the new behavior is less "state-mutating" and hence safer on average.
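The silent-bug scenario above comes from dropout behaving differently depending on the training flag; a small illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.eval()
assert torch.equal(drop(x), x)  # eval mode: dropout is a no-op

drop.train()
y = drop(x)
assert (y == 0).any()         # train mode: about half the activations are zeroed
assert y.max().item() == 2.0  # survivors are scaled by 1/(1 - p)
```

A `.from_pretrained()` model left in eval mode therefore trains without any dropout regularization, with no error raised, which is exactly the kind of discrepancy that only shows up in the loss curves.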

@awaelchli
Contributor

No, it has nothing to do with stage 2. What I explained applies to the Trainer in general, for all strategies, not just DeepSpeed.

> But I guess the new behavior is less "state-mutating" and hence safer on average

Yes, that's the goal. I wasn't aware that HuggingFace's `.from_pretrained()` returns models in eval mode, but in retrospect it makes sense. Since Lightning is a general-purpose Trainer, it's unfortunate that HuggingFace users need this extra step, but I believe in the long term it's better and more flexible for everyone.

Regarding HuggingFace + Lightning, there are a few other things that need to be documented. We need a page in the docs, "Using HuggingFace models with Lightning", that explains these things, including the `.train()` call we discussed here.

@Boltzmachine

Boltzmachine commented Sep 19, 2024

Does it mean I should do something like this?

```python
def on_train_start(self):
    self.llm.train()

def on_train_end(self):
    self.llm.eval()
```
