
"backward pass is invalid for module in evaluation mode" with deepspeed stage 3 #19467

Open
olegsinavski opened this issue Feb 13, 2024 · 6 comments
Labels: question (Further information is requested)

Comments

@olegsinavski

Bug description

Hello,
After the 2.2 Lightning upgrade, and only with DeepSpeed stage 3, we experience a crash: `backward pass is invalid for module in evaluation mode`. It is most likely caused by the recent changes in train/eval mode switching.

What version are you seeing the problem on?

v2.2

How to reproduce the bug

Use the DeepSpeed stage 3 strategy.

Error messages and logs

  File "/pip-ai-experimental_torch/site-packages/torch/autograd/__init__.py", line 251, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
  File "/pip-ai-experimental_torch/site-packages/torch/autograd/function.py", line 288, in apply
    return user_fn(self, *args)
  File "/pip-core_deepspeed/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 170, in backward
    ctx.pre_backward_function(ctx.module)
  File "/pip-core_deepspeed/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/pip-core_deepspeed/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 447, in _run_before_backward_function
    self.pre_sub_module_backward_function(sub_module)
  File "/pip-ai-experimental_torch/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/pip-core_deepspeed/site-packages/deepspeed/runtime/zero/parameter_offload.py", line 524, in pre_sub_module_backward_function
    assert sub_module.training, "backward pass is invalid for module in evaluation mode"
AssertionError: backward pass is invalid for module in evaluation mode
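The assertion in the last frame is a one-line check inside DeepSpeed's ZeRO stage 3 offload code. A minimal sketch of what trips it, paraphrased in plain PyTorch (no DeepSpeed required):

```python
import torch.nn as nn

# A sub-module that ended up in eval mode, e.g. a model loaded that way.
sub_module = nn.Linear(2, 2)
sub_module.eval()

# Paraphrase of the check DeepSpeed runs before each sub-module's backward:
try:
    assert sub_module.training, "backward pass is invalid for module in evaluation mode"
    failed = False
except AssertionError as exc:
    failed = True
    message = str(exc)
```

Here `failed` is `True` and `message` matches the error in the traceback above, because the module's `training` flag is `False`.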

Environment

torch 2.1.2

More info

No response

@olegsinavski added the `bug` and `needs triage` labels on Feb 13, 2024
@awaelchli
Contributor

@olegsinavski This is unavoidable, unfortunately. You must have a module somewhere in your LightningModule that's in eval mode, and DeepSpeed makes the absurd decision of not letting you run the backward pass in that case (even though train/eval mode has nothing to do with gradient computation).

Maybe you have an `HFModel.from_pretrained()` call or similar that defaults to eval mode.

@awaelchli added the `question` label and removed the `bug` and `needs triage` labels on Feb 13, 2024
@olegsinavski
Author

olegsinavski commented Feb 13, 2024

Yes, I do actually have `HFModel.from_pretrained()`! Could you please elaborate on why it's unavoidable? From the perspective of a Lightning user, it does look like a behavioral regression in 2.2: on 2.1 it works fine, but on 2.2 stage 3 doesn't work (stage 2 still works). Since it worked on Lightning 2.1, there must have been something that made it work?

You're right: I just checked that `.from_pretrained()` produces a model with `model.training == False` for both stage 2 and stage 3. Why would it work for Lightning with stage 2, then?

Thanks a lot though; the workaround is to call `self.model.train()` after `HFModel.from_pretrained()`.

P.S. I use `configure_model` as opposed to the constructor.
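The workaround described above can be sketched with plain PyTorch; `TinyLM` is a hypothetical stand-in for a HuggingFace model, whose `from_pretrained()` hands back a module in eval mode:

```python
import torch.nn as nn

class TinyLM(nn.Module):
    """Hypothetical stand-in for a HuggingFace model:
    from_pretrained() returns it in eval mode by default."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(4, 4)
        self.eval()  # mimic the HuggingFace default

model = TinyLM()
assert not model.training  # as loaded: eval mode, stage 3 backward would crash

model.train()  # the workaround: switch back right after loading
assert model.training
```

In a LightningModule, the `model.train()` call would go right after the `from_pretrained()` call inside `configure_model` (or the constructor).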

@awaelchli
Contributor

awaelchli commented Feb 13, 2024

> Could you please elaborate why it's unavoidable?
> Since it worked on 2.1 Lightning, there was something that made it work?

HuggingFace models loaded with `.from_pretrained()` are in eval mode by default, because they are expected to be used for inference. To train, you need to call `.train()` on them. This is special to HuggingFace: by default in PyTorch, any `nn.Module` you create is in training mode.

In Lightning 2.1 and earlier, the Trainer called `model.train()` at the beginning of `.fit()`, which is why your HuggingFace model was silently converted to training mode. While in your case this might have been fine (or even desired), this automatic conversion had one big problem: if you had frozen layers in eval mode that you didn't want to train, the Trainer would silently convert them to train mode anyway, and there was no way for the user to know that. Overriding that behavior required implementing a special hook in Lightning, which was practically undiscoverable by users.

In 2.2, we removed this silent behavior so users don't fall into this trap. Now, if you train a model in Lightning 2.2+ with part of it frozen (e.g. a feature extractor that you don't train), the training mode you set will be respected. I consider this an important fix; it has personally bothered me for a long time, especially when users compared results trained with Lightning vs. raw PyTorch and saw a discrepancy in losses/accuracies due to this silent conversion.
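The frozen-layer trap described above can be illustrated with plain PyTorch; `Classifier` and its submodule names are hypothetical:

```python
import torch.nn as nn

class Classifier(nn.Module):
    def __init__(self):
        super().__init__()
        # Frozen feature extractor: no gradients, deterministic eval behaviour.
        self.backbone = nn.Sequential(nn.Linear(8, 8), nn.Dropout(0.5))
        self.backbone.requires_grad_(False)
        self.backbone.eval()
        self.head = nn.Linear(8, 2)  # the part we actually train

model = Classifier()
assert not model.backbone.training  # Lightning 2.2+ respects this during fit()
assert model.head.training

model.train()  # what Trainer <= 2.1 did silently at the start of fit()
assert model.backbone.training  # the frozen backbone got flipped too
```

The final `model.train()` call shows the old silent conversion: it recursively flips every submodule, including the backbone the user deliberately put in eval mode.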

I think one thing we could do, for visibility, is include a column in the `ModelSummary` (the table printed by default at the beginning of `fit`) that shows each module's training mode. This could help users sanity-check this better.

@olegsinavski
Author

Ok, thank you for the explanation! It totally makes sense! (Although I personally prefer `requires_grad=False` for the frozen bits.)
One last question: does it mean that with stage 2 the model is also in eval mode, and it's just that stage 2 doesn't have that check, so it doesn't crash?

I guess the reason stage 2 still works for me is that I don't have any dropout/batch-norm layers that change behavior depending on the training flag... So everyone needs to call `.train()` after `.from_pretrained()`.

If I understand correctly, some users might have silent bugs after the 2.2 upgrade if they used `.from_pretrained()` with dropout and batch norm. But I guess the new behavior is less "state-mutating" and hence safer on average.
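The silent-bug scenario above comes from dropout behaving differently depending on the training flag; a small illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

drop.eval()
assert torch.equal(drop(x), x)  # eval mode: dropout is a no-op

drop.train()
y = drop(x)
assert (y == 0).any()         # train mode: about half the activations are zeroed
assert y.max().item() == 2.0  # survivors are scaled by 1/(1 - p)
```

A `.from_pretrained()` model left in eval mode therefore trains without any dropout regularization, with no error raised, which is exactly the kind of discrepancy that only shows up in the loss curves.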

@awaelchli
Contributor

No, it has nothing to do with stage 2. What I explained applies to the Trainer in general, for all strategies, not just DeepSpeed.

> But I guess the new behavior is less "state-mutating" and hence safer on average

Yes, that's the goal. I wasn't aware that HuggingFace's `.from_pretrained()` returns models in eval mode, but in retrospect it makes sense. Since Lightning is a general-purpose Trainer, it's unfortunate that HuggingFace users need this extra step, but I believe in the long term it's better and more flexible for everyone.

Regarding HuggingFace + Lightning, there are a few other things that need to be documented. We need a page in the docs, "Using HuggingFace models with Lightning", that explains these things, including the `.train()` call we discussed here.

@Boltzmachine

Boltzmachine commented Sep 19, 2024

Does it mean I should do something like this?

```python
def on_train_start(self):
    self.llm.train()

def on_train_end(self):
    self.llm.eval()
```
