"backward pass is invalid for module in evaluation mode" with deepspeed stage 3 #19467
Comments
@olegsinavski This is unavoidable unfortunately. You must have a module somewhere in LightningModule that's in eval mode, and deepspeed makes this absurd decision to not let you backward (even though train/eval mode has nothing to do with gradient computation). Maybe you have a
Yes, I do actually have You're right, I just checked that Thanks a lot though, the workaround is to call P.S. I use
HuggingFace models loaded with In Lightning 2.1 and prior, Trainer called I think one thing we could do, for visibility, is to include a column in the ModelSummary (the table that prints by default at the beginning of fit) that shows the training mode. This could help users sanity check this better.
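Until such a column exists, the same sanity check can be sketched in plain PyTorch. This is only an illustration: `modules_in_eval_mode` is a hypothetical helper name, not a Lightning or DeepSpeed API, and the small `nn.Sequential` model stands in for a real LightningModule.

```python
import torch.nn as nn


def modules_in_eval_mode(model: nn.Module) -> list:
    """Return the names of all submodules whose `training` flag is False."""
    # named_modules() yields the root module under the empty name "".
    return [name or "<root>" for name, m in model.named_modules() if not m.training]


# Toy model standing in for a LightningModule with one submodule left in eval mode.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(0.1), nn.Linear(4, 2))
model[1].eval()

print(modules_in_eval_mode(model))
```

Calling the helper right before training starts (for example from a hook such as `on_train_start`) would list any submodule that is still in eval mode.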
Ok, thank you for the explanation! Totally makes sense! (although I personally prefer "requires_grad=False" for frozen bits). I guess the reason stage 2 still works for me is because I don't have any dropout/batch_norms that change behavior depending on If I understand correctly, some users might have silent bugs if the used
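To illustrate the distinction being discussed, here is a minimal sketch in plain PyTorch (no Lightning or DeepSpeed involved, so the stage 3 check from the issue title is not exercised). The names `frozen` and `head` are made up for the example. It shows that freezing parameters with `requires_grad_(False)` and switching a module to eval mode are independent, and that autograd itself does not object to a backward pass through a module in eval mode.

```python
import torch
import torch.nn as nn

frozen = nn.Linear(4, 4)  # stands in for a frozen pretrained block
head = nn.Linear(4, 1)    # the trainable part

# Freezing parameters does not touch the train/eval flag.
frozen.requires_grad_(False)
print(frozen.training)

# Eval mode does not touch requires_grad, and plain PyTorch still allows backward.
frozen.eval()
loss = head(frozen(torch.randn(2, 4))).sum()
loss.backward()

print(head.weight.grad is not None)  # gradient reached the trainable head
print(frozen.weight.grad is None)    # frozen parameters received no gradient
```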
No, it has nothing to do with stage 2. What I explained applies in general in the Trainer for all strategies, not just deepspeed.
Yes that's the goal. I wasn't aware that huggingface Regarding huggingface+lightning, there are a few other things that need to be documented. We need a page in the docs "Using HuggingFace models with Lightning" that explains these things, including the
Does it mean I should do something like this?

    def on_train_start(self):
        self.llm.train()

    def on_train_end(self):
        self.llm.eval()
Bug description
Hello,
After upgrading to Lightning 2.2, and only with DeepSpeed stage 3, we experience a crash: "backward pass is invalid for module in evaluation mode". It is most likely caused by the recent changes in train/eval mode switching.
What version are you seeing the problem on?
v2.2
How to reproduce the bug
Error messages and logs
Environment
torch 2.1.2
More info
No response