Remove unnecessary assert on sub_module.training #5215

Conversation

@ringohoffman (Contributor) commented Mar 1, 2024

Related: Lightning-AI/pytorch-lightning#19467

Why were these asserts added? nn.Module.training is for controlling forward() behavior. I have never seen it used to control backward() behavior, let alone raise an error because of it.

To me and the users in the ticket I linked, these checks are very unexpected. It seems like training is being co-opted for something beyond what it was originally intended for. If I just put my whole model into train mode before calling backward, I stop seeing these errors. How is training expected to be set on partially frozen models and why?
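For context, here is a minimal sketch (the class and module names are made up) of the usual division of labour: .training gates forward() behaviour such as dropout, while freezing is expressed through requires_grad, and plain PyTorch happily runs backward through a frozen, eval-mode submodule:

```python
import torch
import torch.nn as nn

class PartiallyFrozenModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(8, 8), nn.Dropout(p=0.5))  # frozen
        self.head = nn.Linear(8, 1)                                        # trainable

    def forward(self, x):
        return self.head(self.backbone(x))

model = PartiallyFrozenModel()
model.backbone.requires_grad_(False)  # freezing: no gradients for the backbone
model.backbone.eval()                 # .training only changes forward() (disables dropout)

loss = model(torch.randn(4, 8)).sum()
loss.backward()                       # backward runs fine through the eval-mode backbone
```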

@JakobLS commented Mar 2, 2024

Removing these asserts allows me to launch my training script, though I am not yet aware of whether it has any consequences further down the line.

@ringohoffman marked this pull request as draft March 8, 2024 17:49
@championsnet commented

I tried removing these asserts as well, but now I get an error later on, in parameter_offload.py:
File "/.local/lib64/python3.11/site-packages/deepspeed/runtime/zero/partitioned_param_coordinator.py", line 316, in fetch_sub_module
    assert param.ds_status == ZeroParamStatus.AVAILABLE, param.ds_summary()

@Boltzmachine commented

Has this been merged yet? It is really ridiculous to have such an assertion.

@ringohoffman (Contributor, Author) commented

> Has this been merged yet? It is really ridiculous to have such an assertion.

Last I checked, it doesn't work even if you remove the assertion. That is why I gave up on this.

@tjruwase (Contributor) commented

> Has this been merged yet? It is really ridiculous to have such an assertion.

> Last I checked, it doesn't work even if you remove the assertion. That is why I gave up on this.

@ringohoffman, @Boltzmachine, apologies for missing this.

First, I want to affirm your observations:

  1. The assertion is incorrect and a problem for the backward pass over frozen weights.
  2. We hijacked .training for the module prefetching optimization by having separate prefetchers for (1) the forward+backward trace (i.e., training) and (2) the forward-only trace (i.e., eval/inference).
  3. It makes sense that removing these assertions uncovers problems previously unknown to us, since we have not tested those execution paths. We had previously assumed the .train() workaround would make things work (a short illustration of its side effects follows this list). However, creating gradients on frozen weights is not a long-term solution, so this requires fixing.
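As a small illustration (ordinary PyTorch, not DeepSpeed code, and the layer choices are arbitrary) of why the .train() workaround only papers over the problem: putting the whole model into train mode silences the conflation, but it also flips the behaviour of the frozen layers, e.g. a frozen BatchNorm starts updating its running statistics again:

```python
import torch
import torch.nn as nn

frozen = nn.Sequential(nn.Linear(8, 8), nn.BatchNorm1d(8))
frozen.requires_grad_(False)            # freeze the weights
frozen.eval()                           # desired: fixed running statistics
model = nn.Sequential(frozen, nn.Linear(8, 1))

model.train()                           # the workaround: whole model in train mode

before = frozen[1].running_mean.clone()
model(torch.randn(4, 8)).sum().backward()
print(torch.equal(before, frozen[1].running_mean))  # False: frozen stats drifted anyway
```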

Second, here are my thoughts for next steps:

  1. Remove these assertions and the misuse of .training for the prefetching optimization. This will enable correct handling of backward on frozen weights and avoid user confusion (such as that discussed in the Lightning forum).
  2. Fix the issues arising from the above change. @ringohoffman, are you able to revive this PR or share those failures? I am curious whether disabling prefetching would address the failures you observed.
  3. Improve prefetching robustness:
    1. Find a different way to distinguish the forward+backward trace from the forward-only trace (one possible signal is sketched after this list).
    2. Eliminate prefetching across the forward/backward boundary; instead, have separate prefetchers for the forward and backward traces. We need to understand the performance implications of this.
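On point 3.1, one possible signal, sketched below purely as a strawman (this is not existing DeepSpeed code), is to decide whether a backward pass can even follow the current forward from grad mode and requires_grad rather than from .training:

```python
import torch
import torch.nn as nn

def backward_may_follow(sub_module: nn.Module, *inputs) -> bool:
    # A backward pass can only follow this forward if autograd is recording
    # and something in the computation actually requires grad.
    if not torch.is_grad_enabled():
        return False
    if any(p.requires_grad for p in sub_module.parameters()):
        return True
    return any(isinstance(t, torch.Tensor) and t.requires_grad for t in inputs)
```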

We would really appreciate your help with the above plan. We also understand this might no longer be a priority for you.

@tohtana, FYI
