Why are we not multiplying the LM Head FLOPs per iteration with the `checkpoint_activations_factor`?

Megatron-LM/megatron/utils.py, line 253 in bd0aaba

AFAIK the factor of 4 means 1 forward, 2 backward & 1 forward, where the last forward is needed to recompute the checkpointed activations. Don't we also need all 4 for the LM Head? cc @RaymondLi0 @NouamaneTazi
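For concreteness (my own FLOPs counting, not quoted from the code): the LM Head is a single `[B·s, h] × [h, V]` matmul, so a forward pass costs about `2·B·s·h·V` FLOPs and the backward about `4·B·s·h·V`, i.e. `6·B·s·h·V` at factor 3. Applying the factor of 4 would add a recomputed forward, for `8·B·s·h·V` total.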
In selective recomputation, we only checkpoint attention, so it shouldn't affect the LM Head.
In full recomputation, we assume that the last `hidden_states` are the LM Head activations, so there's no need to recompute them.
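A minimal sketch of the accounting this implies (an illustration, not the code at line 253; the function name, argument names, and the exact transformer term are assumptions):

```python
# Sketch of the FLOPs accounting described above -- an illustration,
# not the code at megatron/utils.py line 253. All names are assumptions.

def estimate_flops_per_iteration(batch_size, seq_len, num_layers,
                                 hidden_size, vocab_size,
                                 full_recompute=False):
    # Forward = 1x, backward = 2x; full recomputation re-runs the forward
    # once more, hence the factor of 4 (otherwise 3).
    checkpoint_activations_factor = 4 if full_recompute else 3

    # Transformer blocks: 24*B*s*L*h^2 dense FLOPs per forward pass, plus
    # the attention-score correction s/(6h) (standard Megatron-style estimate).
    transformer_flops = (24 * checkpoint_activations_factor * batch_size
                         * seq_len * num_layers * hidden_size ** 2
                         * (1.0 + seq_len / (6.0 * hidden_size)))

    # LM Head: one [B*s, h] @ [h, V] matmul -> 2*B*s*h*V forward FLOPs.
    # It stays at factor 3 (forward + backward only): its input, the last
    # hidden_states, is kept even under full recomputation, so its forward
    # is never re-run.
    lm_head_flops = 3 * 2 * batch_size * seq_len * hidden_size * vocab_size

    return transformer_flops + lm_head_flops
```

Under this accounting the LM Head contributes factor 3 in both modes, which matches the explanation above.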