```
loss 3.28 = 3.28 + 0.0 avg prob of [Rishi Sunak] 0.0498
loss nan = nan + nan avg prob of [Rishi Sunak] nan
loss nan = nan + nan avg prob of [Rishi Sunak] nan
loss nan = nan + nan avg prob of [Rishi Sunak] nan
```
The gradient of the delta weight becomes NaN after the first backward pass.
It may be caused by the ALiBi position encoding in the current implementation of the Baichuan-13B model. The ALiBi position encoding does not accept an attention mask, so it is incompatible with left-padding. We are trying to fix it by re-implementing the Baichuan-13B model.
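For illustration, here is a minimal NumPy sketch (not the actual Baichuan code; the function names and the toy all-zero scores are made up) of why an attention path that ignores the padding mask misbehaves with left-padding: the padded key positions still receive attention weight unless they are pushed to a large negative value before the softmax.

```python
import numpy as np

def alibi_bias(n, slope):
    # ALiBi adds a linear penalty proportional to key-query distance:
    # bias[i, j] = slope * (j - i), added to the scores before softmax.
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return slope * (j - i)

def attention_weights(n, slope, pad_mask=None):
    # Toy scores of all zeros, so only the bias and masks matter.
    scores = alibi_bias(n, slope)
    causal = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(causal, -1e9, scores)  # causal mask
    if pad_mask is not None:
        # pad_mask[j] == True marks a padding key position.
        scores = np.where(pad_mask[None, :], -1e9, scores)
    # Numerically stable softmax over the key dimension.
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)
    return w / w.sum(axis=-1, keepdims=True)

# Left-padded sequence of length 4: key positions 0 and 1 are <pad>.
pad = np.array([True, True, False, False])

w_no_mask = attention_weights(4, slope=0.5)           # mask ignored
w_masked = attention_weights(4, slope=0.5, pad_mask=pad)

# Without the padding mask, the first real token (row 2) spends
# almost half of its attention on padding; with it, none.
print(w_no_mask[2, :2].sum())  # large, ~0.49
print(w_masked[2, :2].sum())   # ~0.0
```

Using `-1e9` rather than `-inf` also avoids the softmax producing NaN on rows where every key is masked out, which is one common source of NaN losses.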
We caught a runtime error when running the script.
I suspect it may be related to the ALiBi attention masks of Baichuan-13B.