Word2vec: loss tally maxes at 134217728.0 due to float32 limited-precision #2735
Comments
Thanks for the details, & especially the observation that the stagnant loss-total is exactly 2^27 (134217728.0). Unfortunately the loss-calculation feature was only half-thought-out, & incompletely implemented, with pending necessary fixes & improvements (per #2617). It'd be interesting to know if, in your setup that reproduces the issue:
|
OK, looks like the original implementation of loss tracking chose to use a 32-bit float. Limited-precision floating-point numbers of course become 'coarser' as they get further from zero. Representative weirdness:
In [1]: import numpy as np
In [2]: a = np.ndarray(1, dtype=np.float32)
In [3]: a[0] = 134217728.0
In [4]: a[0] # it's already an unexpected displayed value
Out[4]: 134217730.0
In [5]: a[0] = a[0] + 2.0
In [6]: a[0] # adding a small value did nothing
Out[6]: 134217730.0
In [7]: np.nextafter(a[0], np.finfo(np.float32).max) # next possible larger float32 is 10 higher
Out[7]: 134217740.0
In [8]: np.nextafter(a[0], 0) # next possible smaller float32
Out[8]: 134217727.99999999
Fixing the existing implementation to be a per-epoch tally would make the problem far less likely to occur (but a sufficiently large epoch might still trigger it). Using a higher-precision type for the tally would make it even less likely.
@tsaastam, given this, I'd expect the answers to my "interesting to know" bullet-points are...
...but it'd be good to hear if that's the case for you. |
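To make the stall concrete, here is a minimal standalone NumPy sketch (an illustration added for clarity, not code from the thread): near 2**27 the spacing between adjacent float32 values is 16.0, so a typical per-example loss of a few units simply disappears when added to a float32 tally, while a float64 tally keeps it.

import numpy as np

# Near 2**27 the gap between adjacent float32 values is 16.0, so adding a
# typical per-example loss of a few units to a float32 tally changes nothing.
tally32 = np.float32(2 ** 27)                 # 134217728.0, where the reported loss stalls
print(np.spacing(tally32))                    # 16.0: distance to the next representable float32
print(np.float32(tally32 + 5.0) == tally32)   # True: the small increment is silently dropped

# The same addition into a float64 tally is preserved exactly.
tally64 = np.float64(2 ** 27)
print(tally64 + 5.0)                          # 134217733.0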
I've updated the code to this to better see what's going on:
Annoyingly, on this laptop I'm using right now, the loss stays at a very high level for the first 40 epochs, at around 1.8 million, whereas on the PC it had gone down to about 680k by epoch 40. The only difference is the updated loss calculator as above and the number of workers. Anyway, assuming that's all benign, the good news is that training is still occurring after the reported loss goes to zero:
Your explanation makes sense - it didn't look like an overflow at first glance, as the loss changes pretty massively between epochs; but of course the cumulative loss is computed with lots of tiny updates, each of which is getting ignored as you've detailed. So that seems to be the cause. I'm not sure why the training is much slower on the laptop (i.e. the loss is much higher after 40 full epochs), but that probably isn't related to this. In case it matters, here are the versions of things on the laptop:
Not sure why the NumPy and SciPy versions are slightly different (on Windows they were 1.17.3 and 1.3.1). |
@tsaastam If you try setting model.running_training_loss = 0.0 at the end of every epoch (via a callback), the reported per-epoch loss should stay in a range where float32 precision isn't an issue. (Also, separate from this bug: that's a lot of epochs! The last few epoch-deltas before the problem already show epoch loss jittering up-and-down; you may already be past the point, possibly far past the point, where more epochs are doing any good.) |
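For concreteness, the reset suggested above amounts to something like this minimal callback sketch (a sketch only; CallbackAny2Vec, get_latest_training_loss() and running_training_loss are the gensim hook, method and attribute used elsewhere in this thread):

from gensim.models.callbacks import CallbackAny2Vec

class ResetLossEachEpoch(CallbackAny2Vec):
    # Log the tally at the end of each epoch, then zero it so it never climbs
    # into the range where float32 spacing exceeds the per-example losses.
    def on_epoch_end(self, model):
        print('epoch loss:', model.get_latest_training_loss())
        model.running_training_loss = 0.0

Passing callbacks=[ResetLossEachEpoch()] to model.train(...) is enough to apply it; fuller versions appear later in this thread.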
You're probably right about additional epochs not helping; my concern though is that on the Windows machine the loss drops fairly quickly from about 6 million to around 700k - here between epochs 8 and 9: (That image is with the bug, so the drop at the end is due to the cumulative loss suddenly dropping to zero - ignore that part.) Anyway, with the laptop, with the same data, the loss is still staying at around 1.8 million by epoch 39. Of course I now realise it's not quite the same code, since on the laptop I was running the version that measures the magnitude of the word vectors after each epoch... but that shouldn't interfere with the training? I might need to investigate this seeming training discrepancy a bit more, then maybe open a separate issue about it if it seems like a real thing. On the loss issue: taking your advice and adding a per-epoch reset of model.running_training_loss:
And (running on the laptop again), the loss updates now seem to work fine:
After epoch 5 there, the accumulated loss is around 170 million, which is of course more than the 134ish million where the problem occurred before. So your workaround is good. (The loss still goes down much more slowly here than on the Windows PC earlier, as I said I need to investigate that a bit more.) |
It's good to know that running with that per-epoch reset gives sensible loss reporting. Not sure what could be causing your other possibly-anomalous behavior – that very-slight loss-improvement in your last "on the laptop again" figures seems fishy, especially for early epochs where the effective learning-rate (alpha) is at its largest. |
@tsaastam |
@Cartman0 What do you mean? Are you encountering an error or expecting some efficiency problem? (Behind the scenes, I believe the value in the Python object is being copied into a C structure for the Cython code, then copied back out after that code tallies all the tiny errors. So that C type, still just a 32-bit float, will be most relevant for the tallying behavior. That's probably still too coarse overall for accurate & robust loss-reporting – so this per-epoch reset to 0.0 is a workaround, not a full fix.) |
@gojomo I thought it might cause a type problem, because a plain Python float (0.0) is being used to set running_training_loss. |
@Cartman0 I’m not sure what you mean here by ‘type problem’. What chain-of-operations could lead to a bad result? (It’s possible there’s a problem – as noted above, the loss-calculation feature was only half-implemented to begin with.) |
@gojomo Sorry, I wrote that confusingly. I also suspect some information loss from repeatedly accumulating into a 32-bit float, since the tally maxes out at 2^27. |
@Cartman0 The actual loss computation, and tallying, occurs in the Cython code – which is compiled to C/C++ & has a fixed 32-bit float type for the running tally. |
@gojomo OK, thanks – I'm not familiar with Cython. So the functions actually computing the loss are the skip-gram and CBOW routines? |
@Cartman0 Yes, those 2 functions, and also the negative-sampling versions of those functions, are where the per-example loss is computed and added to the running tally. |
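For a sense of the magnitudes being tallied, here is a rough NumPy sketch of the standard skip-gram negative-sampling objective (an illustration, not the actual Cython routines; the vector values are made up): each (center, context) example contributes only a few nats, far below the 16.0 spacing of float32 near 2**27, which is why such increments vanish once the tally reaches that range.

import numpy as np

def sg_neg_example_loss(v_context, u_target, u_negatives):
    # Skip-gram negative-sampling loss for one training example:
    # -log sigmoid(u_o . v_c) - sum_k log sigmoid(-u_k . v_c)
    sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
    loss = -np.log(sigmoid(np.dot(u_target, v_context)))
    loss -= np.sum(np.log(sigmoid(-u_negatives @ v_context)))
    return loss

rng = np.random.default_rng(0)
v_c = rng.normal(scale=0.1, size=100)          # a hypothetical context vector
u_o = rng.normal(scale=0.1, size=100)          # a hypothetical target (output) vector
u_neg = rng.normal(scale=0.1, size=(5, 100))   # 5 hypothetical negative samples
print(sg_neg_example_loss(v_c, u_o, u_neg))    # roughly 4 nats for near-random vectors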
@gojomo Setting model.running_training_loss = 0.0 at the end of each epoch changes the reported loss numbers. Sample callback:

import logging
from gensim.models.callbacks import CallbackAny2Vec

class LossReportCallback(CallbackAny2Vec):
    def __init__(self, reset_loss=False):
        self.epoch = 1
        self.previous_cumulative_loss = 0
        self.reset_loss = reset_loss

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        if self.reset_loss:
            # zero the tally so next epoch's reading is already a per-epoch value
            model.running_training_loss = 0.0
        else:
            # derive the per-epoch loss from the cumulative tally
            loss_now = loss - self.previous_cumulative_loss
            self.previous_cumulative_loss = loss
            loss = loss_now
        if self.epoch % 5 == 0:
            logging.info(f'loss after epoch {self.epoch}: {loss}')
        self.epoch += 1

The output if executed with reset_loss=False:
The output if executed with reset_loss=True:
|
Yes, that workaround is absolutely expected to change the reported loss numbers, as precision will no longer be lost due to the tally reaching representational extremes. Improving the loss numbers is the whole point of the workaround. You are reporting suggestive evidence that the workaround works. It shouldn't have any effect on the quality of training results, as this tally (whether for all-epochs or one-epoch) isn't consulted for any model-adjustment steps. |
pnezis reports that the epoch-wise loss changes when resetting running_training_loss. If we do not reset it, the model loss continues to decrease and it seems like successful training. Doesn't this mean that model training (or parameter updating) is affected by running_training_loss? |
@DaikiTanak - The running loss tally is very buggy without a per-epoch reset. For large enough training sets, it might also suffer precision issues in a single epoch. But the actual training that happens, on individual (context->word) examples, is the same either way. That's not affected by this running loss tally in any way. Only the reporting-out is changing. And any reported-out tally of aggregate loss is not a measure of model quality, only model 'convergence' (reaching a point where it can't, given its structure/state, be optimized any more). A model with a higher loss tally might be better on real world problems; an embedding model with a 0.0 loss is likely broken (severely overfit). |
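One way to check empirically that the reset changes only reporting (a sketch assuming the gensim 3.x API used elsewhere in this thread): train the same small corpus twice with workers=1 and a fixed seed, resetting the tally in only one run, and compare the learned vectors; they should come out identical, since the reset touches nothing but the reported attribute.

import numpy as np
from gensim.models import word2vec
from gensim.models.callbacks import CallbackAny2Vec

class ResetLoss(CallbackAny2Vec):
    def on_epoch_end(self, model):
        model.running_training_loss = 0.0   # touches only the reported tally

corpus = [["this", "is", "a", "dog"], ["he", "is", "a", "student"]] * 200

def train(callbacks):
    # workers=1 plus a fixed seed keeps a single-process run deterministic
    m = word2vec.Word2Vec(size=20, min_count=1, window=5, workers=1, seed=42,
                          iter=5, compute_loss=True)
    m.build_vocab(corpus)
    m.train(corpus, total_examples=m.corpus_count, epochs=m.iter,
            compute_loss=True, callbacks=callbacks)
    return m

m_plain = train([])
m_reset = train([ResetLoss()])
print(np.allclose(m_plain.wv.vectors, m_reset.wv.vectors))   # expected: True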
@gojomo Thank you for the kind explanation. I understand that the reported loss can be used to judge model convergence, and that as a best practice we should reset the running loss each epoch as described. For example, from the above figure (epoch vs. epoch-wise loss with running_training_loss reset each epoch), can we say where the model has converged? |
All I'd say for sure is that resetting each epoch is better than not. As mentioned, large-enough epochs might show the same bug within a single epoch. Not knowing what code/data generated that graph, it's hard for me to endorse any idea of what it means. (The gradual trend 'up' from x=50 to x=200 is suspicious.) |
Thanks @gojomo, the above graph is generated by the following code.
import copy
import time
import numpy as np
import matplotlib.pyplot as plt
from gensim.models import word2vec
from gensim.models.callbacks import CallbackAny2Vec

class callback(CallbackAny2Vec):
    '''Callback for Word2vec with resetting loss on the end of each epoch.'''
    def __init__(self):
        self.epoch = 1
        self.losses = []
        self.cumu_loss = 0.0
        self.previous_epoch_time = time.time()
        self.best_model = None
        self.best_loss = 1e+30

    def on_epoch_end(self, model):
        loss = model.get_latest_training_loss()
        norms = [np.linalg.norm(v) for v in model.wv.vectors]
        now = time.time()
        epoch_seconds = now - self.previous_epoch_time
        self.previous_epoch_time = now
        self.cumu_loss += float(loss)
        print(f"Loss after epoch {self.epoch}: {loss} (cumulative loss so far: {self.cumu_loss}) "
              f"-> epoch took {round(epoch_seconds, 2)} s - vector norms min/avg/max: "
              f"{round(float(min(norms)), 2)}, {round(float(sum(norms)/len(norms)), 2)}, "
              f"{round(float(max(norms)), 2)}")
        self.epoch += 1
        self.losses.append(float(loss))
        # reset loss inside model
        model.running_training_loss = 0.0
        # keep a copy of the model with the lowest per-epoch loss seen so far
        if loss < self.best_loss:
            self.best_model = copy.deepcopy(model)
            self.best_loss = loss
        if self.epoch % 50 == 0:
            self.plot(path="../model/word2vec/w2v_training_loss.png")

    def plot(self, path):
        fig, (ax1) = plt.subplots(ncols=1, figsize=(6, 6))
        ax1.plot(self.losses, label="loss per epoch")
        plt.legend()
        plt.savefig(path)
        plt.close()
        print("Plotted loss.")

model = word2vec.Word2Vec(
    size=100,
    min_count=1,
    window=5,
    workers=4,
    sg=1,
    seed=46,
    iter=200,
    compute_loss=True,
)

print("building vocabulary...")
# sentence corpus is like: [["this", "is", "a", "dog"], ["he", "is", "a", "student"], ... ]
model.build_vocab(sentence_corpus)

print("training Word2Vec...")
callbacker = callback()
model.train(
    sentence_corpus,
    epochs=model.iter,
    total_examples=model.corpus_count,
    compute_loss=True,
    callbacks=[callbacker],
) |
I don't see any specific reason the per-epoch loss might be trending up, in your code, but a few other notes: (1) for reasons previously alluded to, and the fact that the tally doesn't include the effects of late-in-epoch adjustments on early-in-epoch examples, the model with the lowest end-of-epoch loss tally is not necessarily 'best'; (2) I've never actually tried copy.deepcopy() on a model mid-training, so I'm not sure it reliably captures all of the model's state. |
@gojomo @tsaastam @DaikiTanak @pnezis Hi! I have the same issue: right from the start of training I get a loss of 134217728.0 after each epoch (constantly). What solution was finally settled on?
@loveis98 The bugs limiting the usefulness/interpretability of the loss tally have not been fixed yet. Manually resetting the tally to 0.0 at the end of each epoch, via a callback as shown earlier in this thread, remains the practical workaround.
Cumulative loss of word2vec maxes out at 134217728.0
I'm training a word2vec model with 2,793,404 sentences / 33,499,912 words, vocabulary size 162,253 (words with at least 5 occurrences).
Expected behaviour: with compute_loss=True, gensim's word2vec should compute the loss in the expected way.
Actual behaviour: the cumulative loss seems to be maxing out at 134217728.0:
And it stays at 134217728.0 thereafter. The value 134217728.0 is of course exactly 128*1024*1024, which does not seem like a coincidence.
Steps to reproduce
My code is as follows:
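The original script is not reproduced in this extract. As a stand-in, a representative gensim 3.x setup matching the configuration described in the report (compute_loss=True, min_count=5) might look roughly like the following; the corpus variable, dimensionality, window, worker count and epoch count are all illustrative assumptions, and per-epoch loss would be read via a callback like the ones shown elsewhere in this thread.

from gensim.models import word2vec

# 'sentences' stands in for the tokenized Finnish news corpus (a list of token lists).
sentences = [["this", "is", "a", "placeholder", "sentence"]] * 1000

model = word2vec.Word2Vec(
    size=100,           # assumed
    window=5,           # assumed
    min_count=5,        # matches "words with at least 5 occurrences"
    workers=4,          # assumed
    iter=60,            # assumed; the report mentions 40+ epochs
    compute_loss=True,  # the loss-reporting option this issue concerns
)
model.build_vocab(sentences)
model.train(
    sentences,
    total_examples=model.corpus_count,
    epochs=model.iter,
    compute_loss=True,
)
print(model.get_latest_training_loss())   # the cumulative tally that stalls at 2**27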
The data is a news article corpus in Finnish; I'm not at liberty to share all of it (and anyway it's a bit big), but it looks like one would expect:
Versions
The output of the version-reporting snippet is:
Finally, I'm not the only one who has encountered this issue. I found the following related links:
https://groups.google.com/forum/#!topic/gensim/IH5-nWoR_ZI
https://stackoverflow.com/questions/59823688/gensim-word2vec-model-loss-becomes-0-after-few-epochs
I'm not sure if this is only a display issue and the training continues normally even after the cumulative loss reaches its "maximum", or if the training in fact stops at that point. The trained word vectors seem reasonably ok, judging by my_model.wv.evaluate_word_analogies(), though they do need more training than this.