Some modules we should implement:
- `CrossEntropy` or `CE`. Should this cover both dense and sparse targets, or do we want a separate module for the sparse case, like `SparseCrossEntropy` or so? Should this potentially also allow for logits? Log-probs? (See the sketch after this list.)
- `KL` or `KullbackLeiblerDivergence`
- `BinaryCrossEntropy` or `BCE`
- `L2Dist` (absolute or mean?) Or `MSE` or `MeanSquaredError`? (The mean reduction is over the feature axis, not over time or batch.)
- `L1Dist` (absolute or mean?) Or `MeanL1Dist`?
- `Ctc` or `CTC` or `CtcLogProb`
- `CosineSimilarity`
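
To make the dense/sparse and logits/log-probs question more concrete, here is a minimal sketch (plain NumPy, not a proposed module API; the function name and flags are only illustrative):

```python
import numpy as np


def cross_entropy(target, output, *, target_is_sparse, output_is_logits):
    """Per-frame cross entropy. output: [B, D]; target: [B, D] (dense) or [B] (sparse).
    Returns a loss of shape [B]. Names and signature are only for illustration."""
    if output_is_logits:
        # convert logits to log-probs (numerically stable log-softmax)
        output = output - np.max(output, axis=-1, keepdims=True)
        output = output - np.log(np.sum(np.exp(output), axis=-1, keepdims=True))
    if target_is_sparse:
        # target holds class indices
        return -np.take_along_axis(output, target[:, None], axis=-1)[:, 0]
    # target holds a probability distribution per frame
    return -np.sum(target * output, axis=-1)
```

Whether this should be one module with such flags, or separate modules like `SparseCrossEntropy`, is exactly the open question above.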
I don't like the naming of the PyTorch losses too much here.
They all have the postfix Loss, even though these modules are generic and not necessarily just for loss computation (although that is probably their most common usage).
Also, `CrossEntropyLoss` is actually log-softmax + CE together, so very much like the TF `tf.nn.sparse_softmax_cross_entropy_with_logits`.
And there is a separate `NLLLoss`, which is just like `CrossEntropyLoss` but takes log-probs instead of logits. I find this naming confusing.
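
For reference, a small sketch of that relation, using the standard `torch.nn.functional` API:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)           # [B, D]
targets = torch.randint(0, 10, (8,))  # sparse class indices, [B]

# CrossEntropyLoss == log-softmax followed by NLLLoss:
ce = F.cross_entropy(logits, targets, reduction="none")
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets, reduction="none")
assert torch.allclose(ce, nll)
```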
Also, the question is how we should handle things like label smoothing. On the RETURNN side (and also in TF), it is just an option to the CE loss. On the PyTorch side, it was not part of the official PyTorch API for a long time (some background here); it was only very recently added (pytorch/pytorch#7455, pytorch/pytorch#63122), as an option `label_smoothing` to `CrossEntropyLoss`. An alternative would be that the user makes this more explicit, like:
```python
target_prob_smooth = smooth_one_hot(targets, label_prob=0.9)
loss = cross_entropy(target_prob_smooth, out_prob)
```
Label smoothing has become very common, though, so maybe it makes sense to have this also just as an option.
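
For the explicit variant, `smooth_one_hot` could look roughly like this (just a sketch; the exact smoothing convention, e.g. whether the smoothing mass is spread over all classes or only the wrong ones, would still need to be decided):

```python
import numpy as np


def smooth_one_hot(targets, *, num_classes, label_prob=0.9):
    """Smoothed one-hot encoding: the true class gets label_prob, the remaining
    probability mass is spread uniformly over the other classes.
    (One possible convention; name and signature are only illustrative.)"""
    off_value = (1.0 - label_prob) / (num_classes - 1)
    out = np.full(targets.shape + (num_classes,), off_value)
    np.put_along_axis(out, targets[..., None], label_prob, axis=-1)
    return out
```

With the option variant, it would instead be something like a `label_smoothing` argument on the CE module itself, analogous to the new PyTorch argument.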
Note also that the loss accumulation over the dataset and the calculation of the correct average (mean) are handled by RETURNN. All such losses would just yield a tensor of shape [B] or [B,T].
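
Just to illustrate the kind of normalization meant here (a sketch, assuming per-frame losses of shape [B,T] and sequence lengths [B]):

```python
import numpy as np

loss = np.random.rand(3, 7)        # per-frame losses, [B, T]
seq_lens = np.array([7, 5, 2])     # sequence lengths, [B]

# mask out padded frames, then normalize by the total number of real frames
mask = np.arange(loss.shape[1])[None, :] < seq_lens[:, None]  # [B, T]
mean_loss = np.sum(loss * mask) / np.sum(seq_lens)
```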