Some modules we should implement:
- `CrossEntropy` or `CE`. Should this cover both dense and sparse targets, or do we want a separate module for the sparse case, like `SparseCrossEntropy` or so? Should this potentially also allow for logits? Log-probs? (See the sketch after this list.)
- `KL` or `KullbackLeiblerDivergence`
- `BinaryCrossEntropy` or `BCE`
- `L2Dist` (absolute or mean?) Or `MSE` or `MeanSquaredError`? (The mean reduction is over the feature axis, not over time or batch.)
- `L1Dist` (absolute or mean?) Or `MeanL1Dist`?
- `Ctc` or `CTC` or `CtcLogProb`
- `CosineSimilarity`
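
To make the dense/sparse and logits/log-probs question more concrete, here is a minimal sketch (plain NumPy, not a proposed module API; the function name and flags are only illustrative):

```python
import numpy as np


def cross_entropy(target, output, *, target_is_sparse, output_is_logits):
    """Per-frame cross entropy. output: [B, D]; target: [B, D] (dense) or [B] (sparse).
    Returns a loss of shape [B]. Names and signature are only for illustration."""
    if output_is_logits:
        # convert logits to log-probs (numerically stable log-softmax)
        output = output - np.max(output, axis=-1, keepdims=True)
        output = output - np.log(np.sum(np.exp(output), axis=-1, keepdims=True))
    if target_is_sparse:
        # target holds class indices
        return -np.take_along_axis(output, target[:, None], axis=-1)[:, 0]
    # target holds a probability distribution per frame
    return -np.sum(target * output, axis=-1)
```

Whether this should be one module with such flags, or separate modules like `SparseCrossEntropy`, is exactly the open question above.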
I don't like the naming of the PyTorch losses too much here.
They all have the postfix Loss, even though these modules are generic and not necessarily just for loss computation (although that is probably their most common usage).
Also, `CrossEntropyLoss` is actually log-softmax + CE together, so very much like the TF `tf.nn.sparse_softmax_cross_entropy_with_logits`.
And there is a separate `NLLLoss`, which is just like `CrossEntropyLoss` but takes log-probs instead of logits. I find this naming confusing.
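
For reference, a small sketch of that relation, using the standard `torch.nn.functional` API:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)           # [B, D]
targets = torch.randint(0, 10, (8,))  # sparse class indices, [B]

# CrossEntropyLoss == log-softmax followed by NLLLoss:
ce = F.cross_entropy(logits, targets, reduction="none")
nll = F.nll_loss(F.log_softmax(logits, dim=-1), targets, reduction="none")
assert torch.allclose(ce, nll)
```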
Also, the question is how we should handle things like label smoothing. On the RETURNN side (and also in TF), it is just an option to the CE loss. On the PyTorch side, it was not part of the official PyTorch API for a long time (some background here); it was only very recently added (pytorch/pytorch#7455, pytorch/pytorch#63122), as an option `label_smoothing` to `CrossEntropyLoss`. An alternative would be that the user makes this more explicit, like:
```python
target_prob_smooth = smooth_one_hot(targets, label_prob=0.9)
loss = cross_entropy(target_prob_smooth, out_prob)
```
Label smoothing has become very common, though, so maybe it makes sense to have this also just as an option.
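
For the explicit variant, `smooth_one_hot` could look roughly like this (just a sketch; the exact smoothing convention, e.g. whether the smoothing mass is spread over all classes or only the wrong ones, would still need to be decided):

```python
import numpy as np


def smooth_one_hot(targets, *, num_classes, label_prob=0.9):
    """Smoothed one-hot encoding: the true class gets label_prob, the remaining
    probability mass is spread uniformly over the other classes.
    (One possible convention; name and signature are only illustrative.)"""
    off_value = (1.0 - label_prob) / (num_classes - 1)
    out = np.full(targets.shape + (num_classes,), off_value)
    np.put_along_axis(out, targets[..., None], label_prob, axis=-1)
    return out
```

With the option variant, it would instead be something like a `label_smoothing` argument on the CE module itself, analogous to the new PyTorch argument.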
Note also that the loss accumulation over the dataset and the calculation of the correct average (mean) are handled by RETURNN. All such losses would just yield a tensor of shape [B] or [B,T].
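
Just to illustrate the kind of normalization meant here (a sketch, assuming per-frame losses of shape [B,T] and sequence lengths [B]):

```python
import numpy as np

loss = np.random.rand(3, 7)        # per-frame losses, [B, T]
seq_lens = np.array([7, 5, 2])     # sequence lengths, [B]

# mask out padded frames, then normalize by the total number of real frames
mask = np.arange(loss.shape[1])[None, :] < seq_lens[:, None]  # [B, T]
mean_loss = np.sum(loss * mask) / np.sum(seq_lens)
```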