
Unexpected Losses with Sample Weights #19740

@seandaug

Description


When using sample_weight with the default reduction sum_over_batch_size, the computed losses are technically correct (they are the weighted sum divided by the batch size), but they are not what someone would want them to be. They are computed by summing loss * sample_weight and dividing by the number of items in the tensor (after any mask is applied). That is, they are not computed by dividing by the sum of the sample weights.

For example,

import numpy as np
import keras

keras.losses.MeanAbsoluteError()(
    y_true=np.array([[1.0], [2.0]]),
    y_pred=np.array([[2.0], [3.0]]),
    sample_weight=np.array([[0.0], [1.0]]),
).numpy()

returns 0.5, not 1.0. (The denominator of the calculation is 2.0 because there are two samples in the batch, even though one of them has a sample weight of 0.0.)
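To make the arithmetic explicit, here is a plain NumPy sketch of the two possible reductions for the example above (both per-sample absolute errors are 1.0; the variable names are only illustrative):

import numpy as np

abs_err = np.abs(np.array([2.0, 3.0]) - np.array([1.0, 2.0]))  # [1.0, 1.0]
weights = np.array([0.0, 1.0])
weighted = abs_err * weights                                   # [0.0, 1.0]

weighted.sum() / len(weighted)   # 0.5, what the loss currently returns
weighted.sum() / weights.sum()   # 1.0, the weighted mean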

Notably, the metric version

keras.metrics.MeanAbsoluteError()(
    y_true=np.array([[1.0], [2.0]]),
    y_pred=np.array([[2.0], [3.0]]),
    sample_weight=np.array([[0.0], [1.0]]),
).numpy()

returns 1.0, as one would expect, because it divides by the sum of the sample weights.

The metric version uses the keras.src.utils.metrics_utils.Reduction() value weighted_mean by default (not sum_over_batch_size). However, keras.losses.Reduction() has no such equivalent, so the loss computes a different value from the associated metric during training.

This is a long-standing issue; I verified it in both Keras 2.15.0 (TensorFlow 2.15.0) and Keras 3.3.3 (TensorFlow 2.16.1).
https://colab.research.google.com/drive/1TRBeOE79kfxPwz1-C60N3IjXeLSUbgST?usp=sharing

Should someone either change the default behavior to the weighted mean (dividing by the sum of the sample weights) or add another loss reduction option that enables this? I think this is a significant issue that affects neural network training.
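In the meantime, one possible workaround is sketched below. This is not an existing Keras reduction; weighted_mean_loss is a made-up helper, and the sketch assumes Keras 3 (reduction=None and keras.ops). It builds the loss unreduced and normalizes by the sum of the sample weights manually:

import numpy as np
import keras

def weighted_mean_loss(per_sample_loss_fn, y_true, y_pred, sample_weight):
    # per_sample_loss_fn must return unreduced per-sample losses,
    # e.g. a Keras loss constructed with reduction=None
    per_sample = per_sample_loss_fn(y_true, y_pred)
    # Keras losses compute in float32 by default; flatten and cast the weights to match
    weights = keras.ops.cast(keras.ops.reshape(sample_weight, (-1,)), "float32")
    # divide the weighted sum by the sum of the weights, not by the batch size
    return keras.ops.sum(per_sample * weights) / keras.ops.sum(weights)

mae_per_sample = keras.losses.MeanAbsoluteError(reduction=None)
weighted_mean_loss(
    mae_per_sample,
    y_true=np.array([[1.0], [2.0]]),
    y_pred=np.array([[2.0], [3.0]]),
    sample_weight=np.array([[0.0], [1.0]]),
)  # 1.0, matching keras.metrics.MeanAbsoluteError

This only helps when calling the loss directly; plugging it into model.fit would require wrapping it so that compile() still accepts it, which is beyond this sketch.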

Note that when a mask is applied, the function keras.utils.losses_utils.apply_valid_mask excludes masked items from the loss calculation by setting their sample weights to 0.0 and adjusting the denominator to count only the items in the tensor that pass through the mask. Therefore, in the special case where all sample weights are 1.0 but some items are masked out, the adjusted denominator has the same effect as dividing by the sum of the sample weights rather than by the "batch size". Thus, in this one special (but likely common) case, the output is what one would expect. It just doesn't work out that way when some of the included sample weights differ from 1.0.
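For concreteness, here is a small NumPy sketch of that masked special case (the numbers are made up for illustration):

import numpy as np

abs_err = np.array([1.0, 3.0, 5.0])      # per-sample losses
weights = np.array([1.0, 1.0, 1.0])      # all sample weights are 1.0
mask    = np.array([True, True, False])  # third item is masked out

# masking zeroes out the masked weights and counts only unmasked
# items in the denominator
weighted = abs_err * weights * mask
weighted.sum() / mask.sum()              # 2.0, equal to the weighted mean

# with a non-unit sample weight the two denominators diverge again
weights = np.array([1.0, 0.5, 1.0])
weighted = abs_err * weights * mask
weighted.sum() / mask.sum()              # 1.25, not 2.5 / 1.5 ≈ 1.67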
