🐛 Bug
The current implementation of GeneralizedDiceScore yields a score of 0.0 for samples that don't contain a particular class when calculating class-wise metrics via per_class=True. This drags down the dice scores of rare classes in particular and makes the dice scores incomparable between classes.
To Reproduce
The following code sample calculates class-wise scores of tensor([0.2500, 0.2500, 0.0000]), even though all predictions match the targets:
Code sample
import torch
from torchmetrics.segmentation import GeneralizedDiceScore

N_SAMPLES = 4
N_CLASSES = 3

# One-hot targets/predictions: class 0 fills sample 0, class 1 fills
# sample 2, and class 2 never occurs. Predictions equal targets exactly.
target = torch.zeros((N_SAMPLES, N_CLASSES, 128, 128), dtype=torch.int8)
preds = torch.zeros((N_SAMPLES, N_CLASSES, 128, 128), dtype=torch.int8)
target[0, 0], preds[0, 0] = 1, 1
target[2, 1], preds[2, 1] = 1, 1

generalized_dice = GeneralizedDiceScore(num_classes=N_CLASSES, per_class=True, include_background=True)
print(generalized_dice(preds, target))  # tensor([0.2500, 0.2500, 0.0000])
Expected behavior
I'd expect the above code sample to return [1.0, 1.0, nan] for the class-wise scores (nan for the third class: since that class is not present in any of the samples, returning a score of 1.0 might also be misleading). Also, samples in which a class doesn't occur should not contribute to the dice score of that class, as sketched below.
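For illustration, here is a minimal sketch of the averaging I have in mind (per_class_dice is just a hypothetical helper, not the TorchMetrics API): compute the dice score per sample and class, then average each class only over the samples in which it actually has support.

import torch

N_SAMPLES, N_CLASSES = 4, 3
target = torch.zeros((N_SAMPLES, N_CLASSES, 128, 128), dtype=torch.int8)
preds = torch.zeros((N_SAMPLES, N_CLASSES, 128, 128), dtype=torch.int8)
target[0, 0], preds[0, 0] = 1, 1
target[2, 1], preds[2, 1] = 1, 1

def per_class_dice(preds, target):
    # Reduce the spatial dimensions, keeping shape (N_SAMPLES, N_CLASSES).
    intersection = (preds * target).flatten(2).sum(-1)
    support = preds.flatten(2).sum(-1) + target.flatten(2).sum(-1)
    # 0 / 0 yields nan wherever a class is absent from a sample.
    dice = 2 * intersection / support
    # nanmean skips those samples; a class absent from every sample stays nan.
    return dice.nanmean(dim=0)

print(per_class_dice(preds, target))  # tensor([1., 1., nan])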
Environment
- TorchMetrics version (if built from source, add commit SHA): 1.6.0
- Python & PyTorch Version (e.g., 1.0): 3.11.10
- Any other relevant information such as OS (e.g., Linux): macOS 15.1.1 (24B91)
Additional context
Very similar to issue #2850.