
State Changes from List to Tensor After Calling Compute() #2005

Closed
donglihe-hub opened this issue Aug 17, 2023 · 4 comments · Fixed by #2061
Labels
bug / fix Something isn't working documentation Improvements or additions to documentation help wanted Extra attention is needed v0.10.x

donglihe-hub commented Aug 17, 2023

🐛 Bug

I was logging a custom Metric result using the logger provided by LightningModule. The code runs on 2 GPUs with DDP. When a list state whose dist_reduce_fx is "cat" was synchronized by calling compute(), I noticed that the behavior in training_step() and validation_step() differed.

In training_step(), where on_step=True and on_epoch=False (the default for training), the state was always a list after calling compute(). However, in validation_step(), where on_step=False and on_epoch=True (the default for validation), the state became a Tensor. This behavior is not explained in the docs.

To Reproduce

Metric and Logging Results

Code

import logging

import torch
from torch import Tensor
from torchmetrics import Metric
from torchmetrics.functional.retrieval import retrieval_normalized_dcg


class MyMetric(Metric):
    def __init__(self, top_k):
        super().__init__()
        self.top_k = top_k
        self.add_state("state_1", default=[], dist_reduce_fx="cat")

    def update(self, preds: Tensor, target: Tensor):
        assert preds.shape == target.shape
        self.state_1 += [self._metric(p, t) for p, t in zip(preds, target)]

    def compute(self):
        if isinstance(self.state_1, list):
            logging.warning(f"self.state_1.size(): {len(self.state_1)}")
            return torch.stack(self.state_1).mean()
        logging.warning(f"self.state_1.size(): {self.state_1.size()}")
        return self.state_1.mean()

    def _metric(self, preds: Tensor, target: Tensor):
        return retrieval_normalized_dcg(preds, target, k=self.top_k).float()

Log Outputs
Training:
WARNING:self.state_1.size(): 40

Validation:
WARNING:self.state_1.size(): torch.Size([80])
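The 40-vs-80 discrepancy above can be mimicked without torch or DDP. A minimal sketch, where the hypothetical helper `cat_reduce` stands in for the "cat" reduction that torchmetrics applies to list states during synchronization:

```python
def cat_reduce(per_process_states):
    """Stand-in for dist_reduce_fx="cat": concatenate each process's
    list state into one flat sequence (analogous to a 1-D tensor)."""
    merged = []
    for state in per_process_states:
        merged.extend(state)
    return merged

# Two DDP processes, each holding 40 per-sample scores in a list state.
rank0_state = [0.5] * 40
rank1_state = [0.7] * 40

# on_step logging: compute() sees only the local, unsynced list state.
print(len(rank0_state))                             # 40

# on_epoch logging: sync concatenates both lists into one flat state.
print(len(cat_reduce([rank0_state, rank1_state])))  # 80
```

This matches the logs: 40 local list entries during training-step logging, and a flat state of 80 elements (40 per process, concatenated) after epoch-end synchronization.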

Expected behavior

The state should remain what it was defined as in add_state(), i.e., a list.

Environment

  • TorchMetrics version (and how you installed TM, e.g. conda, pip, build from source):
    0.10.3 (conda-forge)
  • Python & PyTorch Version (e.g., 1.0):
    3.10.12
  • Any other relevant information such as OS (e.g., Linux):
    pytorch: 1.13.1
    pytorch-lightning: 1.9.4
    Linux: 5.15.0-78-generic


@donglihe-hub donglihe-hub added bug / fix Something isn't working help wanted Extra attention is needed labels Aug 17, 2023
@github-actions

Hi! Thanks for your contribution, great first issue!

@donglihe-hub donglihe-hub changed the title State Change from List to Tensor During Synchronization State Change from List to Tensor After Calling Compute() Aug 17, 2023
@donglihe-hub donglihe-hub changed the title State Change from List to Tensor After Calling Compute() State Changes from List to Tensor After Calling Compute() Aug 17, 2023
@SkafteNicki
Member

@donglihe-hub, thanks for raising this issue.
I definitely see the problem that you have pointed out; the core problem with changing the behaviour is backwards compatibility. This behaviour has been present since the start of torchmetrics, and internally we have not really considered it a problem. The solution in our own metrics is simply to call the dim_zero_cat function in compute:

from torchmetrics.metric import Metric
from torchmetrics.utilities.data import dim_zero_cat
class MyMetric(Metric):
    ...
    def compute(self):
        state_1 = dim_zero_cat(self.state_1)  # the returned value will always be a tensor

which ensures that everything is a tensor before doing any computation.

cc @justusschock: opinions on this issue? Is it something we should change?
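The normalize-then-reduce pattern suggested above can be illustrated without torch. A minimal sketch, using a hypothetical `compute_mean` helper in place of a real compute() method:

```python
def compute_mean(state_1):
    """Sketch of the dim_zero_cat-then-reduce pattern: accept either the
    raw list state (before sync) or the flat sequence that a "cat"
    reduction produces (after sync), and reduce both the same way."""
    flat = state_1 if isinstance(state_1, list) else list(state_1)
    return sum(flat) / len(flat)

print(compute_mean([1.0, 2.0, 3.0]))       # before sync: list state -> 2.0
print(compute_mean((1.0, 2.0, 3.0, 4.0)))  # after sync: flat state -> 2.5
```

The point of the pattern is that compute() never branches on whether synchronization has already happened: the state is first coerced into one canonical form, and the reduction is written once against that form.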


donglihe-hub commented Aug 24, 2023


Thank you for your answer. I think this should be mentioned in IMPLEMENTING A METRIC? Otherwise users have no way of knowing it before they actually run their code and hit the "bug".

@Borda Borda added the v0.10.x label Aug 25, 2023
@SkafteNicki
Member

@donglihe-hub you are completely right.
That page has not been updated in a long time and definitely needs an overhaul. Let me try to figure out some more examples, and hopefully we can also fix the issue you reported in #2022.

@SkafteNicki SkafteNicki added the documentation Improvements or additions to documentation label Aug 26, 2023
@SkafteNicki SkafteNicki added this to the v1.2.0 milestone Aug 26, 2023
@SkafteNicki SkafteNicki self-assigned this Aug 26, 2023