
List Metric synchronization fails in corner case #2463

Closed
danny-schwartz7 opened this issue Mar 19, 2024 · 1 comment · Fixed by #2468

Labels
bug / fix Something isn't working help wanted Extra attention is needed v1.3.x

Comments


danny-schwartz7 commented Mar 19, 2024

🐛 Bug

Empty list metrics bypass the call to `torch.distributed.barrier()` during `sync()` and cause deadlocks or other strange synchronization issues.

To Reproduce

Define the ListMetric class as follows:

from typing import Any, Callable, Optional

import torch
import torchmetrics


class ListMetric(torchmetrics.Metric):
    full_state_update: bool = False

    def __init__(self):
        super().__init__(sync_on_compute=False)
        self.add_state("example_state", default=[], dist_reduce_fx="cat")

    def update(self, x: torch.Tensor) -> None:
        # in this example, x is a (2,)-shaped tensor and the internal state
        # of this metric is a list of tensors of this shape
        self.example_state.append(x)

    def compute(self):
        return self.example_state

    def sync(
        self,
        dist_sync_fn: Optional[Callable] = None,
        process_group: Optional[Any] = None,
        should_sync: bool = True,
        distributed_available: Optional[Callable] = None,
    ) -> None:
        super().sync(
            dist_sync_fn=dist_sync_fn,
            process_group=process_group,
            should_sync=should_sync,
            distributed_available=distributed_available,
        )

        tensor_and_empty = (
            isinstance(self.example_state, torch.Tensor)
            and torch.numel(self.example_state) == 0
        )
        list_and_empty = (
            isinstance(self.example_state, list) and len(self.example_state) == 0
        )
        if tensor_and_empty or list_and_empty:
            self.example_state = []
            return
        """
        Torchmetrics has a strange quirk:
        Depending on whether it's being run in a distributed setting, the type of 'self.resolutions' may differ.
        The `dim_zero_cat()` function fixes this.

        see https://lightning.ai/docs/torchmetrics/v1.3.1/pages/implement.html "working with list states"
        """
        self.example_state = torchmetrics.utilities.dim_zero_cat(self.example_state)
        self.example_state = list(torch.split(self.example_state, 2))

In your training loop, do this with 2 or more GPUs:

def on_validation_epoch_end(self, outputs):
    lm = ListMetric()
    if self.local_rank == 0:
        lm.update(torch.zeros((2,)))
    lm.sync()  # every rank except local_rank == 0 skips the barrier inside this call to `sync()`, and the script deadlocks
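
An interim workaround I can think of (just a sketch under the assumption that the root cause is the empty-list shortcut described under "Additional context" below, not an official fix) is to make sure every rank contributes at least one element before calling sync():

def on_validation_epoch_end(self, outputs):
    lm = ListMetric()
    if self.local_rank == 0:
        lm.update(torch.zeros((2,)))
    else:
        # placeholder entry (assumption: downstream code recognizes and drops
        # NaN rows) so the list state is non-empty on every rank and each rank
        # reaches the collective call inside sync()
        lm.update(torch.full((2,), float("nan")))
    lm.sync()  # all ranks now reach the barrier together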

Expected behavior

Empty list metrics should properly sync (waiting on the barrier) with non-empty list metrics on other GPUs.

Environment

Linux, torchmetrics 1.3.0.post0 from pip install

Additional context

It seems like this is caused by the line `if reduction_fn == dim_zero_cat and isinstance(input_dict[attr], list) and len(input_dict[attr]) > 1:` in `Metric._sync_dist()`, which causes `return [function(x, *args, **kwargs) for x in data]` in `lightning_utilities.core.apply_func` to execute. Since `function` contains the blocking call that waits on the barrier and the list is empty, no blocking ever occurs for GPUs whose metric states are an empty list.
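
To illustrate the control flow, here is a minimal standalone sketch (hypothetical helper names, not torchmetrics' actual implementation): because the gathering function is applied per list element, an empty list means the collective call is never reached on that rank while the other ranks block waiting for it.

import torch
import torch.distributed as dist

def gather_with_barrier(t: torch.Tensor) -> torch.Tensor:
    # stand-in for the dist_sync_fn applied to each list element; it contains
    # the collective call that all ranks must reach together
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    return torch.cat(gathered)

def sync_list_state(state: list) -> list:
    # mirrors the apply-over-list behaviour described above: with an empty
    # list the comprehension is a no-op, gather_with_barrier() is never called,
    # and this rank skips the collective while other ranks wait on it
    return [gather_with_barrier(x) for x in state]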

danny-schwartz7 added the bug / fix and help wanted labels on Mar 19, 2024

Hi! Thanks for your contribution, great first issue!
