
List Metric synchronization fails in corner case #2463

Closed
danny-schwartz7 opened this issue Mar 19, 2024 · 1 comment · Fixed by #2468

Labels
bug / fix Something isn't working help wanted Extra attention is needed v1.3.x

Comments


danny-schwartz7 commented Mar 19, 2024

🐛 Bug

Empty list metrics bypass the call to `torch.distributed.barrier()` during `sync()` and cause deadlocks or other strange synchronization issues.

To Reproduce

Define the ListMetric class as follows:

from typing import Any, Callable, Optional

import torch
import torchmetrics


class ListMetric(torchmetrics.Metric):
    full_state_update: bool = False

    def __init__(self):
        super().__init__(sync_on_compute=False)
        self.add_state("example_state", default=[], dist_reduce_fx="cat")

    def update(self, x: torch.Tensor) -> None:
        # in this example, x is a (2,)-shaped tensor and the internal state
        # of this metric is a list of tensors of this shape
        self.example_state.append(x)

    def compute(self):
        return self.example_state

    def sync(
        self,
        dist_sync_fn: Optional[Callable] = None,
        process_group: Optional[Any] = None,
        should_sync: bool = True,
        distributed_available: Optional[Callable] = None,
    ) -> None:
        super().sync(
            dist_sync_fn=dist_sync_fn,
            process_group=process_group,
            should_sync=should_sync,
            distributed_available=distributed_available,
        )

        tensor_and_empty = (
            isinstance(self.example_state, torch.Tensor)
            and torch.numel(self.example_state) == 0
        )
        list_and_empty = (
            isinstance(self.example_state, list) and len(self.example_state) == 0
        )
        if tensor_and_empty or list_and_empty:
            self.example_state = []
            return
        """
        Torchmetrics has a strange quirk:
        Depending on whether it's being run in a distributed setting, the type of 'self.resolutions' may differ.
        The `dim_zero_cat()` function fixes this.

        see https://lightning.ai/docs/torchmetrics/v1.3.1/pages/implement.html "working with list states"
        """
        self.example_state = torchmetrics.utilities.dim_zero_cat(self.example_state)
        self.example_state = list(torch.split(self.example_state, 2))

In your training loop, do this with 2 or more GPUs:

def on_validation_epoch_end(self, outputs):
    lm = ListMetric()
    if self.local_rank == 0:
        lm.update(torch.zeros((2,)))
    lm.sync()  # every rank except local_rank == 0 skips the barrier inside this call to `sync()`, and the script deadlocks
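
An interim workaround I can think of (just a sketch under the assumption that the root cause is the empty-list shortcut described under "Additional context" below, not an official fix) is to make sure every rank contributes at least one element before calling sync():

def on_validation_epoch_end(self, outputs):
    lm = ListMetric()
    if self.local_rank == 0:
        lm.update(torch.zeros((2,)))
    else:
        # placeholder entry (assumption: downstream code recognizes and drops
        # NaN rows) so the list state is non-empty on every rank and each rank
        # reaches the collective call inside sync()
        lm.update(torch.full((2,), float("nan")))
    lm.sync()  # all ranks now reach the barrier together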

Expected behavior

Empty list metrics should properly sync (waiting on the barrier) with non-empty list metrics on other GPUs.

Environment

Linux, torchmetrics 1.3.0.post0 from pip install

Additional context

It seems like this is caused by the line `if reduction_fn == dim_zero_cat and isinstance(input_dict[attr], list) and len(input_dict[attr]) > 1:` in `Metric._sync_dist()`, which causes `return [function(x, *args, **kwargs) for x in data]` in `lightning_utilities.core.apply_func` to execute. Since `function` contains the blocking call that waits on the barrier and the list is empty, no blocking ever occurs for GPUs whose metric states are an empty list.
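
To illustrate the control flow, here is a minimal standalone sketch (hypothetical helper names, not torchmetrics' actual implementation): because the gathering function is applied per list element, an empty list means the collective call is never reached on that rank while the other ranks block waiting for it.

import torch
import torch.distributed as dist

def gather_with_barrier(t: torch.Tensor) -> torch.Tensor:
    # stand-in for the dist_sync_fn applied to each list element; it contains
    # the collective call that all ranks must reach together
    gathered = [torch.zeros_like(t) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, t)
    return torch.cat(gathered)

def sync_list_state(state: list) -> list:
    # mirrors the apply-over-list behaviour described above: with an empty
    # list the comprehension is a no-op, gather_with_barrier() is never called,
    # and this rank skips the collective while other ranks wait on it
    return [gather_with_barrier(x) for x in state]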

danny-schwartz7 added the bug / fix and help wanted labels on Mar 19, 2024

Hi! Thanks for your contribution, great first issue!
