Raise an error if sync_dist is set when logging torchmetrics #13903
Conversation
An alternate solution (which I think is better) could be to set these values (sync_dist, reduce_fx, sync_dist_group) for the torchmetrics instance internally, with an info message, without raising any such errors. For this, we might have to make some changes to the Metric class. Thoughts? @awaelchli @carmocca
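A minimal sketch of what that alternative could look like, assuming a hypothetical helper in the logging path (the helper name and hook point are illustrative, not part of this PR):

```python
import logging

from torchmetrics import Metric

log = logging.getLogger(__name__)


def _drop_sync_args_for_metric(value, log_kwargs: dict) -> dict:
    """Hypothetical helper: if the logged value is a torchmetrics Metric, strip the
    distributed-sync arguments and let the Metric handle synchronization itself."""
    if isinstance(value, Metric):
        for key in ("sync_dist", "reduce_fx", "sync_dist_group"):
            if key in log_kwargs:
                log.info(
                    f"`{key}` was passed while logging a torchmetrics instance and will be "
                    "ignored; the Metric object synchronizes its state internally."
                )
                log_kwargs.pop(key)
    return log_kwargs
```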
Why is it better? Failing should be fine and it makes it very clear that they shouldn't be passed.
> we might have to make some changes to the Metric class
Why?
```python
and not isinstance(value, Tensor)
and (isinstance(value, Metric) or any(isinstance(v, Metric) for v in value.values()))
):
    raise MisconfigurationException(
```
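For context, a self-contained approximation of the check under review; the function name and error message are placeholders, and the surrounding condition is inferred from the visible diff lines rather than copied from the PR:

```python
from torch import Tensor
from torchmetrics import Metric

from pytorch_lightning.utilities.exceptions import MisconfigurationException


def _check_sync_dist_with_metric(value, sync_dist: bool) -> None:
    # `value` is either a single logged value or a dict of logged values, mirroring
    # the `value.values()` call in the diff above.
    contains_metric = isinstance(value, Metric) or (
        isinstance(value, dict) and any(isinstance(v, Metric) for v in value.values())
    )
    if sync_dist and not isinstance(value, Tensor) and contains_metric:
        raise MisconfigurationException(
            "`sync_dist` should not be set when logging torchmetrics instances; "
            "they synchronize their state across processes internally."
        )
```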
I would personally put this in the LightningModule.log method, where we have all the other validation checks.

Just for better UX. Otherwise the user has to take care of setting this in two different places if they are using both torchmetrics and tensors. Even if they are only using torchmetrics, it makes it easy for them since PL will handle it, and it's their intention too; both are Lightning products. The changes that will be required are just the setters/getters for these arguments.
I look at this from a different angle. The default for sync_dist in the log method is False, which, when left unset, indicates that we will never communicate with other processes to compute the metric. When logging a TorchMetric, however, this is clearly not the case. On the other hand, setting sync_dist=True explicitly and logging a TorchMetric does make sense, even if it doesn't trigger the exact same logic. It makes it appear that the flag is being "ignored", and raising an error here would prevent anyone from combining it with logging both a scalar and a TorchMetric. One can argue either way; I think it is up to us to determine whether this is considered a convenience or a misleading setting. If we decide to produce an error as proposed in this PR, we would additionally need to change the default from False to None and also error when it gets explicitly set to False.
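As a sketch of that last point (simplified signature, not the actual LightningModule.log source), the default would become Optional so that an explicit False can also be rejected:

```python
from typing import Any, Optional

from torchmetrics import Metric

from pytorch_lightning.utilities.exceptions import MisconfigurationException


class _LogSketch:
    def log(self, name: str, value: Any, sync_dist: Optional[bool] = None, **kwargs: Any) -> None:
        # None means "not passed"; any explicit value (True or False) combined with a
        # torchmetrics Metric would raise under this proposal.
        if isinstance(value, Metric) and sync_dist is not None:
            raise MisconfigurationException(
                f"`sync_dist={sync_dist}` was passed when logging `{name}`, but torchmetrics "
                "instances handle synchronization internally."
            )
        ...
```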
@awaelchli

self.log('some_val', scalar_val, sync_dist=True)  <- does not raise an error
self.log('some_tm_instance', torchmetric_instance, sync_dist=True)  <- raises an error

which means users should unset sync_dist for the second log call. I can add the key name to the error message to make it more explicit. And this can easily be caused with

Also, the only reason I raised this PR is to avoid the silent behavior here; I have seen concerns regarding this over Slack/Discussions.
My comment was about log_dict, when different types of values get submitted in a dict together.
Do you agree with this or not?
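For reference, the kind of call being discussed looks roughly like this (the model and metric choice are illustrative, not taken from the PR):

```python
import torch
import torchmetrics
import pytorch_lightning as pl


class LitModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(32, 1)
        self.accuracy = torchmetrics.classification.BinaryAccuracy()

    def validation_step(self, batch, batch_idx):
        x, y = batch
        logits = self.layer(x).squeeze(-1)
        loss = torch.nn.functional.binary_cross_entropy_with_logits(logits, y.float())
        self.accuracy(torch.sigmoid(logits), y)
        # A single log_dict call mixing a plain tensor and a Metric: with the proposed
        # check, sync_dist=True would raise because of "val_acc", even though it is a
        # sensible setting for "val_loss".
        self.log_dict({"val_loss": loss, "val_acc": self.accuracy}, sync_dist=True)
```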
Yes, in that case the user has to separate them and make two calls. That's why I suggested this: #13903 (comment), which would end up with no errors.

Why do we need to raise an error when it's False? With DDP, we are not forcing them to always sync.
When using torchmetrics, it will sync whether you want it or not, internally inside torchmetrics. One can choose whether it happens additionally on step or not. A user logging a metric and setting

Honestly, I wasn't sure about some of the statements I made earlier and had a hard time looking at the source code behind our result objects. I don't feel very confident; perhaps some of these checks are already done internally, but I wouldn't know where to look.
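For reference, the per-step behavior mentioned here is controlled on the torchmetrics side, e.g. via the dist_sync_on_step argument (the metric choice is arbitrary):

```python
import torchmetrics

# In a distributed run, a torchmetrics Metric synchronizes its state across processes
# at compute() time by default; dist_sync_on_step=True additionally syncs on every
# update/forward step, independently of whatever is passed to self.log(..., sync_dist=...).
acc_step_synced = torchmetrics.classification.BinaryAccuracy(dist_sync_on_step=True)
acc_compute_synced = torchmetrics.classification.BinaryAccuracy()  # sync only at compute()
```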
True, sorry, I noticed this later. We can make it
we can determine that based on
if you are talking about:
This is a valuable improvement but low priority for the team. I created an issue (#15588) to track this idea so that somebody can pick it up. Closing the PR.
What does this PR do?
Currently, sync_dist is ignored when torchmetrics instances are logged.

Does your PR introduce any breaking changes? If yes, please list them.
Before submitting
PR review
Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines.
Did you have fun?
Make sure you had fun coding 🙃