Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[tune] Treat checkpoints with nan value as worst #23862

Merged
merged 6 commits into from
Apr 14, 2022

Conversation

Yard1
Copy link
Member

@Yard1 Yard1 commented Apr 12, 2022

Why are these changes needed?

Changes the logic in CheckpointManager to consider checkpoints with nan value of the metric as worst values, meaning they will be deleted first if keep_checkpoints_num is set.

Related issue number

Closes #23856

Checks

  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@Yard1
Copy link
Member Author

Yard1 commented Apr 12, 2022

cc @XuehaiPan

@XuehaiPan
Copy link
Contributor

It needs similar changes with ray/train/checkpoint.py.

Copy link
Contributor

@amogkam amogkam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @Yard1 and thanks for raising the issue @XuehaiPan!

python/ray/train/checkpoint.py Outdated Show resolved Hide resolved
python/ray/tune/checkpoint_manager.py Outdated Show resolved Hide resolved
python/ray/tune/checkpoint_manager.py Outdated Show resolved Hide resolved
@Yard1 Yard1 requested a review from amogkam April 12, 2022 16:52
python/ray/train/checkpoint.py Show resolved Hide resolved
@@ -5,7 +5,7 @@
from typing import Any, Callable, Optional

from ray.tune.result import NODE_IP
from ray.tune.utils.util import flatten_dict
from ray.tune.utils.util import flatten_dict, is_nan
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the ml_utils is_nan directly and remove it from tune.utils?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's fine to use an alias - we have a precedent for that already.

Copy link
Contributor

@krfricke krfricke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally good, one suggestion

python/ray/tune/checkpoint_manager.py Outdated Show resolved Hide resolved
@amogkam
Copy link
Contributor

amogkam commented Apr 13, 2022

This looks great @Yard1! Can we resolve the conflicts, and then I can merge!

Comment on lines +205 to +207
if self._checkpoint_score_desc:
priority = -priority
return (not is_nan(priority), priority, checkpoint.order)
Copy link
Contributor

@XuehaiPan XuehaiPan Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When priority is nan, sorting by tuple key:

(not is_nan(priority), priority, checkpoint.order)

won't give the correct order by checkpoint.order. Because both nan < nan and nan > nan return False.

Suggested change
if self._checkpoint_score_desc:
priority = -priority
return (not is_nan(priority), priority, checkpoint.order)
if self._checkpoint_score_desc:
priority = -priority
if is_nan(priority):
return (0, checkpoint.order, priority)
return (1, priority, checkpoint.order)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it does:

>>> import numpy as np
>>> (False, np.nan, 3) < (False, np.nan, 4)
True
>>> (False, np.nan, 4) < (False, np.nan, 3)
False

Copy link
Contributor

@XuehaiPan XuehaiPan Apr 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually it does:

>>> import numpy as np
>>> (False, np.nan, 3) < (False, np.nan, 4)
True
>>> (False, np.nan, 4) < (False, np.nan, 3)
False

Seems that tuple.__lt__ skips items when lhs[i] is rhs[i].

In [1]: (False, float('nan'), 3) < (False, float('nan'), 4)
Out[1]: False

In [2]: (False, float('nan'), 3) > (False, float('nan'), 4)
Out[2]: False

In [3]: (False, float('nan'), 4) < (False, float('nan'), 3)
Out[3]: False

In [4]: (False, float('nan'), 4) > (False, float('nan'), 3)
Out[4]: False

In [5]: import numpy as np

In [6]: (False, np.nan, 3) < (False, np.nan, 4)
Out[6]: True

In [7]: (False, np.nan, 3) > (False, np.nan, 4)
Out[7]: False

In [8]: import math

In [9]: (False, math.nan, 3) < (False, math.nan, 4)
Out[9]: True

In [10]: (False, math.nan, 3) > (False, math.nan, 4)
Out[10]: False

In [11]: float('nan') is float('nan')
Out[11]: False

In [12]: np.nan is np.nan
Out[12]: True

In [13]: math.nan is math.nan
Out[13]: True

In [14]: float('nan') is math.nan
Out[14]: False

In [15]: math.nan is np.nan
Out[15]: False

In [16]: (False, math.nan, 3) < (False, np.nan, 4)
Out[16]: False

np.nan is a single variable, but each call of float('nan') will create a new variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah thanks, you're right. It seems indeed like float('nan') <= float('nan') is False, unlike for np.

Fix here: #23909

@krfricke krfricke merged commit 52eaf02 into ray-project:master Apr 14, 2022
@Yard1 Yard1 deleted the fix_nan_best_checkpoint branch April 14, 2022 09:38
krfricke added a commit that referenced this pull request Apr 14, 2022
Following #23862, there was an uncaught bug when comparing nan-priority checkpoints. This is because float("nan") <= float("nan") is always False (unlike e.g. np.nan <= np.nan, which is True).

This PR fixes this bug and adds a new test to ensure correct behavior.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Tune][Train] [Bug] nan metric filtering in checkpoint managers
5 participants