[tune] Treat checkpoints with nan value as worst #23862

Yard1 · 2022-04-12T12:49:41Z

Why are these changes needed?

Changes the logic in CheckpointManager to treat checkpoints whose metric value is nan as the worst ones, so they are deleted first when keep_checkpoints_num is set.
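As a rough illustration of the intended behavior (not the actual CheckpointManager code), the sketch below keeps the best `keep_num` entries and drops nan-scored ones first. The function name `prune_checkpoints`, the `(name, score)` pairs, and the "higher is better" convention are assumptions made for this example only.

```python
import math
from typing import List, Tuple


def prune_checkpoints(
    scored: List[Tuple[str, float]], keep_num: int
) -> List[Tuple[str, float]]:
    """Return the `keep_num` best (name, score) pairs; nan counts as worst."""

    def key(item: Tuple[str, float]) -> Tuple[int, float]:
        _, score = item
        # A leading 0 places nan scores before every real score,
        # so they are the first candidates for deletion.
        if math.isnan(score):
            return (0, 0.0)
        return (1, score)

    ranked = sorted(scored, key=key)  # worst first, best last
    # Keep only the tail of the ranking; max() guards against keep_num > len.
    return ranked[max(len(ranked) - keep_num, 0):]


kept = prune_checkpoints(
    [("a", 0.5), ("b", float("nan")), ("c", 0.9)], keep_num=2
)
kept_names = [name for name, _ in kept]
```

With these inputs the nan-scored entry "b" is the one that gets dropped.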

Related issue number

Closes #23856

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

Yard1 · 2022-04-12T12:49:54Z

cc @XuehaiPan

XuehaiPan · 2022-04-12T12:55:01Z

It needs similar changes with ray/train/checkpoint.py.

amogkam

Thanks for the PR @Yard1 and thanks for raising the issue @XuehaiPan!

amogkam · 2022-04-12T20:10:40Z

python/ray/tune/checkpoint_manager.py

@@ -5,7 +5,7 @@
 from typing import Any, Callable, Optional

 from ray.tune.result import NODE_IP
-from ray.tune.utils.util import flatten_dict
+from ray.tune.utils.util import flatten_dict, is_nan


Should we use the ml_utils is_nan directly and remove it from tune.utils?

I think it's fine to use an alias - we have a precedent for that already.

krfricke

Generally good, one suggestion

amogkam · 2022-04-13T22:19:25Z

This looks great @Yard1! Can we resolve the conflicts, and then I can merge!

XuehaiPan · 2022-04-14T05:03:42Z

python/ray/tune/checkpoint_manager.py

+        if self._checkpoint_score_desc:
+            priority = -priority
+        return (not is_nan(priority), priority, checkpoint.order)


When priority is nan, sorting by the tuple key

(not is_nan(priority), priority, checkpoint.order)

won't order the entries by checkpoint.order, because both nan < nan and nan > nan return False, so the comparison never reaches the third element.
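The problem can be reproduced without Ray. The sketch below is an assumption-laden stand-in: `original_key` mimics the shape of the key in the diff, `math.isnan` replaces `is_nan`, and plain `(priority, order)` pairs replace checkpoint objects. It relies on CPython creating a distinct object for each `float("nan")` call.

```python
import math


def original_key(priority: float, order: int):
    # Same tuple shape as the key under review.
    return (not math.isnan(priority), priority, order)


# Two separate nan objects, listed with the higher order first.
nan_a = float("nan")
nan_b = float("nan")
entries = [(nan_a, 4), (nan_b, 3)]

result = sorted(entries, key=lambda e: original_key(*e))
orders = [order for _, order in result]

# Neither key is "less than" the other, so the stable sort leaves the
# input order untouched instead of sorting by `order`.
unsorted_by_order = orders == [4, 3]
```

Here `orders` stays `[4, 3]`: the nan element blocks the comparison before `order` is ever looked at.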

Suggested change

-        if self._checkpoint_score_desc:
-            priority = -priority
-        return (not is_nan(priority), priority, checkpoint.order)
+        if self._checkpoint_score_desc:
+            priority = -priority
+        if is_nan(priority):
+            return (0, checkpoint.order, priority)
+        return (1, priority, checkpoint.order)
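A standalone sketch of the suggested key, for experimenting outside Ray. `FakeCheckpoint`, `priority_key`, and the `score_desc` parameter are hypothetical stand-ins for the real checkpoint class and `self._checkpoint_score_desc`, and `math.isnan` stands in for `is_nan`.

```python
import math
from dataclasses import dataclass


@dataclass
class FakeCheckpoint:
    value: float  # metric value, possibly nan
    order: int    # creation order


def priority_key(checkpoint: FakeCheckpoint, score_desc: bool = False):
    priority = checkpoint.value
    if score_desc:
        priority = -priority
    if math.isnan(priority):
        # nan entries sort first and are ordered by `order`;
        # the nan itself sits last, where it is only a tie-breaker.
        return (0, checkpoint.order, priority)
    return (1, priority, checkpoint.order)


checkpoints = [
    FakeCheckpoint(float("nan"), 2),
    FakeCheckpoint(0.3, 1),
    FakeCheckpoint(float("nan"), 0),
    FakeCheckpoint(0.1, 3),
]
ranked_orders = [c.order for c in sorted(checkpoints, key=priority_key)]
```

Because the nan value is no longer compared before `order`, the two nan checkpoints come out as orders 0 then 2, followed by the real scores in ascending order.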

Actually it does:

>>> import numpy as np
>>> (False, np.nan, 3) < (False, np.nan, 4)
True
>>> (False, np.nan, 4) < (False, np.nan, 3)
False

It seems that tuple.__lt__ treats a pair of elements as equal, and moves on to the next pair, whenever lhs[i] is rhs[i].

In [1]: (False, float('nan'), 3) < (False, float('nan'), 4)
Out[1]: False

In [2]: (False, float('nan'), 3) > (False, float('nan'), 4)
Out[2]: False

In [3]: (False, float('nan'), 4) < (False, float('nan'), 3)
Out[3]: False

In [4]: (False, float('nan'), 4) > (False, float('nan'), 3)
Out[4]: False

In [5]: import numpy as np

In [6]: (False, np.nan, 3) < (False, np.nan, 4)
Out[6]: True

In [7]: (False, np.nan, 3) > (False, np.nan, 4)
Out[7]: False

In [8]: import math

In [9]: (False, math.nan, 3) < (False, math.nan, 4)
Out[9]: True

In [10]: (False, math.nan, 3) > (False, math.nan, 4)
Out[10]: False

In [11]: float('nan') is float('nan')
Out[11]: False

In [12]: np.nan is np.nan
Out[12]: True

In [13]: math.nan is math.nan
Out[13]: True

In [14]: float('nan') is math.nan
Out[14]: False

In [15]: math.nan is np.nan
Out[15]: False

In [16]: (False, math.nan, 3) < (False, np.nan, 4)
Out[16]: False

np.nan is a single object, but each call of float('nan') creates a new object.
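The same effect can be shown with the standard library alone. This is a minimal sketch assuming CPython, where sequence comparison treats identical objects as equal before falling back to `==`; the variable names are made up for the example.

```python
import math

shared = math.nan      # one nan object, reused in both tuples
nan_a = float("nan")   # two separately created nan objects
nan_b = float("nan")

# Same object on both sides: the element is treated as equal,
# so the comparison moves on to 3 < 4.
same_object = (False, shared, 3) < (False, shared, 4)

# Distinct objects: nan == nan is False, so the tuples are compared
# on the nan element itself, and nan < nan is False in both directions.
distinct = (False, nan_a, 3) < (False, nan_b, 4)
reverse = (False, nan_b, 4) < (False, nan_a, 3)
```

So the result depends on object identity, not on which library the nan came from: `same_object` is True while `distinct` and `reverse` are both False.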

Ah thanks, you're right. It seems indeed that tuples built from separate float('nan') objects never compare as ordered, unlike tuples that share the single np.nan object.

Fix here: #23909

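Following #23862, there was an uncaught bug when comparing nan-priority checkpoints. Distinct nan objects, such as the results of separate float("nan") calls, never compare as equal or ordered, so tuple keys containing them do not fall through to checkpoint.order. Tuples that share one object, such as np.nan, do fall through, because the identity check treats that element as equal. This PR fixes this bug and adds a new test to ensure correct behavior.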

[tune] Treat checkpoints with nan value as worst

e8ed4b8

Yard1 requested review from amogkam, krfricke and xwjiang2010 April 12, 2022 12:49

Yard1 assigned amogkam, krfricke and xwjiang2010 Apr 12, 2022

XuehaiPan reviewed Apr 12, 2022

python/ray/tune/checkpoint_manager.py (outdated, resolved)

Add logic to train

8d13b35

amogkam reviewed Apr 12, 2022

python/ray/train/checkpoint.py (outdated, resolved)

python/ray/tune/checkpoint_manager.py (outdated, resolved)

python/ray/tune/checkpoint_manager.py (outdated, resolved)

Use tuple approach, move is_nan to ml_utils

cc3391e

Yard1 requested a review from amogkam April 12, 2022 16:52

amogkam approved these changes Apr 12, 2022

Yard1 commented Apr 12, 2022

python/ray/train/checkpoint.py (resolved)

python/ray/tune/tests/test_checkpoint_manager.py (resolved)

python/ray/tune/checkpoint_manager.py (resolved)

Apply suggestions from code review

219da98

krfricke approved these changes Apr 13, 2022

python/ray/tune/checkpoint_manager.py (outdated, resolved)

Add checkpoint.order

195e568

XuehaiPan reviewed Apr 14, 2022

Merge branch 'master' into fix_nan_best_checkpoint

6b84693

krfricke merged commit 52eaf02 into ray-project:master Apr 14, 2022

Yard1 deleted the fix_nan_best_checkpoint branch April 14, 2022 09:38

krfricke mentioned this pull request Apr 14, 2022

[tune] Fix checkpoint sorting with nan values #23909

Merged

6 tasks
