[tune/rllib] Add checkpoint eraser #4490

jodusan · 2019-03-27T12:48:38Z

What do these changes do?

Adds a checkpoint eraser so that the disk doesn't fill up.

The new arguments are:

--keep-checkpoints-num - the maximum number of checkpoints to keep.
--checkpoint-score-attr - the result attribute used to rank checkpoints when choosing the "best" ones (for example, episode_reward_mean). By default a larger value is better; prefix the attribute name with min- (for example, min-classification_loss) to rank smaller values as better.
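
To illustrate how the two arguments described above could interact, here is a minimal sketch. It is not the code from this PR: `parse_score_attr` and `CheckpointKeeper` are hypothetical names, and the real eraser also deletes files on disk, which this sketch only reports.

```python
def parse_score_attr(score_attr):
    """Split an optional "min-" prefix off the score attribute name.

    Returns a tuple (attribute_name, higher_is_better).
    """
    prefix = "min-"
    if score_attr.startswith(prefix):
        return score_attr[len(prefix):], False
    return score_attr, True


class CheckpointKeeper:
    """Hypothetical helper that remembers at most `keep_num` best checkpoints."""

    def __init__(self, keep_num, score_attr):
        self.keep_num = keep_num  # None means "keep everything"
        self.attr, self.higher_is_better = parse_score_attr(score_attr)
        self.kept = []  # list of (score, path), best first

    def add(self, path, result):
        """Record a checkpoint and return the paths that should be deleted."""
        self.kept.append((result[self.attr], path))
        # Best checkpoint first: descending for "higher is better",
        # ascending when the attribute was given with the "min-" prefix.
        self.kept.sort(key=lambda item: item[0], reverse=self.higher_is_better)
        if self.keep_num is None:
            return []
        evicted = self.kept[self.keep_num:]
        self.kept = self.kept[:self.keep_num]
        return [evicted_path for _, evicted_path in evicted]


keeper = CheckpointKeeper(keep_num=2, score_attr="min-classification_loss")
keeper.add("checkpoint_1", {"classification_loss": 0.9})
keeper.add("checkpoint_2", {"classification_loss": 0.5})
to_delete = keeper.add("checkpoint_3", {"classification_loss": 0.7})
```

In this usage, the checkpoint with the largest loss is the one reported for deletion once a third checkpoint arrives.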

Related issue number

#4381
#4287

AmplabJenkins · 2019-03-27T12:49:27Z

Can one of the admins verify this patch?

AmplabJenkins · 2019-03-27T15:49:46Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13298/
Test PASSed.

python/ray/tune/config_parser.py

python/ray/tune/ray_trial_executor.py

python/ray/tune/trial.py

AmplabJenkins · 2019-03-29T16:16:45Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13358/
Test FAILed.

python/ray/tune/ray_trial_executor.py

AmplabJenkins · 2019-04-01T14:57:09Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13409/
Test FAILed.

AmplabJenkins · 2019-04-01T15:10:17Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13408/
Test FAILed.

AmplabJenkins · 2019-04-01T15:25:35Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13407/
Test FAILed.

ericl · 2019-04-04T07:58:41Z

python/ray/tune/trial.py

+        self.best_checkpoint_attr_value = -float("inf") \
+            if self._cmp_greater else float("inf")
+        self.checkpoint_score_attr = checkpoint_score_attr \
+            if self._cmp_greater else checkpoint_score_attr[4:]


Could add a comment that this strips off the "min-" part.

ericl

Looks good to me! One minor comment.

AmplabJenkins · 2019-04-04T12:41:49Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13515/
Test FAILed.

AmplabJenkins · 2019-04-04T12:51:54Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13516/
Test FAILed.

ericl · 2019-04-04T21:32:01Z

python/ray/tune/trial_runner.py:205: in restore
    new_trial = Trial(trial_cp["trainable_name"])
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

self = <[AttributeError("'Trial' object has no attribute 'custom_trial_name'") raised in repr()] SafeRepr object at 0x7f1b9ee50c68>
trainable_name = '_Mock', config = None, trial_id = None
local_dir = '/root/ray_results', experiment_tag = '', resources = None
stopping_criterion = None, checkpoint_freq = 0, checkpoint_at_end = False
keep_checkpoints_num = None, checkpoint_score_attr = None, export_formats = None
restore_path = None, upload_dir = None, trial_name_creator = None
loggers = None, sync_function = None, max_failures = 0

    def __init__(self,
                 trainable_name,
                 config=None,
                 trial_id=None,
                 local_dir=DEFAULT_RESULTS_DIR,
                 experiment_tag="",
                 resources=None,
                 stopping_criterion=None,
                 checkpoint_freq=0,
                 checkpoint_at_end=False,
                 keep_checkpoints_num=None,
                 checkpoint_score_attr=None,
                 export_formats=None,
                 restore_path=None,
                 upload_dir=None,
                 trial_name_creator=None,
                 loggers=None,
                 sync_function=None,
                 max_failures=0):
        """Initialize a new trial.
    
        The args here take the same meaning as the command line flags defined
        in ray.tune.config_parser.
        """
    
        Trial._registration_check(trainable_name)
        # Trial config
        self.trainable_name = trainable_name
        self.config = config or {}
        self.local_dir = os.path.expanduser(local_dir)
        self.experiment_tag = experiment_tag
        self.resources = (
            resources
            or self._get_trainable_cls().default_resource_request(self.config))
        self.stopping_criterion = stopping_criterion or {}
        self.upload_dir = upload_dir
        self.loggers = loggers
        self.sync_function = sync_function
        validate_sync_function(sync_function)
        self.verbose = True
        self.max_failures = max_failures
    
        # Local trial state that is updated during the run
        self.last_result = {}
        self.last_update_time = -float("inf")
        self.checkpoint_freq = checkpoint_freq
        self.checkpoint_at_end = checkpoint_at_end
    
        self.history = []
        self.keep_checkpoints_num = keep_checkpoints_num
>       self._cmp_greater = not checkpoint_score_attr.startswith("min-")
E       AttributeError: 'NoneType' object has no attribute 'startswith'
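
The traceback above comes from calling `startswith` on a `checkpoint_score_attr` of `None`. One way to guard against that is sketched below. It assumes a fallback attribute of `training_iteration`, which one commit title in this PR mentions; the actual fix is not shown in this thread, and `resolve_score_attr` is a hypothetical helper name.

```python
def resolve_score_attr(checkpoint_score_attr=None,
                       default="training_iteration"):
    """Return (attr_name, cmp_greater, initial_best) without failing on None.

    `default` is an assumed fallback, not confirmed by the thread.
    """
    if checkpoint_score_attr is None:
        checkpoint_score_attr = default
    cmp_greater = not checkpoint_score_attr.startswith("min-")
    # Strip the 4-character "min-" prefix when it is present.
    attr_name = (checkpoint_score_attr
                 if cmp_greater else checkpoint_score_attr[4:])
    # Start from the worst possible score for the chosen direction.
    initial_best = -float("inf") if cmp_greater else float("inf")
    return attr_name, cmp_greater, initial_best
```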

jodusan · 2019-04-05T10:31:01Z

@ericl Updated!

AmplabJenkins · 2019-04-05T11:55:56Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13588/
Test FAILed.

richardliaw · 2019-04-05T17:27:01Z

python/ray/tune/trial.py

@@ -495,6 +505,27 @@ def update_last_result(self, result, terminate=False):
        self.last_update_time = time.time()
        self.result_logger.on_result(self.last_result)

+    def compare_checkpoints(self, attr_mean):
+        """Compares two checkpoints based on the attribute attr_mean param.


Can you change this to match the Google Style for docstrings? https://sphinxcontrib-napoleon.readthedocs.io/en/latest/example_google.html
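
For reference, a Google-style docstring has `Args:` and `Returns:` sections. The function below is a hypothetical standalone sketch of a checkpoint comparison written in that style. Its name, parameters, and behavior are assumptions for illustration, not the `compare_checkpoints` method from the diff.

```python
def is_better_checkpoint(attr_mean, best_value, cmp_greater=True):
    """Decide whether a new checkpoint outranks the best one seen so far.

    Args:
        attr_mean (float): Score of the new checkpoint, i.e. the value of
            the attribute used for ranking.
        best_value (float): Best score recorded so far.
        cmp_greater (bool): If True, larger scores are better; if False,
            smaller scores are better.

    Returns:
        bool: True if the new checkpoint is strictly better than the
            current best, False otherwise.
    """
    if cmp_greater:
        return attr_mean > best_value
    return attr_mean < best_value
```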

richardliaw · 2019-04-05T17:34:00Z

jenkins retest this please

AmplabJenkins · 2019-04-05T17:35:48Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-Perf-Integration-PRB/384/
Test PASSed.

AmplabJenkins · 2019-04-05T19:42:31Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13590/
Test FAILed.

ericl · 2019-04-06T07:40:11Z

jenkins retest this please

Also some lint errors.

AmplabJenkins · 2019-04-07T02:19:20Z

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/Ray-PRB/13607/
Test FAILed.

ericl · 2019-04-07T03:02:04Z

Tests look good, merging. Thanks!

Init branch

9421009

ericl reviewed Mar 28, 2019

python/ray/tune/config_parser.py (outdated)

python/ray/tune/ray_trial_executor.py

python/ray/tune/trial.py (outdated)

Set default attr training_iteration, simplify class attributes.

41c84ca

ericl requested changes Mar 30, 2019

python/ray/tune/ray_trial_executor.py

jodusan added 4 commits April 1, 2019 14:24

Simplify checkpoint comparing

73b28ed

merge master

21fa087

Remove blank space

017f50c

Remove outdated comment

c8e46c2

ericl self-assigned this Apr 3, 2019

devin-petersohn mentioned this pull request Apr 3, 2019

Add checkpoint eraserv2 #4380

Closed

ericl reviewed Apr 4, 2019

ericl approved these changes Apr 4, 2019

jodusan added 4 commits April 4, 2019 10:33

Run scripts/format

8e695f8

Merge branch 'master' of github.com:wingman-ai/ray into add_checkpoin…

0436a94

…t_eraser3

Add comment for strip

de0232c

Merge branch 'master' of https://github.com/ray-project/ray into add_…

791b288

…checkpoint_eraser3

jodusan added 2 commits April 5, 2019 11:24

Checkpoint_score_attr change default value

cd06f37

Merge branch 'master' of https://github.com/ray-project/ray into add_…

ec4e942

…checkpoint_eraser3

richardliaw reviewed Apr 5, 2019

richardliaw changed the title ~~Add checkpoint eraser~~ [tune/rllib] Add checkpoint eraser Apr 5, 2019

This was referenced Apr 5, 2019

[rllib] Possible checkpointing improvements #4570

Closed

[rllib] What is the proper way to restore checkpoint for fine-tuning / rendering / evaluation of a trained agent based on example/multiagent_cartpole.py? #4569

Closed

lint

c6c5681

ericl merged commit 820c71b into ray-project:master Apr 7, 2019

jodusan deleted the add_checkpoint_eraser3 branch April 8, 2019 13:19

This was referenced Apr 29, 2019

[tune] Cannot restore checkpoint for experiment #4714

Closed

[tune] Fix error when restore checkpoint at tune.run() #4715

Closed

[tune] fix restore error at tune.run() #4733

Merged

richardliaw mentioned this pull request Jul 9, 2019

[Tune] Revisiting checkpointing policy #4287

Closed
