Checkpointing #24

psfoley · 2021-06-10T23:19:07Z

This PR adds:

- Optional checkpointing and resume from the last completed experiment round
- Continuation of an experiment after small plan changes (the main use case is reusing a checkpoint even if the number of rounds increases)
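As a rough illustration of the resume behavior described here, the sketch below saves the last completed round after each round and, on restore, continues from the next one up to an absolute round total. It is not the code from this PR: the function names, the `checkpoint.json` layout, the zero-based round numbering, and the stand-in "training" step are all assumptions made for the example.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(folder, completed_round, state):
    """Write the last completed round and its state (hypothetical file layout)."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    payload = {"completed_round": completed_round, "state": state}
    tmp = folder / "checkpoint.json.tmp"
    tmp.write_text(json.dumps(payload))
    # Rename after a full write so an interrupted save leaves the old file intact.
    tmp.replace(folder / "checkpoint.json")


def load_checkpoint(folder):
    """Return the saved payload, or None if no checkpoint exists."""
    path = Path(folder) / "checkpoint.json"
    if not path.exists():
        return None
    return json.loads(path.read_text())


def run_experiment(rounds_to_train, folder, save_checkpoints=True, restore=False):
    """Run rounds up to an absolute total, optionally resuming from a checkpoint."""
    checkpoint = load_checkpoint(folder) if restore else None
    start_round = checkpoint["completed_round"] + 1 if checkpoint else 0
    state = checkpoint["state"] if checkpoint else {"updates": 0}
    executed = []
    for round_num in range(start_round, rounds_to_train):
        state = {"updates": state["updates"] + 1}  # stand-in for one training round
        executed.append(round_num)
        if save_checkpoints:
            save_checkpoint(folder, round_num, state)
    return executed, state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as folder:
        print(run_experiment(2, folder))                # rounds 0 and 1
        print(run_experiment(5, folder, restore=True))  # rounds 2, 3 and 4
```

Because `rounds_to_train` is treated as an absolute total in this sketch, raising it from 2 to 5 on resume runs three more rounds rather than five.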

sarthakpati

LGTM!

brandon-edwards

Changes look good. I am having some trouble, though, that I will continue to look into. I am testing with a test partition that has only a few (about 4) sample institutions. My test so far was to set the rounds to train to 5, let it run until round 1 was done, and kill it partway into round 2. I then ran with total rounds set to 1 and 2 and saw that this is an absolute total number of rounds, so I set the rounds to train to 3 and let it run to complete round 2. I then get the following error on completion of round 2.

TypeError Traceback (most recent call last)
in
11 device=device,
12 save_checkpoints=save_checkpoints,
---> 13 restore_from_checkpoint_folder = restore_from_checkpoint_folder)

~/repositories/PatrickChallenge/Task_1/fets_challenge/experiment.py in run_challenge_experiment(aggregation_function, choose_training_collaborators, training_hyper_parameters_for_round, validation_functions, institution_split_csv_filename, brats_training_data_parent_dir, db_store_rounds, rounds_to_train, device, save_checkpoints, restore_from_checkpoint_folder)
447
448 # run the collaborator
--> 449 collaborators[col].run_simulation()
450
451 logger.info("Collaborator {} took simulated time: {} minutes".format(col, round(t / 60, 2)))

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in run_simulation(self)
145 '{} received the following tasks: {}'.format(self.collaborator_name, tasks))
146 for task in tasks:
--> 147 self.do_task(task, round_number)
148 self.logger.info(
149 'All tasks completed on {} for round {}...'.format(

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in do_task(self, task, round_number)
218 # send the results for this tasks; delta and compression will occur in
219 # this function
--> 220 self.send_task_results(global_output_tensor_dict, round_number, task)
221
222 def get_numpy_dict_for_tensorkeys(self, tensor_keys):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in send_task_results(self, tensor_dict, round_number, task_name)
377
378 self.client.send_local_task_results(
--> 379 self.collaborator_name, round_number, task_name, data_size, named_tensors)
380
381 def nparray_to_named_tensor(self, tensor_key, nparray):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in send_local_task_results(self, collaborator_name, round_number, task_name, data_size, named_tensors)
504 self.collaborator_tasks_results[task_key] = task_results
505
--> 506 self._end_of_task_check(task_name)
507
508 def _process_named_tensor(self, named_tensor, collaborator_name):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _end_of_task_check(self, task_name)
615 if self._is_task_done(task_name):
616 # now check for the end of the round
--> 617 self._end_of_round_check()
618
619 def _prepare_trained(self, tensor_name, origin, round_number, report, agg_results):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _end_of_round_check(self)
803 all_tasks = self.assigner.get_all_tasks_for_round(self.round_number)
804 for task_name in all_tasks:
--> 805 self._compute_validation_related_task_metrics(task_name)
806
807 # Once all of the task results have been processed

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _compute_validation_related_task_metrics(self, task_name)
766 if agg_function:
767 self.logger.info('{0} {1}:\t{2:.4f}'.format(
--> 768 agg_function, agg_tensor_name, agg_results)
769 )
770 else:

TypeError: unsupported format string passed to tuple.__format__

I don't yet have an understanding of what happened, which is why I'm giving so much information about what I did. I will continue to test tomorrow, though, and wanted to let you guys know what I've found so far.
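For context on the error itself: the last frame of the traceback applies the `{2:.4f}` format spec to `agg_results`, and Python raises this `TypeError` whenever that value is a tuple rather than a single number. The thread does not say what `agg_results` actually contained, so the values below are made up purely to reproduce the message, and `format_metric` is a hypothetical helper, not part of openfl.

```python
# Illustrative values only; the real contents of agg_results are not given in the thread.
agg_function = "weighted_average"
agg_tensor_name = "valid_dice"
agg_results = (0.9123, 0.8876)  # a tuple where a single float is expected

try:
    "{0} {1}:\t{2:.4f}".format(agg_function, agg_tensor_name, agg_results)
except TypeError as exc:
    message = str(exc)

print(message)  # unsupported format string passed to tuple.__format__


def format_metric(value):
    """Format a scalar, or each element of a tuple/list, to four decimals (hypothetical helper)."""
    if isinstance(value, (tuple, list)):
        return ", ".join("{:.4f}".format(v) for v in value)
    return "{:.4f}".format(value)


print(format_metric(agg_results))  # 0.9123, 0.8876
```

A helper like this only avoids the formatting failure; it does not explain why a tuple reached the logging call in the first place.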

brandon-edwards · 2021-06-11T16:56:19Z

I'm a little confused about the following in the first cell of the 'Setting up of experiment' section of the notebook: "Please note that if you start from an earlier round, checkpoints will be overwritten when 'save_checkpoints' is set to True." It is the 'earlier' that sounds confusing to me. @psfoley could I replace with, 'Please note that if you restore from a checkpoint, and save checkpoint is set to True, then the checkpoint you restore from will be subsequently overwritten'? Or is this not the meaning?
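If the proposed wording matches the actual behavior, anyone who wants to keep the checkpoint they restore from would need a copy made before resuming. A minimal sketch of that precaution, assuming the checkpoint is an ordinary directory on disk; `backup_checkpoint_folder` and the `_backup` suffix are hypothetical and not part of the challenge code:

```python
import shutil
import tempfile
from pathlib import Path


def backup_checkpoint_folder(checkpoint_folder):
    """Copy a checkpoint folder to a sibling '<name>_backup' directory."""
    src = Path(checkpoint_folder)
    dst = src.with_name(src.name + "_backup")
    if dst.exists():
        # Refuse to clobber an earlier backup.
        raise FileExistsError("backup already exists: {}".format(dst))
    shutil.copytree(src, dst)
    return dst


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as root:
        folder = Path(root) / "checkpoint"
        folder.mkdir()
        (folder / "state.txt").write_text("round 1")
        backup = backup_checkpoint_folder(folder)
        (folder / "state.txt").write_text("round 2")  # simulates the overwrite on resume
        print((backup / "state.txt").read_text())     # the backup still holds the old state
```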

psfoley · 2021-06-11T17:08:09Z

@brandon-edwards thanks for the review. I will update the description with that blurb.

That error is interesting. I'm not changing the metrics reported from round to round, but admittedly I've been validating this PR with the small_split.csv partition.


brandon-edwards · 2021-06-11T17:20:11Z

Committed to 'checkpoint' branch, changing the wording regarding data partitioning.

brandon-edwards · 2021-06-11T20:28:35Z

Issue mentioned above resolved, I will merge.

psfoley and others added 2 commits June 10, 2021 22:08

Adds checkpointing functionality

e5c6dd8

Merge main into branch

1827379

psfoley requested review from brandon-edwards and sarthakpati June 10, 2021 23:19

sarthakpati previously approved these changes Jun 11, 2021


brandon-edwards requested changes Jun 11, 2021


Update FeTS_Challenge.ipynb

7cd3c09

Changing the comments on partitioning.

brandon-edwards dismissed sarthakpati’s stale review via 7cd3c09 June 11, 2021 17:16

brandon-edwards and others added 3 commits June 11, 2021 10:18

Update FeTS_Challenge.ipynb

e2627ac

Format and small wording change

Updated checkpoint resume behavior description in notebook

1cef1b9

Merge branch 'checkpoint' of https://github.com/psfoley/Challenge int…

af42168

…o HEAD

brandon-edwards approved these changes Jun 11, 2021


brandon-edwards merged commit 1a2b544 into FeTS-AI:main Jun 11, 2021
