@psfoley psfoley commented Jun 10, 2021

This PR adds:

  • Optional checkpointing, and resuming from the last completed experiment round
  • Continuation of an experiment after small plan changes (the main use case is reusing a checkpoint even if the number of rounds increases)
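
For readers unfamiliar with the pattern, the following is a minimal sketch of round-level checkpoint and resume, not the code in this PR. The names `save_checkpoint`, `load_checkpoint`, `run_rounds`, and the `checkpoint.json` file are hypothetical, and the sketch assumes that the round count is an absolute total, so raising it on a later run simply continues from the last completed round.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(folder, completed_rounds, state):
    """Record how many rounds have completed, plus the state after them."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    payload = {"completed_rounds": completed_rounds, "state": state}
    # Write to a temporary file first so an interrupted write cannot
    # corrupt the previous checkpoint.
    tmp = folder / "checkpoint.json.tmp"
    tmp.write_text(json.dumps(payload))
    tmp.replace(folder / "checkpoint.json")


def load_checkpoint(folder):
    """Return (completed_rounds, state), or (0, None) if nothing is saved."""
    path = Path(folder) / "checkpoint.json"
    if not path.exists():
        return 0, None
    payload = json.loads(path.read_text())
    return payload["completed_rounds"], payload["state"]


def run_rounds(rounds_to_train, folder, train_one_round, initial_state):
    """Run up to `rounds_to_train` total rounds, resuming if possible."""
    completed, state = load_checkpoint(folder)
    if state is None:
        state = initial_state
    for round_number in range(completed, rounds_to_train):
        state = train_one_round(state, round_number)
        # Saving after every round overwrites the previous checkpoint.
        save_checkpoint(folder, round_number + 1, state)
    return state


def toy_round(state, round_number):
    # Stand-in for real training: just record which rounds were executed.
    return state + [round_number]


with tempfile.TemporaryDirectory() as folder:
    first = run_rounds(2, folder, toy_round, [])    # runs rounds 0 and 1
    resumed = run_rounds(5, folder, toy_round, [])  # continues with rounds 2 to 4
```

In this sketch the second call does not repeat rounds 0 and 1, because the checkpoint already records them as completed. It also writes its checkpoints into the folder it restored from, so the restored checkpoint is replaced as soon as the next round finishes.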

sarthakpati previously approved these changes Jun 11, 2021

@sarthakpati sarthakpati left a comment

LGTM!

@brandon-edwards brandon-edwards left a comment

Changes look good. I am having some trouble, though, that I will continue to look into. I am testing with a test partition that has only a few (about 4) sample institutions. My test so far was to set the rounds to train to 5, let it run until round 1 was done, and kill it partway into round 2. I then ran with total rounds set to 1 and 2 and saw that this is an absolute total number of rounds, so I set the rounds to train to 3 and let it run to complete round 2. I then got the following error on completion of round 2:

TypeError Traceback (most recent call last)
in
11 device=device,
12 save_checkpoints=save_checkpoints,
---> 13 restore_from_checkpoint_folder = restore_from_checkpoint_folder)

~/repositories/PatrickChallenge/Task_1/fets_challenge/experiment.py in run_challenge_experiment(aggregation_function, choose_training_collaborators, training_hyper_parameters_for_round, validation_functions, institution_split_csv_filename, brats_training_data_parent_dir, db_store_rounds, rounds_to_train, device, save_checkpoints, restore_from_checkpoint_folder)
447
448 # run the collaborator
--> 449 collaborators[col].run_simulation()
450
451 logger.info("Collaborator {} took simulated time: {} minutes".format(col, round(t / 60, 2)))

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in run_simulation(self)
145 '{} received the following tasks: {}'.format(self.collaborator_name, tasks))
146 for task in tasks:
--> 147 self.do_task(task, round_number)
148 self.logger.info(
149 'All tasks completed on {} for round {}...'.format(

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in do_task(self, task, round_number)
218 # send the results for this tasks; delta and compression will occur in
219 # this function
--> 220 self.send_task_results(global_output_tensor_dict, round_number, task)
221
222 def get_numpy_dict_for_tensorkeys(self, tensor_keys):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in send_task_results(self, tensor_dict, round_number, task_name)
377
378 self.client.send_local_task_results(
--> 379 self.collaborator_name, round_number, task_name, data_size, named_tensors)
380
381 def nparray_to_named_tensor(self, tensor_key, nparray):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in send_local_task_results(self, collaborator_name, round_number, task_name, data_size, named_tensors)
504 self.collaborator_tasks_results[task_key] = task_results
505
--> 506 self._end_of_task_check(task_name)
507
508 def _process_named_tensor(self, named_tensor, collaborator_name):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _end_of_task_check(self, task_name)
615 if self._is_task_done(task_name):
616 # now check for the end of the round
--> 617 self._end_of_round_check()
618
619 def _prepare_trained(self, tensor_name, origin, round_number, report, agg_results):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _end_of_round_check(self)
803 all_tasks = self.assigner.get_all_tasks_for_round(self.round_number)
804 for task_name in all_tasks:
--> 805 self._compute_validation_related_task_metrics(task_name)
806
807 # Once all of the task results have been processed

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _compute_validation_related_task_metrics(self, task_name)
766 if agg_function:
767 self.logger.info('{0} {1}:\t{2:.4f}'.format(
--> 768 agg_function, agg_tensor_name, agg_results)
769 )
770 else:

TypeError: unsupported format string passed to tuple.__format__

I don't yet have an understanding of what happened, which is why I'm giving so much information about what I did. I will continue to test tomorrow, though, and wanted to let you guys know what I've found so far.
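
As an illustration only, the snippet below reproduces the same error message outside of OpenFL. It assumes the failure arises because the value passed to the `{2:.4f}` placeholder is a tuple rather than a single number. The thread does not confirm that cause, and the metric names and the `format_metric` helper are hypothetical.

```python
# Same format string as in the logger call at the bottom of the traceback.
fmt = '{0} {1}:\t{2:.4f}'

# A scalar metric formats as expected.
scalar_line = fmt.format('weighted_average', 'acc', 0.912345)

# A tuple does not support the '.4f' format spec, so this raises TypeError.
try:
    fmt.format('weighted_average', 'acc', (0.91, 0.88))
    error_message = None
except TypeError as exc:
    error_message = str(exc)


def format_metric(value):
    """Hypothetical guard: format each element when a sequence is received."""
    if isinstance(value, (tuple, list)):
        return ', '.join('{:.4f}'.format(v) for v in value)
    return '{:.4f}'.format(value)
```

If a validation function returned several values at once, a guard along these lines would let them be logged without raising, although fixing the function to return one scalar per metric may be the cleaner option.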

brandon-edwards commented Jun 11, 2021

I'm a little confused about the following in the first cell of the 'Setting up of experiment' section of the notebook: "Please note that if you start from an earlier round, checkpoints will be overwritten when 'save_checkpoints' is set to True." It is the word 'earlier' that sounds confusing to me. @psfoley, could I replace it with "Please note that if you restore from a checkpoint and 'save_checkpoints' is set to True, the checkpoint you restore from will subsequently be overwritten"? Or is this not the meaning?

psfoley commented Jun 11, 2021

@brandon-edwards thanks for the review. I will update the description with that blurb.

That error is interesting. I'm not changing the metrics reported from round to round, but admittedly I've been validating this PR with the small_split.csv partition.

Changing the comments on partitioning.
@brandon-edwards

Committed to 'checkpoint' branch, changing the wording regarding data partitioning.

@brandon-edwards

Issue mentioned above resolved, I will merge.

@brandon-edwards brandon-edwards merged commit 1a2b544 into FeTS-AI:main Jun 11, 2021