Checkpointing #24
Conversation
sarthakpati left a comment:
LGTM!
brandon-edwards left a comment:
Changes look good. I am having some trouble, though, that I will continue to look into. I am testing with a test partition that has only a few (4) sample institutions. My first test was to set rounds_to_train to 5, let it run until round 1 was done, and kill it partway into round 2. I then ran with rounds_to_train set to 1 and to 2, and saw that this value is an absolute total number of rounds. So I set rounds_to_train to 3 and let it run to complete round 2. On completion of round 2, I get the following error:
```
TypeError                                 Traceback (most recent call last)
<ipython-input> in <module>
     11     device=device,
     12     save_checkpoints=save_checkpoints,
---> 13     restore_from_checkpoint_folder = restore_from_checkpoint_folder)

~/repositories/PatrickChallenge/Task_1/fets_challenge/experiment.py in run_challenge_experiment(aggregation_function, choose_training_collaborators, training_hyper_parameters_for_round, validation_functions, institution_split_csv_filename, brats_training_data_parent_dir, db_store_rounds, rounds_to_train, device, save_checkpoints, restore_from_checkpoint_folder)
    447
    448     # run the collaborator
--> 449     collaborators[col].run_simulation()
    450
    451     logger.info("Collaborator {} took simulated time: {} minutes".format(col, round(t / 60, 2)))

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in run_simulation(self)
    145             '{} received the following tasks: {}'.format(self.collaborator_name, tasks))
    146         for task in tasks:
--> 147             self.do_task(task, round_number)
    148         self.logger.info(
    149             'All tasks completed on {} for round {}...'.format(

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in do_task(self, task, round_number)
    218         # send the results for this tasks; delta and compression will occur in
    219         # this function
--> 220         self.send_task_results(global_output_tensor_dict, round_number, task)
    221
    222     def get_numpy_dict_for_tensorkeys(self, tensor_keys):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/collaborator/collaborator.py in send_task_results(self, tensor_dict, round_number, task_name)
    377
    378         self.client.send_local_task_results(
--> 379             self.collaborator_name, round_number, task_name, data_size, named_tensors)
    380
    381     def nparray_to_named_tensor(self, tensor_key, nparray):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in send_local_task_results(self, collaborator_name, round_number, task_name, data_size, named_tensors)
    504         self.collaborator_tasks_results[task_key] = task_results
    505
--> 506         self._end_of_task_check(task_name)
    507
    508     def _process_named_tensor(self, named_tensor, collaborator_name):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _end_of_task_check(self, task_name)
    615         if self._is_task_done(task_name):
    616             # now check for the end of the round
--> 617             self._end_of_round_check()
    618
    619     def _prepare_trained(self, tensor_name, origin, round_number, report, agg_results):

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _end_of_round_check(self)
    803         all_tasks = self.assigner.get_all_tasks_for_round(self.round_number)
    804         for task_name in all_tasks:
--> 805             self._compute_validation_related_task_metrics(task_name)
    806
    807         # Once all of the task results have been processed

~/virtual/fets_challenge_test/lib/python3.6/site-packages/openfl/component/aggregator/aggregator.py in _compute_validation_related_task_metrics(self, task_name)
    766         if agg_function:
    767             self.logger.info('{0} {1}:\t{2:.4f}'.format(
--> 768                 agg_function, agg_tensor_name, agg_results)
    769             )
    770         else:

TypeError: unsupported format string passed to tuple.__format__
```
I don't yet have an understanding of what happened, which is why I've given so much information about what I did. I will continue to test tomorrow, and wanted to let you all know what I've found so far.
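As an aside, here is one reading of that failure (a minimal sketch, not the actual aggregator code): `'{:.4f}'` is a float format spec, and if `agg_results` arrives as a tuple rather than a float, `str.format` forwards the spec to `tuple.__format__`, which raises exactly this `TypeError`. The variable names below are taken from the traceback; the tuple value is a made-up stand-in.

```python
# Minimal reproduction of the TypeError above (hypothetical values).
agg_function, agg_tensor_name = 'weighted_average', 'valid_dice'  # placeholders
agg_results = (0.1234, 0.5678)  # a tuple where a float is expected

try:
    # The '.4f' spec is forwarded to tuple.__format__, which rejects it.
    print('{0} {1}:\t{2:.4f}'.format(agg_function, agg_tensor_name, agg_results))
except TypeError as err:
    print(err)  # -> unsupported format string passed to tuple.__format__
```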
I'm a little confused about the following in the first cell of the 'Setting up of experiment' section of the notebook: "Please note that if you start from an earlier round, checkpoints will be overwritten when 'save_checkpoints' is set to True." It is the word 'earlier' that sounds confusing to me. @psfoley could I replace it with: 'Please note that if you restore from a checkpoint, and save_checkpoints is set to True, then the checkpoint you restored from will subsequently be overwritten'? Or is this not the meaning?
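To illustrate the behavior the proposed wording describes, a toy sketch (not this PR's implementation; the file and folder names are made up): if the restore and the save both target the same folder, resuming a run necessarily clobbers the checkpoint it resumed from.

```python
# Toy illustration of restore-then-overwrite checkpoint semantics.
import pickle
from pathlib import Path

CHECKPOINT_DIR = Path('checkpoint')  # folder name is an assumption

def save_checkpoint(state):
    CHECKPOINT_DIR.mkdir(exist_ok=True)
    with open(CHECKPOINT_DIR / 'state.pkl', 'wb') as f:
        pickle.dump(state, f)

def restore_checkpoint():
    with open(CHECKPOINT_DIR / 'state.pkl', 'rb') as f:
        return pickle.load(f)

save_checkpoint({'round': 2})     # earlier run saved after round 2
state = restore_checkpoint()      # new run restores from round 2
state['round'] += 1
save_checkpoint(state)            # ...and immediately overwrites it
print(restore_checkpoint())       # {'round': 3} -- the round-2 state is gone
```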
@brandon-edwards thanks for the review. I will update the description with that blurb. That error is interesting. I'm not changing the metrics reported from round to round, but admittedly I've been validating this PR with the […]
Changing the comments on partitioning.
Format and small wording change
Committed to the 'checkpoint' branch, changing the wording regarding data partitioning.
The issue mentioned above is resolved; I will merge.
This PR adds: