add trainer.stop and fix a bug for train_by_parallel_executor #10762
Conversation
```python
'''
if float(test_metrics[0]) < 20.0:
if isinstance(event, fluid.EndStepEvent):
    if event.step == 10:
```
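For context, the snippet above stops training once a step threshold is reached. Below is a minimal pure-Python sketch of such an event handler; the `EndStepEvent` and `MockTrainer` classes here are hypothetical stand-ins that only mirror the shape of the fluid API and the `trainer.stop()` method this PR adds, not the real implementation:

```python
class EndStepEvent:
    """Stand-in for fluid.EndStepEvent (hypothetical mock)."""
    def __init__(self, epoch_id, step_id, metrics):
        self.epoch = epoch_id
        self.step = step_id
        self.metrics = metrics

class MockTrainer:
    """Minimal trainer mock exposing the stop() added by this PR."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

trainer = MockTrainer()

def event_handler(event):
    # Stop once we reach step 10, mirroring the test snippet above.
    if isinstance(event, EndStepEvent) and event.step == 10:
        trainer.stop()

# Drive the handler with fake end-of-step events.
for step in range(20):
    if trainer.stopped:
        break
    event_handler(EndStepEvent(epoch_id=0, step_id=step, metrics=[]))

print(trainer.stopped)  # True
```

In this sketch the loop checks the flag at the top of each iteration, so one extra step begins after `stop()` is called before the loop exits.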
Is this step counted per epoch? Or is it more like an iteration count (mini-batches counted across epochs)?
(Because we have both epoch_id and step_id in the trainer code.)
Paddle/python/paddle/fluid/trainer.py
Lines 289 to 302 in 54ae8e4
```python
for epoch_id in range(num_epochs):
    event_handler(BeginEpochEvent(epoch_id))
    for step_id, data in enumerate(reader()):
        begin_event = BeginStepEvent(epoch_id, step_id)
        event_handler(begin_event)
        if begin_event.fetch_metrics:
            metrics = exe.run(feed=data,
                              fetch_list=[
                                  var.name
                                  for var in self.train_func_outputs
                              ])
        else:
            metrics = exe.run(feed=data, fetch_list=[])
        event_handler(EndStepEvent(epoch_id, step_id, metrics))
```
From the code we can see that step_id is independent for each epoch, but we can also get the epoch id from EndStepEvent.epoch_id.
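As a sketch, a handler that wants a monotonically increasing iteration count can derive one from epoch_id and step_id itself. The `EndStepEvent` class below is a hypothetical mock mirroring the fields in trainer.py, and `STEPS_PER_EPOCH` is an assumed quantity the user would know from their reader:

```python
class EndStepEvent:
    """Mock of fluid.EndStepEvent carrying both ids (hypothetical)."""
    def __init__(self, epoch_id, step_id, metrics):
        self.epoch = epoch_id
        self.step = step_id
        self.metrics = metrics

STEPS_PER_EPOCH = 5  # assumed known by the user (size of the reader)

def global_iteration(event):
    # event.step resets to 0 every epoch, so fold the epoch id
    # into the count to get a number that never goes backwards.
    return event.epoch * STEPS_PER_EPOCH + event.step

# Simulate two epochs of end-of-step events: step restarts at 0
# each epoch, but the derived iteration number keeps increasing.
iterations = [
    global_iteration(EndStepEvent(e, s, []))
    for e in range(2)
    for s in range(STEPS_PER_EPOCH)
]
print(iterations)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```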
Yeah, actually I was just confused by the naming. I was thinking that event.step should represent iterations, and an iteration count should probably never reset to 0. But in this case event.step is based on step_id.
However, step_id runs from 0 to the number of mini-batches, and after each epoch it is reset to 0.
So basically my point was: could we have a separate naming convention :)
```diff
-        for epoch_id in range(num_epochs):
-            self._train_by_any_executor(event_handler, pe, num_epochs,
-                                        reader)
+        self._train_by_any_executor(event_handler, pe, num_epochs, reader)
```
So what is the reason for getting rid of the for loop? Just curious 😁
This was a bug: _train_by_any_executor already iterates over num_epochs internally, so the outer loop made the epochs run num_epochs times over.
Paddle/python/paddle/fluid/trainer.py
Lines 288 to 291 in 54ae8e4
```python
def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):
    for epoch_id in range(num_epochs):
        event_handler(BeginEpochEvent(epoch_id))
        for step_id, data in enumerate(reader()):
```
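The effect of the fix can be seen with a minimal pure-Python mock (not the real trainer): because the callee already loops over num_epochs, wrapping the call in another epoch loop runs num_epochs ** 2 epochs in total.

```python
epochs_run = []

def _train_by_any_executor(num_epochs):
    # Mock of the inner loop that already exists in trainer.py.
    for epoch_id in range(num_epochs):
        epochs_run.append(epoch_id)

num_epochs = 3

# Buggy caller: an extra outer loop around a function that loops itself.
for _ in range(num_epochs):
    _train_by_any_executor(num_epochs)
buggy_count = len(epochs_run)
print(buggy_count)  # 9 == num_epochs ** 2

# Fixed caller: call it once and let it own the epoch loop.
epochs_run.clear()
_train_by_any_executor(num_epochs)
print(len(epochs_run))  # 3
```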
👍
fix: #10754