add trainer.stop and fix a bug for train_by_parallel_executor #10762
Conversation
```python
'''
if float(test_metrics[0]) < 20.0:
if isinstance(event, fluid.EndStepEvent):
    if event.step == 10:
```
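For context, the snippet above stops training once a step threshold is reached. Below is a minimal pure-Python sketch of such an event handler; the `EndStepEvent` and `MockTrainer` classes here are hypothetical stand-ins that only mirror the shape of the fluid API and the `trainer.stop()` method this PR adds, not the real implementation:

```python
class EndStepEvent:
    """Stand-in for fluid.EndStepEvent (hypothetical mock)."""
    def __init__(self, epoch_id, step_id, metrics):
        self.epoch = epoch_id
        self.step = step_id
        self.metrics = metrics

class MockTrainer:
    """Minimal trainer mock exposing the stop() added by this PR."""
    def __init__(self):
        self.stopped = False
    def stop(self):
        self.stopped = True

trainer = MockTrainer()

def event_handler(event):
    # Stop once we reach step 10, mirroring the test snippet above.
    if isinstance(event, EndStepEvent) and event.step == 10:
        trainer.stop()

# Drive the handler with fake end-of-step events.
for step in range(20):
    if trainer.stopped:
        break
    event_handler(EndStepEvent(epoch_id=0, step_id=step, metrics=[]))

print(trainer.stopped)  # True
```

In this sketch the loop checks the flag at the top of each iteration, so one extra step begins after `stop()` is called before the loop exits.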
Is this step counted per epoch? Or is it more like an iteration count (mini-batches counted across epochs)?
(Because we have both epoch_id and step_id in the trainer code.)
Paddle/python/paddle/fluid/trainer.py
Lines 289 to 302 in 54ae8e4
```python
for epoch_id in range(num_epochs):
    event_handler(BeginEpochEvent(epoch_id))
    for step_id, data in enumerate(reader()):
        begin_event = BeginStepEvent(epoch_id, step_id)
        event_handler(begin_event)
        if begin_event.fetch_metrics:
            metrics = exe.run(feed=data,
                              fetch_list=[
                                  var.name
                                  for var in self.train_func_outputs
                              ])
        else:
            metrics = exe.run(feed=data, fetch_list=[])
        event_handler(EndStepEvent(epoch_id, step_id, metrics))
```
From the code we can see that step_id is independent for each epoch, but we can also get the epoch id from EndStepEvent.epoch_id.
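As a sketch, a handler that wants a monotonically increasing iteration count can derive one from epoch_id and step_id itself. The `EndStepEvent` class below is a hypothetical mock mirroring the fields in trainer.py, and `STEPS_PER_EPOCH` is an assumed quantity the user would know from their reader:

```python
class EndStepEvent:
    """Mock of fluid.EndStepEvent carrying both ids (hypothetical)."""
    def __init__(self, epoch_id, step_id, metrics):
        self.epoch = epoch_id
        self.step = step_id
        self.metrics = metrics

STEPS_PER_EPOCH = 5  # assumed known by the user (size of the reader)

def global_iteration(event):
    # event.step resets to 0 every epoch, so fold the epoch id
    # into the count to get a number that never goes backwards.
    return event.epoch * STEPS_PER_EPOCH + event.step

# Simulate two epochs of end-of-step events: step restarts at 0
# each epoch, but the derived iteration number keeps increasing.
iterations = [
    global_iteration(EndStepEvent(e, s, []))
    for e in range(2)
    for s in range(STEPS_PER_EPOCH)
]
print(iterations)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```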
Yeah, actually I was just confused by the naming. I was thinking that event.step should represent iterations, and an iteration count should probably never reset to 0. But in this case event.step is based on step_id.
However, step_id runs from 0 to the number of mini-batches, and after each epoch it is reset to 0.
So basically my point was: could we have a separate naming convention :)
```diff
-        for epoch_id in range(num_epochs):
-            self._train_by_any_executor(event_handler, pe, num_epochs,
-                                        reader)
+        self._train_by_any_executor(event_handler, pe, num_epochs, reader)
```
So what is the reason for getting rid of the for loop? Just curious 😁
This was a bug: _train_by_any_executor already iterates over num_epochs internally, so the outer loop made the epochs run num_epochs times over.
Paddle/python/paddle/fluid/trainer.py
Lines 288 to 291 in 54ae8e4
```python
def _train_by_any_executor(self, event_handler, exe, num_epochs, reader):
    for epoch_id in range(num_epochs):
        event_handler(BeginEpochEvent(epoch_id))
        for step_id, data in enumerate(reader()):
```
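The effect of the fix can be seen with a minimal pure-Python mock (not the real trainer): because the callee already loops over num_epochs, wrapping the call in another epoch loop runs num_epochs ** 2 epochs in total.

```python
epochs_run = []

def _train_by_any_executor(num_epochs):
    # Mock of the inner loop that already exists in trainer.py.
    for epoch_id in range(num_epochs):
        epochs_run.append(epoch_id)

num_epochs = 3

# Buggy caller: an extra outer loop around a function that loops itself.
for _ in range(num_epochs):
    _train_by_any_executor(num_epochs)
buggy_count = len(epochs_run)
print(buggy_count)  # 9 == num_epochs ** 2

# Fixed caller: call it once and let it own the epoch loop.
epochs_run.clear()
_train_by_any_executor(num_epochs)
print(len(epochs_run))  # 3
```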
👍
fix: #10754