Support testing during training by ParallelExecutor. #9656
Conversation
LG Overall.
Have some thoughts about the API:
Currently, ParallelExecutor has so many arguments that it is not easy for the user to know which ones to set for training or for inference.
How about having a ParallelTrainExecutor and a ParallelInferExecutor that wrap ParallelExecutor?
I don't quite like sharing local_scopes. It is not something a normal program can do, and the user has no idea what it is or what effect it has. How about:
ParallelInferExecutor(share_vars_from=train_executor)
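A sketch of how the proposed split could look from the user's side; ParallelTrainExecutor and ParallelInferExecutor are only names suggested in this review, not an existing API, and every argument except share_vars_from is assumed for illustration:

# Hypothetical wrappers around ParallelExecutor, as suggested above.
train_exe = ParallelTrainExecutor(
    use_cuda=True,
    loss_name=loss.name,
    main_program=train_program)

# Inference needs no loss and no startup run; it only declares which
# executor it shares parameters with.
infer_exe = ParallelInferExecutor(
    main_program=test_program,
    share_vars_from=train_exe)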
// Create local scopes
for (size_t i = 0; i < member_->places_.size(); ++i) {
  member_->local_scopes_.push_back(&scope->NewScope());
if (local_scopes.size() == 0) {
local_scopes.empty()
Done.
- loss_name,
- use_cuda,
+ loss_name=None,
+ use_cuda=None,
Should this be True or False?
Modify the interface.
main_program=None,
startup_program=None,
local_scopes=None,
run_startup=True):
If startup_program is None, then startup is not run?
startup_program is always used, even for parallel testing; the code is here: https://github.com/PaddlePaddle/Paddle/blob/develop/paddle/fluid/framework/parallel_executor.cc#L69
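To make the behaviour concrete, a minimal sketch; the fallback to fluid.default_startup_program() when startup_program is None follows the usual fluid convention and is an assumption here, not something stated in this thread (loss and test_program are assumed to be defined elsewhere):

import paddle.fluid as fluid

# startup_program=None does not mean startup is skipped: the constructor
# still uses a startup program internally (parallel_executor.cc#L69),
# presumably falling back to fluid.default_startup_program().
train_exe = fluid.ParallelExecutor(use_cuda=True, loss_name=loss.name)

# Whether that startup program is actually executed is controlled by
# run_startup; a testing executor passes run_startup=False so it does not
# re-initialize parameters the training executor has already set up.
# (Scope sharing via local_scopes is omitted here; see the usage sketch at
# the end of this page.)
test_exe = fluid.ParallelExecutor(
    use_cuda=True, main_program=test_program, run_startup=False)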
I don't quite like sharing local_scopes. It is not something a normal program can do, and the user has no idea what it is or what effect it has. How about:
ParallelInferExecutor(share_vars_from=train_executor)
Done.
Fix #9571
run_startup can now be set to False, and loss_name is optional for a testing ParallelExecutor. With this, the following testing code can run successfully; the correctness will be verified later.
The usage is as follows:
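The code snippet that followed is not preserved in this extract, so the sketch below reconstructs the intended usage from the arguments discussed in this PR (loss_name optional for the testing executor, local_scopes shared from the training executor, run_startup=False); the program construction and the name of the accessor for the trainer's local scopes are assumptions:

import numpy as np
import paddle.fluid as fluid

# Assume train_program and test_program have been built so that they share
# parameters, and loss is the loss variable of train_program.

# Training executor: has a loss, runs the startup program once.
train_exe = fluid.ParallelExecutor(
    use_cuda=True,
    loss_name=loss.name,
    main_program=train_program)

# Testing executor: no loss_name, reuses the trainer's local scopes so both
# executors see the same parameters, and does not run startup again.
test_exe = fluid.ParallelExecutor(
    use_cuda=True,
    main_program=test_program,
    local_scopes=train_exe.local_scopes,  # hypothetical accessor name
    run_startup=False)

for step in range(100):
    train_loss, = train_exe.run(fetch_list=[loss.name])
    test_loss, = test_exe.run(fetch_list=[loss.name])
    print(np.mean(train_loss), np.mean(test_loss))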