Fluid new API: dist train without modifying code #10316
Conversation
python/paddle/fluid/trainer.py
Outdated
    # TODO(helin): support distributed training

    new_startup_prog, new_main_prog = dist_traspile_if_necessary(
        startup_prog, main_prog, optimize_ops, params_grads)
optimize_ops and params_grads are returned from optimizer.minimize().
python/paddle/fluid/trainer.py
Outdated
    # TODO(helin): support distributed training

    startup_prog, main_prog = dist_traspile_if_necessary(
        startup_prog, main_prog, optimize_ops, params_grads)
@cs2be: optimize_ops and params_grads are returned by optimizer.minimize(), which is called when creating the local startup program and main program.
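For context, a minimal sketch of how these values could be produced and consumed; the network, optimizer, and call site below are illustrative, and only dist_traspile_if_necessary comes from this PR (its exact import path may differ):

    import paddle.fluid as fluid

    startup_prog = fluid.Program()
    main_prog = fluid.Program()
    with fluid.program_guard(main_prog, startup_prog):
        # Tiny illustrative network.
        x = fluid.layers.data(name='x', shape=[13], dtype='float32')
        y = fluid.layers.data(name='y', shape=[1], dtype='float32')
        y_predict = fluid.layers.fc(input=x, size=1)
        avg_cost = fluid.layers.mean(
            fluid.layers.square_error_cost(input=y_predict, label=y))

        sgd = fluid.optimizer.SGD(learning_rate=0.001)
        # minimize() returns the optimize ops and the (param, grad) pairs,
        # which is what the transpile helper needs.
        optimize_ops, params_grads = sgd.minimize(avg_cost)

    # Hypothetical call site for the helper added in this PR.
    startup_prog, main_prog = dist_traspile_if_necessary(
        startup_prog, main_prog, optimize_ops, params_grads)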
python/paddle/fluid/trainer.py
Outdated
    @@ -30,6 +33,48 @@ def __init__(self):
            self.type = Event.BEGIN_EPOCH


    def dist_traspile_if_necessary(startup_prog, main_prog, optimize_ops,
Should we have one class Trainer and add methods for local and distributed training to it, or should we have derived classes LocalTrainer, DistributedTrainer, etc.?
I think the user should not need to change code when switching from local training to distributed training; e.g., the user should not need to switch from fluid.Trainer to fluid.DistributedTrainer. Still, we can use the same external API (e.g., fluid.Trainer) and internally have a class hierarchy for code reuse. Since we are currently not very certain what the code would look like, I would prefer to refactor it later if needed.
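A rough sketch of that idea, kept as a facade over internal backends (illustrative only; _LocalTrainerImpl and _DistTrainerImpl are hypothetical names, not classes in this PR):

    import os


    class _LocalTrainerImpl(object):
        # Placeholder local backend (illustrative only).
        def train(self, num_epochs):
            print("local training for %d epochs" % num_epochs)


    class _DistTrainerImpl(object):
        # Placeholder distributed backend (illustrative only).
        def train(self, num_epochs):
            print("distributed training for %d epochs" % num_epochs)


    class Trainer(object):
        """One public class; the backend is chosen internally, so user
        code stays the same for local and distributed training."""

        def __init__(self):
            # Dispatch on the environment instead of asking the user to
            # pick a different class.
            if os.getenv("PADDLE_TRAINING_ROLE"):
                self._impl = _DistTrainerImpl()
            else:
                self._impl = _LocalTrainerImpl()

        def train(self, num_epochs):
            return self._impl.train(num_epochs)

With this shape, user scripts construct fluid.Trainer the same way in both modes and only the environment changes.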
Works with 1 trainer and 1 pserver. 2 trainers and 1 pserver gets stuck at the end of the first step; still investigating. The user only needs to set environment variables to enable distributed training.

run pserver:

    PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_CURRENT_IP=127.0.0.1 python no_test_word2vec_new_api.py

run trainer:

    PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 python no_test_word2vec_new_api.py
LGTM, thanks. We may need to document how the trainer determines whether we are doing distributed training.
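For reference, one way such detection could read the environment variables shown in the run commands above (a sketch only; the actual helper in this PR may be organized differently):

    import os


    def dist_env_or_none():
        """Return distributed-training settings taken from the environment,
        or None when PADDLE_TRAINING_ROLE is unset (plain local training).
        Mirrors the variables used in the commands above; illustrative only."""
        role = os.getenv("PADDLE_TRAINING_ROLE")
        if role is None:
            return None
        return {
            "role": role,  # "PSERVER" or "TRAINER"
            "pserver_ips": os.getenv("PADDLE_PSERVER_IPS", ""),
            "trainers": int(os.getenv("PADDLE_TRAINERS", "1")),
            "trainer_id": int(os.getenv("PADDLE_TRAINER_ID", "0")),
            "current_ip": os.getenv("PADDLE_CURRENT_IP", ""),
        }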
Fixes: #10312
Works with 1 trainer and 1 pserver. 2 trainers and 1 pserver gets stuck at the end of the first step; still investigating.

The user only needs to set environment variables to enable distributed training.

run pserver:

    PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_CURRENT_IP=127.0.0.1 python no_test_word2vec_new_api.py

run trainer:

    PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 python no_test_word2vec_new_api.py