Fluid new API: dist train without modifying code #10316
Conversation
python/paddle/fluid/trainer.py
Outdated
    # TODO(helin): support distributed training

    new_startup_prog, new_main_prog = dist_traspile_if_necessary(
        startup_prog, main_prog, optimize_ops, params_grads)
optimize_ops and params_grads are returned from optimizer.minimize().
python/paddle/fluid/trainer.py
Outdated
    # TODO(helin): support distributed training

    startup_prog, main_prog = dist_traspile_if_necessary(
        startup_prog, main_prog, optimize_ops, params_grads)
@cs2be: optimize_ops and params_grads are returned by optimizer.minimize(), which is called when creating the local startup program and main program.
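For context, a minimal sketch of how these values could be produced and consumed; the network, optimizer, and call site below are illustrative, and only dist_traspile_if_necessary comes from this PR (its exact import path may differ):

    import paddle.fluid as fluid

    startup_prog = fluid.Program()
    main_prog = fluid.Program()
    with fluid.program_guard(main_prog, startup_prog):
        # Tiny illustrative network.
        x = fluid.layers.data(name='x', shape=[13], dtype='float32')
        y = fluid.layers.data(name='y', shape=[1], dtype='float32')
        y_predict = fluid.layers.fc(input=x, size=1)
        avg_cost = fluid.layers.mean(
            fluid.layers.square_error_cost(input=y_predict, label=y))

        sgd = fluid.optimizer.SGD(learning_rate=0.001)
        # minimize() returns the optimize ops and the (param, grad) pairs,
        # which is what the transpile helper needs.
        optimize_ops, params_grads = sgd.minimize(avg_cost)

    # Hypothetical call site for the helper added in this PR.
    startup_prog, main_prog = dist_traspile_if_necessary(
        startup_prog, main_prog, optimize_ops, params_grads)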
python/paddle/fluid/trainer.py
Outdated
    @@ -30,6 +33,48 @@ def __init__(self):
            self.type = Event.BEGIN_EPOCH


    def dist_traspile_if_necessary(startup_prog, main_prog, optimize_ops,
Should we have one class Trainer and add methods for local and distributed training to it, or should we have derived classes LocalTrainer, DistributedTrainer, etc.?
I think the user should not need to change code when switching from local training to distributed training; e.g., the user should not need to switch from fluid.Trainer to fluid.DistributedTrainer. Still, we can use the same external API (e.g., fluid.Trainer) and internally have a class hierarchy for code reuse. Since we are currently not very certain what the code would look like, I would prefer to refactor it later if needed.
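A rough sketch of that idea, kept as a facade over internal backends (illustrative only; _LocalTrainerImpl and _DistTrainerImpl are hypothetical names, not classes in this PR):

    import os


    class _LocalTrainerImpl(object):
        # Placeholder local backend (illustrative only).
        def train(self, num_epochs):
            print("local training for %d epochs" % num_epochs)


    class _DistTrainerImpl(object):
        # Placeholder distributed backend (illustrative only).
        def train(self, num_epochs):
            print("distributed training for %d epochs" % num_epochs)


    class Trainer(object):
        """One public class; the backend is chosen internally, so user
        code stays the same for local and distributed training."""

        def __init__(self):
            # Dispatch on the environment instead of asking the user to
            # pick a different class.
            if os.getenv("PADDLE_TRAINING_ROLE"):
                self._impl = _DistTrainerImpl()
            else:
                self._impl = _LocalTrainerImpl()

        def train(self, num_epochs):
            return self._impl.train(num_epochs)

With this shape, user scripts construct fluid.Trainer the same way in both modes and only the environment changes.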
Works with 1 trainer and 1 pserver. 2 trainers and 1 pserver gets stuck at the end of the first step; still investigating. The user only needs to set environment variables to enable distributed training.

run pserver:

    PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_CURRENT_IP=127.0.0.1 python no_test_word2vec_new_api.py

run trainer:

    PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 python no_test_word2vec_new_api.py
LGTM, thanks. We may need to document how the trainer determines whether we are doing distributed training.
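For reference, one way such detection could read the environment variables shown in the run commands above (a sketch only; the actual helper in this PR may be organized differently):

    import os


    def dist_env_or_none():
        """Return distributed-training settings taken from the environment,
        or None when PADDLE_TRAINING_ROLE is unset (plain local training).
        Mirrors the variables used in the commands above; illustrative only."""
        role = os.getenv("PADDLE_TRAINING_ROLE")
        if role is None:
            return None
        return {
            "role": role,  # "PSERVER" or "TRAINER"
            "pserver_ips": os.getenv("PADDLE_PSERVER_IPS", ""),
            "trainers": int(os.getenv("PADDLE_TRAINERS", "1")),
            "trainer_id": int(os.getenv("PADDLE_TRAINER_ID", "0")),
            "current_ip": os.getenv("PADDLE_CURRENT_IP", ""),
        }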
Fixes: #10312
Works with 1 trainer and 1 pserver. 2 trainers and 1 pserver gets stuck at the end of the first step; still investigating.

The user only needs to set environment variables to enable distributed training.

run pserver:

    PADDLE_TRAINING_ROLE=PSERVER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_CURRENT_IP=127.0.0.1 python no_test_word2vec_new_api.py

run trainer:

    PADDLE_TRAINING_ROLE=TRAINER PADDLE_PSERVER_IPS=127.0.0.1 PADDLE_TRAINERS=2 PADDLE_TRAINER_ID=0 python no_test_word2vec_new_api.py