[Feature] NCCL2 distributed training #10349

Merged: 21 commits into PaddlePaddle:develop, May 15, 2018

Conversation

@typhoonzero (Contributor) commented May 2, 2018

Resolves: #10290

Note: if you run inside Docker, you need to set NCCL_SOCKET_IFNAME=docker0 (or your eth interface).

Known issues:

  1. Can only run with the parallel executor with num_threads=1 (see the sketch after this list).
  2. In TCP mode, if a previous training job crashes, stale data left in the network queue will still be read when a new job starts.
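
A hedged usage sketch tying the above together. The num_trainers / trainer_id names follow the renaming discussed in the review below; the exact ParallelExecutor signature may differ between Fluid versions, and loss is assumed to be defined by your program:

    import os
    import paddle.fluid as fluid

    # Inside Docker, point NCCL at the right network interface (see note above).
    os.environ["NCCL_SOCKET_IFNAME"] = "docker0"  # or your eth interface

    pe = fluid.ParallelExecutor(
        use_cuda=True,         # assumed parameter; NCCL2 mode is GPU-only
        loss_name=loss.name,   # loss comes from your program definition
        num_threads=1,         # known issue 1: must be 1 for now
        num_trainers=2,        # total number of trainers in the job
        trainer_id=0)          # this trainer's rank, 0-based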

@typhoonzero typhoonzero changed the title Add gen_nccl_id_op [WIP] Add gen_nccl_id_op May 2, 2018
@typhoonzero typhoonzero changed the title [WIP] Add gen_nccl_id_op Add gen_nccl_id_op May 10, 2018
@typhoonzero typhoonzero changed the title Add gen_nccl_id_op [Feature] NCCL2 distributed training May 10, 2018
@@ -80,7 +80,13 @@ ParallelExecutor::ParallelExecutor(

// Bcast Parameters to all GPUs
#ifdef PADDLE_WITH_CUDA
member_->nccl_ctxs_.reset(new platform::NCCLContextMap(member_->places_));
auto *nccl_id_var = scope->FindVar("NCCLID");
Contributor:

have a constant NCCL_ID variable for all "NCCLID" strings?
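
A minimal sketch of that suggestion (the constant name kNCCLIdVarName is illustrative, not from this PR):

    // Illustrative: hoist the magic string into one shared constant.
    constexpr char kNCCLIdVarName[] = "NCCLID";

    // ...then at each use site:
    auto *nccl_id_var = scope->FindVar(kNCCLIdVarName);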

if(WITH_GPU)
op_library(gen_nccl_id_op DEPS nccl_common)
else()
set(DEPS_OPS ${DEPS_OPS} gen_nccl_id_op)
Contributor:

nit: indent

@@ -57,7 +57,9 @@ void ProcGetResponse(const VarHandle& var_h, const grpc::ByteBuffer& msg);

class BaseProcessor {
public:
explicit BaseProcessor(std::shared_ptr<grpc::Channel> ch) { context_ = NULL; }
explicit BaseProcessor(std::shared_ptr<grpc::Channel> ch) {
context_ = nullptr;
Contributor:

Is this necessary? Isn't the default nullptr?

Contributor (author):

Will refine the gRPC code in the next PR.

if (var->IsType<ncclUniqueId>()) {
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber,
NCCL_UNIQUE_ID_BYTES);
ncclUniqueId* uid = var->GetMutable<ncclUniqueId>();
Contributor:

will Get<>() work here?
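
For reference, what that would look like, assuming the serializer only needs read access (Get<>() returns a const reference, while GetMutable<>() hands back a mutable pointer):

    // Illustrative: read-only access suffices when only serializing the ID.
    const ncclUniqueId& uid = var->Get<ncclUniqueId>();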

// put nccl id in CPUPlace
auto& dev_ctx = *pool.Get(platform::CPUPlace());
int trainer_id = Attr<int>("trainer_id");
framework::Scope& local_scope = scope.NewScope();
Contributor:

Does this op create a new scope each time it's called?


protected:
mutable detail::AsyncGRPCServer* rpc_service_ = nullptr;
mutable std::shared_ptr<std::thread> server_thread_;
Contributor:

why shared_ptr?

Contributor (author):

op_registry uses the copy constructor, which unique_ptr does not provide.
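
The constraint in a nutshell (a self-contained illustration, not Paddle code):

    #include <memory>

    struct WithUnique { std::unique_ptr<int> p; };  // move-only: copying is deleted
    struct WithShared { std::shared_ptr<int> p; };  // copyable

    int main() {
      WithShared a;
      WithShared b = a;      // OK: shared_ptr can be copied
      WithUnique c;
      // WithUnique d = c;   // error: unique_ptr has no copy constructor
      return 0;
    }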

Contributor:

perhaps server_thread_ and rpc_service_ don't need to be protected members? They could just be temporary variables created and destroyed within GetIdByServer().
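
Roughly what that would look like (a sketch; the AsyncGRPCServer constructor and method names here are assumed, not taken from this PR):

    // Sketch: scope the server and its thread to the call instead of the class.
    void GenNCCLIdOp::GetIdByServer(/* ... */) const {
      detail::AsyncGRPCServer rpc_service(endpoint, /*sync_mode=*/true);  // assumed ctor
      std::thread server_thread([&rpc_service] { rpc_service.RunSyncUpdate(); });  // assumed API
      // ... block until the NCCL ID has been received ...
      rpc_service.ShutDown();  // assumed API
      server_thread.join();    // both objects are destroyed when the function returns
    }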

explicit NCCLContextMap(const std::vector<platform::Place> &places) {
explicit NCCLContextMap(const std::vector<platform::Place> &places,
ncclUniqueId *nccl_id = nullptr,
size_t node_count = 0, size_t trainer_id = 0) {
Contributor:

node_count -> num_trainers?
should default num_trainers=1?

Contributor (author):

> node_count -> num_trainers?

Done.

> should default num_trainers=1?

I think not; when nccl_id != nullptr we are in NCCL distributed mode, where num_nodes is always > 1.

Contributor:

But in a single-machine setup, the actual number of trainers is 1. If a user sees num_trainers=0 in the public interface of parallel_executor.py, they might just change it to 1, because they are actually using one node for training.

}
std::unique_ptr<ncclComm_t[]> comms(new ncclComm_t[order_.size()]);
// if pass nccl_id here, can assume we are doing multi node training
if (nccl_id == nullptr) {
{
Contributor:

no need for {} any more?

return;
}
std::unique_ptr<ncclComm_t[]> comms(new ncclComm_t[order_.size()]);
// if pass nccl_id here, can assume we are doing multi node training
Contributor:

perhaps also add tests for num_trainers > 1?
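
For context, a hedged sketch of what multi-trainer initialization typically looks like with the NCCL2 API (rank layout assumed from the surrounding diff, not quoted from it):

    // Global ranks are trainer-major: each trainer owns order_.size() GPUs.
    int nranks = num_trainers * static_cast<int>(order_.size());
    ncclGroupStart();
    for (size_t i = 0; i < order_.size(); ++i) {
      int rank = static_cast<int>(trainer_id * order_.size() + i);
      cudaSetDevice(order_[i]);
      ncclCommInitRank(&comms[i], nranks, *nccl_id, rank);
    }
    ncclGroupEnd();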

@@ -30,7 +30,9 @@ def __init__(self,
num_threads=None,
allow_op_delay=False,
share_vars_from=None,
use_default_grad_scale=True):
use_default_grad_scale=True,
num_nodes=0,
Contributor:

num_trainers? should default be 1?

Contributor (author):

Done. In all the other places, I left comments explaining why I didn't follow the suggestions.

Thanks for the detailed review! Very helpful.

class GenNCCLIdOpMaker : public framework::OpProtoAndCheckerMaker {
public:
GenNCCLIdOpMaker(OpProto* proto, OpAttrChecker* op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
Contributor:

Please merge the latest code; @reyoung has replaced xxOpMaker with Maker in PR #10486.

@panyx0718 panyx0718 merged commit 6ab935f into PaddlePaddle:develop May 15, 2018