[Feature] NCCL2 distributed training #10349

Merged: 21 commits into PaddlePaddle:develop, May 15, 2018

Conversation

@typhoonzero (Contributor) commented May 2, 2018

Resolves: #10290

Note: if you run inside Docker, you need to set NCCL_SOCKET_IFNAME=docker0 (or your eth interface).

Known issues:

  1. Can only run with the parallel executor with num_threads=1 (see the sketch after this list).
  2. In TCP mode, if a previous training job crashes, stale data left in the network queue will still be read when a new job starts.
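
A hedged usage sketch tying the above together. The num_trainers / trainer_id names follow the renaming discussed in the review below; the exact ParallelExecutor signature may differ between Fluid versions, and loss is assumed to be defined by your program:

    import os
    import paddle.fluid as fluid

    # Inside Docker, point NCCL at the right network interface (see note above).
    os.environ["NCCL_SOCKET_IFNAME"] = "docker0"  # or your eth interface

    pe = fluid.ParallelExecutor(
        use_cuda=True,         # assumed parameter; NCCL2 mode is GPU-only
        loss_name=loss.name,   # loss comes from your program definition
        num_threads=1,         # known issue 1: must be 1 for now
        num_trainers=2,        # total number of trainers in the job
        trainer_id=0)          # this trainer's rank, 0-based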

@typhoonzero typhoonzero changed the title Add gen_nccl_id_op [WIP] Add gen_nccl_id_op May 2, 2018
@typhoonzero typhoonzero changed the title [WIP] Add gen_nccl_id_op Add gen_nccl_id_op May 10, 2018
@typhoonzero typhoonzero changed the title Add gen_nccl_id_op [Feature] NCCL2 distributed training May 10, 2018
@@ -80,7 +80,13 @@ ParallelExecutor::ParallelExecutor(

// Bcast Parameters to all GPUs
#ifdef PADDLE_WITH_CUDA
member_->nccl_ctxs_.reset(new platform::NCCLContextMap(member_->places_));
auto *nccl_id_var = scope->FindVar("NCCLID");
Contributor:

have a constant NCCL_ID variable for all "NCCLID" strings?
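
A minimal sketch of that suggestion (the constant name kNCCLIdVarName is illustrative, not from this PR):

    // Illustrative: hoist the magic string into one shared constant.
    constexpr char kNCCLIdVarName[] = "NCCLID";

    // ...then at each use site:
    auto *nccl_id_var = scope->FindVar(kNCCLIdVarName);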

if(WITH_GPU)
op_library(gen_nccl_id_op DEPS nccl_common)
else()
set(DEPS_OPS ${DEPS_OPS} gen_nccl_id_op)
Contributor:

nit: indent

@@ -57,7 +57,9 @@ void ProcGetResponse(const VarHandle& var_h, const grpc::ByteBuffer& msg);

class BaseProcessor {
public:
explicit BaseProcessor(std::shared_ptr<grpc::Channel> ch) { context_ = NULL; }
explicit BaseProcessor(std::shared_ptr<grpc::Channel> ch) {
context_ = nullptr;
Contributor:

Is this necessary? Isn't the default nullptr?

Contributor (author):

Will refine the gRPC code in the next PR.

if (var->IsType<ncclUniqueId>()) {
e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber,
NCCL_UNIQUE_ID_BYTES);
ncclUniqueId* uid = var->GetMutable<ncclUniqueId>();
Contributor:

will Get<>() work here?
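
For reference, what that would look like, assuming the serializer only needs read access (Get<>() returns a const reference, while GetMutable<>() hands back a mutable pointer):

    // Illustrative: read-only access suffices when only serializing the ID.
    const ncclUniqueId& uid = var->Get<ncclUniqueId>();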

// put nccl id in CPUPlace
auto& dev_ctx = *pool.Get(platform::CPUPlace());
int trainer_id = Attr<int>("trainer_id");
framework::Scope& local_scope = scope.NewScope();
Contributor:

Does this op create a new scope each time it's called?


protected:
mutable detail::AsyncGRPCServer* rpc_service_ = nullptr;
mutable std::shared_ptr<std::thread> server_thread_;
Contributor:

why shared_ptr?

Contributor (author):

op_registry uses the copy constructor, which unique_ptr does not provide.
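
The constraint in a nutshell (a self-contained illustration, not Paddle code):

    #include <memory>

    struct WithUnique { std::unique_ptr<int> p; };  // move-only: copying is deleted
    struct WithShared { std::shared_ptr<int> p; };  // copyable

    int main() {
      WithShared a;
      WithShared b = a;      // OK: shared_ptr can be copied
      WithUnique c;
      // WithUnique d = c;   // error: unique_ptr has no copy constructor
      return 0;
    }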

Contributor:

perhaps server_thread_ and rpc_service_ don't need to be protected members? They could just be temporary variables created and destroyed within GetIdByServer().
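
Roughly what that would look like (a sketch; the AsyncGRPCServer constructor and method names here are assumed, not taken from this PR):

    // Sketch: scope the server and its thread to the call instead of the class.
    void GenNCCLIdOp::GetIdByServer(/* ... */) const {
      detail::AsyncGRPCServer rpc_service(endpoint, /*sync_mode=*/true);  // assumed ctor
      std::thread server_thread([&rpc_service] { rpc_service.RunSyncUpdate(); });  // assumed API
      // ... block until the NCCL ID has been received ...
      rpc_service.ShutDown();  // assumed API
      server_thread.join();    // both objects are destroyed when the function returns
    }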

explicit NCCLContextMap(const std::vector<platform::Place> &places) {
explicit NCCLContextMap(const std::vector<platform::Place> &places,
ncclUniqueId *nccl_id = nullptr,
size_t node_count = 0, size_t trainer_id = 0) {
Contributor:

node_count -> num_trainers?
should default num_trainers=1?

Contributor (author):

> node_count -> num_trainers?

Done.

> should default num_trainers=1?

I think not; when nccl_id != nullptr we are in NCCL distributed mode, where num_nodes is always > 1.

Contributor:

But in a single-machine setup, the actual number of trainers is 1. If a user sees num_trainers=0 in the public interface of parallel_executor.py, they might just change it to 1, because they are actually using one node for training.

}
std::unique_ptr<ncclComm_t[]> comms(new ncclComm_t[order_.size()]);
// if pass nccl_id here, can assume we are doing multi node training
if (nccl_id == nullptr) {
{
Contributor:

no need for {} any more?

return;
}
std::unique_ptr<ncclComm_t[]> comms(new ncclComm_t[order_.size()]);
// if pass nccl_id here, can assume we are doing multi node training
Contributor:

perhaps also add tests for num_trainers > 1?
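
For context, a hedged sketch of what multi-trainer initialization typically looks like with the NCCL2 API (rank layout assumed from the surrounding diff, not quoted from it):

    // Global ranks are trainer-major: each trainer owns order_.size() GPUs.
    int nranks = num_trainers * static_cast<int>(order_.size());
    ncclGroupStart();
    for (size_t i = 0; i < order_.size(); ++i) {
      int rank = static_cast<int>(trainer_id * order_.size() + i);
      cudaSetDevice(order_[i]);
      ncclCommInitRank(&comms[i], nranks, *nccl_id, rank);
    }
    ncclGroupEnd();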

@@ -30,7 +30,9 @@ def __init__(self,
num_threads=None,
allow_op_delay=False,
share_vars_from=None,
use_default_grad_scale=True):
use_default_grad_scale=True,
num_nodes=0,
Contributor:

num_trainers? should default be 1?

Contributor (author):

Done. In all the other places, I left comments explaining why I didn't follow the suggestions.

Thanks for the detailed review! Very helpful.

class GenNCCLIdOpMaker : public framework::OpProtoAndCheckerMaker {
public:
GenNCCLIdOpMaker(OpProto* proto, OpAttrChecker* op_checker)
: OpProtoAndCheckerMaker(proto, op_checker) {
Contributor:

Please merge the latest code; @reyoung has replaced xxOpMaker with Maker in PR #10486.

@panyx0718 panyx0718 merged commit 6ab935f into PaddlePaddle:develop May 15, 2018