[Feature] NCCL2 distributed training #10349
Conversation
@@ -80,7 +80,13 @@ ParallelExecutor::ParallelExecutor(
// Bcast Parameters to all GPUs
#ifdef PADDLE_WITH_CUDA
member_->nccl_ctxs_.reset(new platform::NCCLContextMap(member_->places_));
auto *nccl_id_var = scope->FindVar("NCCLID");
have a constant NCCL_ID variable for all "NCCLID" strings?
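For illustration, a minimal sketch of such a constant (the name kNCCLIdVarName is hypothetical, not taken from this PR):

// Hypothetical: one shared constant instead of repeating the "NCCLID" literal.
constexpr char kNCCLIdVarName[] = "NCCLID";

// Call sites would then look up the variable by the constant:
auto *nccl_id_var = scope->FindVar(kNCCLIdVarName);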
if(WITH_GPU)
op_library(gen_nccl_id_op DEPS nccl_common)
else()
set(DEPS_OPS ${DEPS_OPS} gen_nccl_id_op)
nit: indent
@@ -57,7 +57,9 @@ void ProcGetResponse(const VarHandle& var_h, const grpc::ByteBuffer& msg);
class BaseProcessor {
 public:
  explicit BaseProcessor(std::shared_ptr<grpc::Channel> ch) { context_ = NULL; }
  explicit BaseProcessor(std::shared_ptr<grpc::Channel> ch) {
    context_ = nullptr;
Is this necessary? Is the default nullptr?
Will refine the gRPC code in the next PR.
if (var->IsType<ncclUniqueId>()) {
  e.WriteVarlengthBeginning(VarMsg::kSerializedFieldNumber,
                            NCCL_UNIQUE_ID_BYTES);
  ncclUniqueId* uid = var->GetMutable<ncclUniqueId>();
will Get<>() work here?
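For reference, a rough sketch of the difference, assuming the usual framework::Variable accessors:

// Read-only access is enough if the id is only serialized here.
const ncclUniqueId& uid = var->Get<ncclUniqueId>();
// GetMutable<>() returns a non-const pointer and would only be needed
// if the serializer has to modify the variable.
ncclUniqueId* uid_ptr = var->GetMutable<ncclUniqueId>();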
// put nccl id in CPUPlace
auto& dev_ctx = *pool.Get(platform::CPUPlace());
int trainer_id = Attr<int>("trainer_id");
framework::Scope& local_scope = scope.NewScope();
Does this op create a new scope each time it's called?
 protected:
  mutable detail::AsyncGRPCServer* rpc_service_ = nullptr;
  mutable std::shared_ptr<std::thread> server_thread_;
why shared_ptr?
op_registry uses copy constructors, which unique_ptr does not provide.
Perhaps server_thread_ and rpc_service_ don't need to be protected members? They could just be temporary variables created and destroyed within GetIdByServer().
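A minimal, non-Paddle sketch of that alternative, just to illustrate the lifetime (names and details are illustrative only):

#include <thread>

// Illustrative only: the server thread lives entirely inside the function,
// so no protected members are needed.
void GetIdByServerSketch() {
  std::thread server_thread([] {
    // ... run the RPC service until the NCCL id has been received ...
  });
  // Block until the id has been delivered; the local is destroyed on return.
  server_thread.join();
}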
paddle/fluid/platform/nccl_helper.h
Outdated
explicit NCCLContextMap(const std::vector<platform::Place> &places) {
explicit NCCLContextMap(const std::vector<platform::Place> &places,
                        ncclUniqueId *nccl_id = nullptr,
                        size_t node_count = 0, size_t trainer_id = 0) {
node_count -> num_trainers?
should default num_trainers=1?
node_count -> num_trainers?
Done.
should default num_trainers=1?
I think not: when nccl_id != nullptr we are in NCCL distributed mode, where the number of trainers is always > 1.
But in a single-machine setup, the actual number of trainers is 1. If a user sees num_trainers=0 in the public interface of parallel_executor.py, they might just change it to 1 because they are actually using one node for training.
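For context, a rough sketch of how nccl_id, num_trainers, and trainer_id typically map onto ncclCommInitRank for multi-trainer initialization (an assumption about the intended usage, not code from this PR; names are illustrative):

#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

// Illustrative only: one communicator per local GPU, with a global rank
// computed from the trainer id and the local device index.
void InitMultiTrainerComms(ncclUniqueId id, int num_trainers, int trainer_id,
                           const std::vector<int>& local_gpus,
                           std::vector<ncclComm_t>* comms) {
  int ngpus = static_cast<int>(local_gpus.size());
  int nranks = num_trainers * ngpus;
  comms->resize(local_gpus.size());
  ncclGroupStart();  // needed when initializing several comms in one process
  for (int i = 0; i < ngpus; ++i) {
    int rank = trainer_id * ngpus + i;
    cudaSetDevice(local_gpus[i]);
    ncclCommInitRank(&(*comms)[i], nranks, id, rank);
  }
  ncclGroupEnd();
}

With num_trainers=1 and trainer_id=0 this reduces to ranks 0..ngpus-1 on a single machine, which is essentially the single-node case discussed above.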
paddle/fluid/platform/nccl_helper.h
Outdated
}
std::unique_ptr<ncclComm_t[]> comms(new ncclComm_t[order_.size()]);
// if pass nccl_id here, can assume we are doing multi node training
if (nccl_id == nullptr) {
{
no need for {} any more?
return;
}
std::unique_ptr<ncclComm_t[]> comms(new ncclComm_t[order_.size()]);
// if pass nccl_id here, can assume we are doing multi node training
Perhaps also test num_trainers > 1?
@@ -30,7 +30,9 @@ def __init__(self,
             num_threads=None,
             allow_op_delay=False,
             share_vars_from=None,
             use_default_grad_scale=True):
             use_default_grad_scale=True,
             num_nodes=0,
num_trainers? should default be 1?
Done. I commented in all the other places to explain why I didn't follow the suggestions.
Thanks for the detailed review! Very helpful.
class GenNCCLIdOpMaker : public framework::OpProtoAndCheckerMaker {
 public:
  GenNCCLIdOpMaker(OpProto* proto, OpAttrChecker* op_checker)
      : OpProtoAndCheckerMaker(proto, op_checker) {
Resolves: #10290
Note: if you run inside Docker, you need to set NCCL_SOCKET_IFNAME=docker0 (or your ethernet interface).
Known issues:
num_threads=1