[MXNET-366] Extend MXNet Distributed Training by AllReduce #10696
Conversation
For PS-Lite, I changed the ps-lite submodule to point to my private fork, because when enabling the dist_sync_mpi kvstore we depend on ps-lite's Google protobuf for formatted message communication, and the original Google protobuf in ps-lite (v2.5.1) has a bug in message serialization and deserialization. So I upgraded the Google protobuf in ps-lite to v3.5.1.
The future of training NNs belongs to the CPU.
@threeleafzerg Could you share instructions for building this, and add it as a step to the CI to verify the build? It looks like I need to install MPI separately. Or can we include it in the src as a 3rd-party lib?
About the numbers above: does 'Local Batch Size: 64' refer to 64 across many GPUs? Can you run a benchmark with a higher batch size (e.g., 512 with 8 GPUs or 256 with 4 GPUs, for ResNet-50 on ImageNet) and compare the scalability?
I installed MPI with
@rahul003 Certainly, we can also use the following logic. Which one do you prefer? We need consensus.
@rahul003 Local Batch Size: 64 means every node's batch size is 64, so the global batch size is 64 * 8 = 512.
@rahul003 For MPICH: if you install the Ubuntu package of MPICH directly, its header files and library files are not under the same subfolder. I suggest downloading MPICH and compiling it from source. (https://www.mpich.org/downloads/)
Also ccing @eric-haibin-lin for reviews
@@ -0,0 +1,827 @@
/*
MXNet has no files with the cxx extension. Could you change it to cc?
cxx was selected on purpose, originally to solve a compilation dependency. Our mpi_collectives.cxx depends on the generated mpi_message.pb.cc and mpi_message.pb.h to compile. The Makefile rule SRC = $(wildcard src/*/*/*/*.cc src/*/*/*.cc src/*/*.cc src/*.cc) would pick up mpi_collectives.cc if we didn't change the suffix; $(SRC) has its own compilation rule, and the MPI-related files need a different one. That's why I used a different suffix. But I understand your concern: I can use the Makefile built-in function filter-out to remove the MPI files from SRC. I will change the suffix to cc.
#if DEBUG_ON
#define MXMPI_DEBUG(rank, fmt, args...) \
do { \
Curious, why this do {} while (0) syntax?
Agreed, it's not necessary here. I will remove it.
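(For context, the usual motivation for the idiom: wrapping a multi-statement macro body in do { } while (0) makes the expansion a single statement, so the macro composes safely with unbraced if/else; with a single-statement body, as here, it is indeed unnecessary. A minimal standalone sketch, with macro names that are illustrative, not from this PR:)

```cpp
#include <cstdio>

// Illustrative macros, not from the PR.
#define LOG_TWICE_UNSAFE(msg) \
  std::printf("%s\n", msg);   \
  std::printf("%s\n", msg)

#define LOG_TWICE_SAFE(msg)   \
  do {                        \
    std::printf("%s\n", msg); \
    std::printf("%s\n", msg); \
  } while (0)

int main() {
  bool verbose = false;
  if (verbose)
    LOG_TWICE_SAFE("hello");  // expands to exactly one statement
  else
    std::printf("quiet\n");
  // With LOG_TWICE_UNSAFE the if/else above would not compile: the second
  // printf would land between the if body and the else.
  return 0;
}
```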
@rahul003 For ResNet-50: local batch size 64, global batch size 64 * 8 = 512 (8 machines). Yes, we trained entirely on CPU. In general, allreduce performance should be similar for OpenMPI and MPICH. Intel MPI has better allreduce performance, but it is not free software, though its runtime part is free. I agree with you that we should select OpenMPI as the default MPI if no one objects (we will download the OpenMPI archive into 3rdparty and compile it). For proto3: I tested the original kvstore type dist_sync, and it works fine with PS-Lite. Moreover, we only use protobuf 3.5.1 ourselves; PS-Lite still uses proto2 (we just need to specify its version explicitly).
@rahul003 I've finished the code modification according to your comments. I added MPICH as the default MPI and verified that it works fine. Let me know if there are any questions.
@eric-haibin-lin @rahul003 The planned work is now done from our side. Both functionality and performance have been validated. Please help review again. Feel free to let us know about any open issues.
Is the current CI blocking you from adding unit tests?
.gitmodules
Outdated
@@ -6,7 +6,7 @@
 	url = https://github.com/dmlc/dmlc-core.git
 [submodule "3rdparty/ps-lite"]
 	path = 3rdparty/ps-lite
-	url = https://github.com/dmlc/ps-lite
+	url = https://github.com/threeleafzerg/ps-lite.git
Please remember to revert after dmlc/ps-lite#137 is merged.
Certainly.
@@ -339,12 +339,32 @@ ifeq ($(USE_DIST_KVSTORE), 1)
LDFLAGS += $(PS_LDFLAGS_A)
endif
# for kvstore with type dist_sync_mpi |
@szha could you help review Makefile?
python/mxnet/kvstore.py
Outdated
raise Exception("This api is not supported for kvstore with type %s. \
                Please use pushpull instead."%self.type)
def pushpull(self, key, ins, outs, priority=0): |
Does this API invoke the updater at all?
When using the dist_sync_mpi kvstore, its gradient update policy is the same as the local kvstore: it always uses the updater. You can check the set_optimizer API of kvstore in this patch. Hope I didn't misunderstand your meaning.
I guess the answer to Haibin's question is that update_on_kvstore is set to False when using this MPI kvstore.
@eric-haibin-lin Currently, dist_sync_kvstore.py is added in the nightly test-all.sh, but it is under the MXNet GPU build. We will add dist_sync_mpi_kvstore.py there and run it in CI. In the nightly build, this test script won't be enabled until GPU is supported.
Makefile
Outdated
@@ -382,6 +402,10 @@ else
endif
endif
MPI_SRC = $(wildcard src/mpi_collectives/src/*.cc) |
SRC_FILTER is the same as MPI_SRC.
@szha I know that. SRC_FILTER is put here for future extension: maybe in the future another source tree under src/ will need a different compilation rule, just like MPI_SRC. SRC_FILTER would then look like SRC_FILTER += $(wildcard src/[other-component]/src/*.cc).
Let's put the simplest possible code that does the job in the codebase, and let the next person worry about how to use different compilation rules for new files. Most likely, if anyone other than you works on this code, your comment here will be nowhere to be found by that contributor. "Potential extension" is often a synonym for tech debt.
OK. Then how about:
MPI_SRC = $(wildcard src/mpi_collectives/src/*.cc)
...
SRC_FILTER = $(MPI_SRC)
@ptrendx @DickJC123 any comments on the reserved interface for CUDA MPI/NCCL in this PR?
include/mxnet/kvstore.h
Outdated
virtual void PushPull(const std::vector<int> &keys,
                      const std::vector<NDArray*> &in_values,
                      const std::vector<NDArray*> &out_values,
                      int priority = 0) {}
Shall we add LOG(FATAL) to the default implementation?
OK. I will add them.
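(For reference, a minimal sketch of the requested pattern, assuming dmlc-style logging; LOG(FATAL) is emulated with a stand-in here so the snippet compiles on its own:)

```cpp
#include <cstdlib>
#include <iostream>
#include <vector>

struct NDArray;  // stand-in for mxnet::NDArray

class KVStore {
 public:
  virtual ~KVStore() = default;
  // The default implementation fails loudly instead of silently doing
  // nothing, so kvstore types without PushPull support surface the error.
  virtual void PushPull(const std::vector<int> &keys,
                        const std::vector<NDArray*> &in_values,
                        const std::vector<NDArray*> &out_values,
                        int priority = 0) {
    // In MXNet this would be LOG(FATAL) << "..."; emulated here:
    std::cerr << "PushPull is not supported by this kvstore type" << std::endl;
    std::abort();
  }
};
```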
python/mxnet/kvstore.py
Outdated
@@ -166,6 +166,9 @@ def push(self, key, value, priority=0):
        There is no synchronization between workers.
        One can use ``_barrier()`` to sync all workers.
Note: This api is not supported for kvstore with type dist_sync_mpi. Use pushpull |
nit: we can improve it with :py:meth:`pushpull` so that the doc renders with nice hyperlinks. See https://mxnet.incubator.apache.org/versions/master/api/python/gluon/gluon.html#mxnet.gluon.ParameterDict.get_constant for example.
OK.
 * Copyright (c) 2018 by Contributors
 */
#ifndef MXNET_MPI_COLLECTIVES_INCLUDE_MPI_COLLECTIVES_H_ |
Not sure if we want to create a src/mpi_collectives folder. This is only used by kvstore. Maybe inside the kvstore folder? The namespace is also under mxnet::kvstore.
Haibin, do you mean the folder structure should be like src/kvstore/mpi_collectives? Since it's only used by kvstore, that's OK.
namespace mxnet {
namespace kvstore {
Although some relevant documentation is added to c_api.h, I still expect header files like this to come with detailed documentation describing functionality, arguments, and return values.
OK. I will add Doxygen-style comments for these functions.
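(For instance, a sketch of the requested style; the parameter descriptions are illustrative guesses, not the PR's actual documentation, though the signature follows the MXMPIAllReduce call seen later in this review:)

```cpp
/*!
 * \brief Reduce the given arrays across all MPI ranks and write the
 *        reduced results to out_values on every node.
 * \param keys integer ids identifying the arrays to reduce
 * \param in_values locally computed values (e.g. gradients) on this rank
 * \param out_values destination arrays for the globally reduced results
 * \param priority scheduling priority; higher-priority operations run first
 * \return 0 on success, a nonzero error code otherwise
 */
int MXMPIAllReduce(const std::vector<int> &keys,
                   const std::vector<NDArray*> &in_values,
                   const std::vector<NDArray*> &out_values,
                   int priority);
```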
#include <stdio.h>
#include <vector>
#define DEBUG_ON 0 |
MPI_UTIL_DEBUG_ON
OK. I will refine it.
template<typename DType>
MPI_Datatype MPI_Data_Type_Cast(void) {
  LOG(FATAL) << "Need to template specialization to get mpi data type";
Why do you need a default definition that throws errors at runtime? You already added the template specializations. Would a template declaration like the one below be sufficient?
template<typename DType>
MPI_Datatype MPI_Data_Type_Cast(void);
My original point was that if anyone uses an unsupported data type, it can be caught by a runtime exception. Your way, any unsupported data type is caught at build time instead. OK for me to remove the default implementation.
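(A standalone sketch of the difference, with MPI datatypes replaced by strings so it runs without MPI; the function name is illustrative:)

```cpp
#include <cstdio>

// The primary template is declared but deliberately not defined: using an
// unsupported DType now fails at build (link) time instead of at runtime.
template <typename DType>
const char* MpiDataTypeName();

template <> const char* MpiDataTypeName<float>()  { return "MPI_FLOAT"; }
template <> const char* MpiDataTypeName<double>() { return "MPI_DOUBLE"; }
template <> const char* MpiDataTypeName<int>()    { return "MPI_INT"; }

int main() {
  std::printf("%s\n", MpiDataTypeName<float>());  // prints MPI_FLOAT
  // MpiDataTypeName<char>();  // undefined reference: caught before running
  return 0;
}
```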
typedef std::unordered_map<std::string, std::vector<MPIRequest> > MessageTable;

struct MPIGlobalState {
  std::atomic_flag initialized_flag = ATOMIC_FLAG_INIT;
Some descriptions would be helpful
I will add them in the next commit.
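(A sketch of the kind of description being asked for; the stated role of initialized_flag is inferred from the surrounding code, not taken from the PR:)

```cpp
#include <atomic>

struct MPIGlobalState {
  // Guards one-time initialization of the MPI background machinery:
  // std::atomic_flag makes the test-and-set race-free, so only the first
  // thread to arrive performs the setup. (Inferred role, not PR text.)
  std::atomic_flag initialized_flag = ATOMIC_FLAG_INIT;
};
```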
src/kvstore/kvstore_dist_sync_mpi.h
Outdated
int priority) override {
  int ret = MXMPIAllReduce(keys, in_values, out_values, priority);
  if (ret != 0) {
    LOG(WARNING) << "MXMPIAllReduce is not successful. ret: " << ret;
Why is it a LOG(WARNING) instead of LOG(FATAL)?
Agree with you.
@eric-haibin-lin
In kvstore_dist_sync_allreduce.h:
For some custom MPI versions (such as OpenMPI), directly using gcc/g++ may cause problems (though it seems OK for MPICH for now). I strongly recommend adding rules to the Makefile so that when the USE_ALLREDUCE_DIST_KVSTORE flag is on, gcc/g++ is replaced by mpicc/mpiCC or mpiCXX.
@marcoabreu Finished code modification according to your review comments.
@xcgoner
@threeleafzerg
@xcgoner I am using OpenMPI (version 3.1) too, and I didn't encounter any link error (my Linux distro is CentOS 7). I tried replacing the default g++ with mpicxx (changing $(CXX)); the build is OK, but one library (libmklml_intel.so) cannot be found at runtime. I can work with the MKL-DNN support team on that, but it may not be included in this PR. BTW: after you built with mpicc and OpenMPI successfully, does the mxnet allreduce kvstore work fine?
@threeleafzerg
@xcgoner
2. Change the implementation of allreduce kvstore's barrier
Thanks for the great contribution, @threeleafzerg!
Since MXNet will integrate Horovod into the framework, I have decided to close this PR.
Many thanks for @threeleafzerg's efforts to enable the first version of the AllReduce solution with MXNet 🥇
I have a problem when testing the allreduce version: `pure virtual method called`. It seems the error occurred while destructing the kvstore. Can anyone help me? @threeleafzerg
@NanYan1119 I also noticed this bug. I will debug it this weekend.
@NanYan1119 Since Horovod will be the official allreduce solution in MXNet, can you switch to mxnet.horovod? Thanks!
@NanYan1119 For mxnet.horovod details, please consult eric-haibin-lin or ctcyang.
@threeleafzerg Sorry for the late reply! It's a good idea. Do you know when Horovod for MXNet will be released?
@NanYan1119 You can consult with ctcyang. Their proposal has already been posted on the design wiki.
@threeleafzerg Thank you! I will keep an eye on that.
Description
We extend MXNet distributed training by adding a new type of kvstore (dist_sync_allreduce).
In this type of kvstore, since there is no parameter server, we replace the original kvstore APIs push and pull with a single API, pushpull. We also add a broadcast API to broadcast the weights from the root rank to the other nodes. You can refer to the API Spec part of the design doc for details. We create a new thread on each node to synchronize and perform allreduce through MPI.
Currently, we implement the CPU part and leave a placeholder for GPU. It is easy to extend to GPU with MPI support or NCCL v2 support.
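(For intuition, a minimal standalone sketch, using plain MPI rather than this PR's kvstore API, of what a synchronous step looks like: broadcast initial weights from the root rank, allreduce the gradients, then apply the same update locally on every node:)

```cpp
// Compile with mpicxx, run with mpirun -n <N>. Illustrative only.
#include <mpi.h>
#include <cstdio>
#include <vector>

int main(int argc, char** argv) {
  MPI_Init(&argc, &argv);
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  std::vector<float> weights(4, 1.0f);
  // "broadcast": sync initial weights from the root rank (rank 0).
  MPI_Bcast(weights.data(), 4, MPI_FLOAT, 0, MPI_COMM_WORLD);

  std::vector<float> grad(4, static_cast<float>(rank + 1));  // local gradients
  std::vector<float> grad_sum(4, 0.0f);
  // "pushpull": every rank contributes its gradient and receives the sum.
  MPI_Allreduce(grad.data(), grad_sum.data(), 4,
                MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD);

  const float lr = 0.1f;
  for (int i = 0; i < 4; ++i)
    weights[i] -= lr * grad_sum[i] / size;  // identical update on every rank

  if (rank == 0) std::printf("w[0] after one step: %f\n", weights[0]);
  MPI_Finalize();
  return 0;
}
```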
Following is the design doc:
https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094
Performance and Accuracy
For performance, our testing environment: Machine: SKX6148; Network: 10GbE; Topology: VGG16; Local Batch Size: 64; Single-Node FPS: 27.3 (images/second).
![performance-allreduce](https://user-images.githubusercontent.com/20310371/39390275-be929a70-4ac4-11e8-83a5-669dcd83acf8.png)
![resnet-50](https://user-images.githubusercontent.com/20310371/39390280-cc7af3e4-4ac4-11e8-945b-ffff973101fd.png)
For accuracy, we compared the top-1 and top-5 accuracy curves between 4 P100 GPUs (with the original kvstore dist_device) and 4 SKX6148 CPUs (with the new kvstore dist_sync_allreduce). We found the accuracy curves match closely. (GPU: 130 epochs. CPU: 113 epochs.)