This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[MXNET-366]Extend MXNet Distributed Training by AllReduce #10696

Closed
wants to merge 43 commits

Conversation

threeleafzerg

@threeleafzerg threeleafzerg commented Apr 26, 2018

Description

We extend MXNet distributed training by adding a new type of kvstore (dist_sync_allreduce).
In this kvstore there is no parameter server, so we replace the original kvstore APIs push and pull with a single API, pushpull. We also add a broadcast API to broadcast the weights on the root rank to the other nodes. See the API Spec section of the design doc for details. We create a new thread on each node to synchronize and perform allreduce through MPI.
Currently we implement the CPU part and leave a placeholder for GPU. It is easy to extend to GPU with MPI support or NCCL v2 support.

Following is the design doc
https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094
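A minimal usage sketch of the new kvstore (the pushpull signature follows the Python API added in this PR; the broadcast arguments shown are illustrative, see the API Spec in the design doc for the exact signature):

import mxnet as mx

kv = mx.kv.create('dist_sync_allreduce')  # new kvstore type, no parameter server

weight = mx.nd.ones((2, 2))
grad = mx.nd.ones((2, 2))

# Broadcast the initial weight from the root rank to all other nodes
# (argument names here are illustrative, not the final spec).
kv.broadcast(0, weight, root_rank=0)

# Aggregate the gradient across all workers in a single call; the result
# is written into the output array, replacing separate push and pull.
kv.pushpull(0, grad, grad)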

Performance and Accuracy

For performance, our test environment was: machine SKX6148, network 10GbE, topology VGG16, local batch size 64; single-node throughput 27.3 pictures/second.
[image: performance-allreduce]
For accuracy, we compared the top-1 and top-5 accuracy curves between 4 P100 GPUs (with the original kvstore dist_device) and 4 SKX6148 CPUs (with the new kvstore dist_sync_allreduce). The accuracy curves match closely. (GPU: 130 epochs. CPU: 113 epochs.)
[image: resnet-50 accuracy curves]

@threeleafzerg threeleafzerg requested a review from szha as a code owner April 26, 2018 05:36
@threeleafzerg
Author

For PS-Lite, I changed the ps-lite submodule branch to my private fork. When the dist_sync_mpi kvstore is enabled, we depend on ps-lite's Google protobuf for formatted message communication, but the original Google protobuf in ps-lite (v2.5.1) has a bug in message serialization and de-serialization, so I changed the Google protobuf in ps-lite to v3.5.1.

@threeleafzerg
Author

@pengzhao-intel

@chinakook
Contributor

The future of training NNs belongs to the CPU.

@piiswrong
Contributor

@mli

@threeleafzerg threeleafzerg changed the title Extend MXNet Distributed Training by MPI AllReduce [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce Apr 28, 2018
@pengzhao-intel
Contributor

@mli @rahul003
Could you help take the review for this PR?

@rahul003
Member

rahul003 commented May 3, 2018

@threeleafzerg Could you share instructions for building this, and add it as a step to the CI to verify the build? It looks like I need to install MPI separately. Or can we include it in the source as a 3rd-party lib?

@rahul003
Member

rahul003 commented May 3, 2018

About the numbers above: does 'Local Batch Size: 64' refer to 64 across many GPUs? Can you run a benchmark with a higher batch size (e.g. 512 with 8 GPUs or 256 with 4 GPUs, for ResNet-50 on ImageNet) and compare the scalability?

@rahul003
Member

rahul003 commented May 3, 2018

I installed MPI with sudo apt-get install mpich on Ubuntu and set MPI_ROOT to /usr/local/bin/mpi. The build failed saying mpi.h was not found. Am I doing something wrong?

@threeleafzerg
Author

threeleafzerg commented May 4, 2018

@rahul003
The build instructions are as follows:
USE_DIST_KVSTORE = 1
USE_ALLREDUCE_DIST_KVSTORE = 1
MPI_ROOT=/usr/lib/openmpi
We let the end user select which MPI to use (OpenMPI, MPICH, or Intel MPI). That's why we don't include the MPI source as a 3rd-party lib. You can check Horovod; they do the same thing. https://github.com/uber/horovod#install
So the end user needs to install MPI separately.
Can you try the latest OpenMPI? We tried both OpenMPI and Intel MPI; their release dir structures look like the following:
/home/zhouhaiy/openmpi/build
[zhouhaiy@mlt-ace build]$ ls
bin etc include lib share
It looks like the MPICH release dir is not the same as OpenMPI's; I will check.

Certainly, we could also use the following logic:
If the env var MPI_ROOT is set, we use the MPI lib pointed to by that env var; otherwise we download open-source 3rd-party MPI source code, compile it, and have MXNet depend on it.

Which one do you prefer? We need consensus.

@threeleafzerg
Author

@rahul003 Local Batch Size: 64 means every node's batch size is 64, so the global batch size is 64 * 8 = 512.
Currently the results are based on CPU. For ResNet-50 we measured a scaling efficiency close to 99%. Our current implementation covers CPU and leaves a placeholder for GPU.

@threeleafzerg
Author

@rahul003 For MPICH, if you install the Ubuntu package of MPICH directly, its header files and lib files are not in the same subfolder. I suggest downloading MPICH and compiling/building it yourself (https://www.mpich.org/downloads/).
The built MPICH has the same release dir layout as OpenMPI and Intel MPI:
[zhouhaiy@mlt-ace build]$ pwd
/home/zhouhaiy/mpich2/build
[zhouhaiy@mlt-ace build]$ ls
bin include lib share
I already tried it, and MXNet builds with MPICH2.

@rahul003
Member

rahul003 commented May 4, 2018

  • An implementation only for CPU is very restrictive. Are you also planning to implement it for GPU? Are you running into any issues? Scaling efficiency is something we need a lot of work on, as my runs with large numbers of machines are showing. It would be awesome if we could have MPI for GPU as well. I'm happy to offer any help needed to get this done.

  • You mentioned efficiency for ResNet-50, but for what batch size? Are all these training numbers on CPU?

  • Leaving the choice of MPI to the user is good for advanced users, but for a lot of people it adds unnecessary complexity IMO. How do the different MPI frameworks differ? Is one framework more performant in your experience? Should we choose that as the default and provide a way for the user to replace it with their own framework via the environment variable or a make flag? I would really recommend this option.

  • @reminisce This PR tries to change protobuf to proto3. I wanted to draw your attention to that change. Could you review it in light of your plan to move to proto3 for mxboard, etc.?

Also ccing @eric-haibin-lin for reviews

@@ -0,0 +1,827 @@
/*
Member

MXNet has no files with the .cxx extension. Could you change it to .cc?

Author

@threeleafzerg threeleafzerg May 4, 2018

.cxx was chosen on purpose, originally to solve a compilation dependency.
Our mpi_collectives.cxx depends on the generated mpi_message.pb.cc and mpi_message.pb.h to compile.
The Makefile rule SRC = $(wildcard src/*/*/*/*.cc src/*/*/*.cc src/*/*.cc src/*.cc) would pick up mpi_collectives.cc if we didn't change the suffix. $(SRC) has its own compilation rule, and the MPI-related files need a different compilation rule. That's why I used a different suffix.

But I understand your concern; I can use the Makefile built-in function filter-out to remove the MPI files from SRC. I will change the suffix to .cc.


#if DEBUG_ON
#define MXMPI_DEBUG(rank, fmt, args...) \
do { \
Member

Curious, why this do{}while(0) syntax?

Author

I agree with you; it's not necessary. I will remove it.

@threeleafzerg
Author

threeleafzerg commented May 4, 2018

@rahul003
For GPU, I agree with your comment. But the majority of the code in this PR is the infrastructure for adding allreduce into MXNet, which is shared by both CPU and GPU. Currently we leave a placeholder for GPU for future extension. We didn't run into any issue on GPU; we simply enabled CPU first because we currently have a lot of CPU multi-node environments. We can discuss further how to add the GPU extension. @pengzhao-intel Patric will shed more light on it.

For ResNet-50: local batch size 64, global batch size 64 * 8 = 512 (8 machines). Yes, we trained everything on CPU.

In general, allreduce performance should be similar for OpenMPI and MPICH. Intel MPI has better allreduce performance, but it is not free software, though its runtime part is free. I agree with you that we should select OpenMPI as the default MPI if no one objects (we would download the OpenMPI archive into 3rdparty and compile it).

For proto3, I tested the original kvstore type dist_sync and it works fine with PS-Lite. Moreover, we only use protobuf 3.5.1; PS-Lite itself still uses proto2 (we just need to specify its version explicitly).

@threeleafzerg
Author

@rahul003 I have finished the code modifications according to your comments. I added MPICH as the default MPI and verified that it works. Let me know if there are any questions.

@pengzhao-intel
Contributor

@eric-haibin-lin @rahul003 The defined work is now done from our side. Both the functionality and the performance are qualified.

Please help take another review. Feel free to let us know about any open issues.

@eric-haibin-lin eric-haibin-lin self-assigned this May 10, 2018
Member

@eric-haibin-lin eric-haibin-lin left a comment

Is the current CI blocking you from adding unit tests?

.gitmodules Outdated
@@ -6,7 +6,7 @@
url = https://github.com/dmlc/dmlc-core.git
[submodule "3rdparty/ps-lite"]
path = 3rdparty/ps-lite
url = https://github.com/dmlc/ps-lite
url = https://github.com/threeleafzerg/ps-lite.git
Member

Please remember to revert after dmlc/ps-lite#137 is merged.

Author

Certainly.

@@ -339,12 +339,32 @@ ifeq ($(USE_DIST_KVSTORE), 1)
LDFLAGS += $(PS_LDFLAGS_A)
endif

# for kvstore with type dist_sync_mpi
Member

@szha could you help review Makefile?

raise Exception("This api is not supported for kvstore with type %s. \
Please use pushpull instead."%self.type)

def pushpull(self, key, ins, outs, priority=0):
Member

Does this API invoke the updater at all?

Author

When using the dist_sync_mpi kvstore, its gradient update policy is the same as the local kvstore's: it always uses the updater. You can check the set_optimizer API of kvstore in this patch. I hope I didn't misunderstand your meaning.

Member

I guess the answer to Haibin's question is that update_on_kvstore is set to False when using this MPI kvstore.
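A hedged sketch (not the PR's exact Module internals) of the flow being discussed: with update_on_kvstore set to False the kvstore only aggregates gradients, and the optimizer update runs locally on every worker.

import mxnet as mx

kv = mx.kv.create('dist_sync_allreduce')
optimizer = mx.optimizer.create('sgd', learning_rate=0.1)
updater = mx.optimizer.get_updater(optimizer)

# Hypothetical parameter/gradient lists, one NDArray per parameter.
weights = [mx.nd.ones((2, 2))]
grads = [mx.nd.ones((2, 2))]

for i, (w, g) in enumerate(zip(weights, grads)):
    kv.pushpull(i, g, g, priority=-i)  # sum the gradient across workers
    updater(i, g, w)                   # apply the update locally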

@threeleafzerg
Author

@eric-haibin-lin Currently, dist_sync_kvstore.py is added in the nightly test-all.sh, but it is under the MXNet GPU build. We will add dist_sync_mpi_kvstore.py there and run it in CI. In the nightly suite, this test script will not be enabled until GPU is supported.
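A minimal sketch of what such a test could assert (variable names are illustrative, not the PR's actual test script), launched with the usual distributed launcher: every worker pushpulls a tensor of ones and expects back the sum over all workers.

import mxnet as mx
import numpy as np

kv = mx.kv.create('dist_sync_allreduce')
nworker = kv.num_workers

shape = (4, 4)
data = mx.nd.ones(shape)
kv.pushpull(0, data, data)  # allreduce: element-wise sum over all workers
assert np.allclose(data.asnumpy(), nworker * np.ones(shape))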

Makefile Outdated
@@ -382,6 +402,10 @@ else
endif
endif

MPI_SRC = $(wildcard src/mpi_collectives/src/*.cc)
Member

SRC_FILTER is the same.

Author

@threeleafzerg threeleafzerg May 11, 2018

@szha I know that. SRC_FILTER is put here for future extension: maybe in the future another source component under src/ will need a different compilation rule, the same as MPI_SRC, and SRC_FILTER would then look like SRC_FILTER += $(wildcard src/[other-component]/src/*.cc)

Member

@szha szha May 11, 2018

Let's put the simplest possible code that does the job in the codebase, and let the next person worry about how to use different compilation rules for new files. Most likely, if anyone other than you works on this code, your comment here will be nowhere to be found by that contributor. "Potential extension" is often a synonym for tech debt.

Author

OK. Then how about:
MPI_SRC = $(wildcard src/mpi_collectives/src/*.cc)
...
SRC_FILTER = $(MPI_SRC)

Member

@eric-haibin-lin eric-haibin-lin left a comment

@ptrendx @DickJC123 any comments on the reserved interface for CUDA MPI/NCCL in this PR?

virtual void PushPull(const std::vector<int> &keys,
const std::vector<NDArray*> &in_values,
const std::vector<NDArray*> &out_values,
int priority = 0) {}
Member

Shall we add LOG(FATAL) to the default implementation?

Author

OK. I will add them

@@ -166,6 +166,9 @@ def push(self, key, value, priority=0):
There is no synchronization between workers.
One can use ``_barrier()`` to sync all workers.

Note: This api is not supported for kvstore with type dist_sync_mpi. Use pushpull
Member

nit: we can improve it with

:py:meth:`pushpull`

so that the doc renders with nice hyperlinks. See https://mxnet.incubator.apache.org/versions/master/api/python/gluon/gluon.html#mxnet.gluon.ParameterDict.get_constant for example
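For instance, the note in the push docstring could read (a sketch of the suggested wording):

Note: This API is not supported for kvstore with type dist_sync_mpi.
Use :py:meth:`pushpull` instead.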

Author

OK.

* Copyright (c) 2018 by Contributors
*/

#ifndef MXNET_MPI_COLLECTIVES_INCLUDE_MPI_COLLECTIVES_H_
Member

Not sure if we want to create a src/mpi_collectives folder. This is only used by kvstore. Maybe inside the kvstore folder? The namespace is also under mxnet::kvstore.

Author

@threeleafzerg threeleafzerg May 12, 2018

Haibin, do you mean the folder structure should be like
src/kvstore/mpi_collectives?
Since these files are only used by kvstore, that's OK with me.


namespace mxnet {
namespace kvstore {

Member

Although some relevant documentation is added in c_api.h, I still expect header files like this one to come with detailed documentation describing the functionality, arguments, and return values.

Author

OK. I will add doxygen-style comments for these functions.

#include <stdio.h>
#include <vector>

#define DEBUG_ON 0
Member

MPI_UTIL_DEBUG_ON

Author

OK. I will refine it.


template<typename DType>
MPI_Datatype MPI_Data_Type_Cast(void) {
LOG(FATAL) << "Need to template specialization to get mpi data type";
Member

Why do you need a default definition that throws errors at runtime? You already added template specializations. Would a template declaration like the one below be sufficient?

template<typename DType>
MPI_Datatype MPI_Data_Type_Cast(void);

Author

My original point was that if anyone used an unsupported data type, it would be caught by a runtime exception. With your approach, any unsupported data type will be caught at compile time instead.
I'm OK with removing the default implementation.

typedef std::unordered_map<std::string, std::vector<MPIRequest> > MessageTable;

struct MPIGlobalState {
std::atomic_flag initialized_flag = ATOMIC_FLAG_INIT;
Member

Some descriptions would be helpful

Author

I will add it in the next commit.

int priority) override {
int ret = MXMPIAllReduce(keys, in_values, out_values, priority);
if (ret != 0) {
LOG(WARNING) << "MXMPIAllReduce is not successful. ret: " << ret;
Member

Why is it a LOG(WARNING) instead of LOG(FATAL)?

Author

I agree with you.

@threeleafzerg
Author

threeleafzerg commented May 15, 2018

@eric-haibin-lin
Hi Haibin, I have finished the code modifications according to your comments. If there are any questions, please let me know. Thanks!
Note: following our internal code review (Intel), I changed the following names:
dist_sync_mpi -> dist_sync_allreduce
mpi_collectives -> collectives
MPI_Wrapper -> COLL_Wrapper
because the collectives can be implemented with libraries other than MPI (e.g. NCCL).
The corresponding design doc has already been updated:
https://docs.google.com/document/d/1e4anwDiS18cWP49FAghU6tqqdtnRKUcbNJJxvhIfvIA/edit#heading=h.t762l56r1094
Gluon is also supported.
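A hedged sketch of what the Gluon path looks like from the user's side, assuming gluon.Trainer simply passes the kvstore type string through to kvstore creation:

import mxnet as mx
from mxnet import gluon

net = gluon.nn.Dense(10)
net.initialize()
trainer = gluon.Trainer(net.collect_params(), 'sgd',
                        {'learning_rate': 0.1},
                        kvstore='dist_sync_allreduce')  # new kvstore type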

@threeleafzerg threeleafzerg changed the title [MXNET-366]Extend MXNet Distributed Training by MPI AllReduce [MXNET-366]Extend MXNet Distributed Training by AllReduce May 16, 2018
@eric-haibin-lin eric-haibin-lin removed their assignment May 23, 2018
@congxie1108

In kvstore_dist_sync_allreduce.h:

  1. Since the allreduce kvstore is now derived from kvstore_local, you don't need to override push/pull. Simply overriding PushImpl/PullImpl/PullRowSparseImpl should work.
  2. It would be better if you could move the definition of "CheckUnique" from kvstore_dist to kvstore_local and reuse it in the allreduce kvstore; it is unnecessary to copy the same code from kvstore_dist.

@congxie1108

For some custom MPI versions (such as OpenMPI), directly using gcc/g++ may cause problems (though it seems OK for now with MPICH). I strongly recommend adding rules to the Makefile so that when the USE_ALLREDUCE_DIST_KVSTORE flag is on, gcc/g++ is replaced by mpicc/mpiCC or mpiCXX.

@threeleafzerg
Author

@marcoabreu I have finished the code modifications according to your review comments.

@threeleafzerg
Author

threeleafzerg commented Jul 4, 2018

@xcgoner
Thanks for your review comment!

  1. I think it's better to override the direct interfaces (push and pull) explicitly, to warn the end user not to use these interfaces with this allreduce kvstore.
  2. You are right. I have finished the code modification according to your comments.
  3. What problem did you run into with OpenMPI? On my local machine OpenMPI can be used, but the performance is very bad because I don't know how to configure OpenMPI properly.

@congxie1108

@threeleafzerg
Thanks for the quick response.

  1. Explicit overriding makes sense to me given your explanation, though overriding four functions and repeating the same code seems a little redundant to me. But it's OK anyway.
  2. I was trying to use the latest stable version of OpenMPI (version 3.1) instead of the old default version in Ubuntu. There was some kind of link error, which disappeared when I replaced gcc/g++ with mpicc/mpiCC. It seems that if you use the old default version of OpenMPI in Ubuntu (installed via apt-get), gcc/g++ is OK.

@threeleafzerg
Author

@xcgoner I am using OpenMPI (version 3.1) too and I didn't encounter any link error (my Linux distribution is CentOS 7). I tried replacing the default g++ with mpicxx (changing $(CXX)); the build was OK, but one library (libmklml_intel.so) could not be found at runtime. I can work with the MKL-DNN support team on that, but it may not be included in this PR.

BTW: after you built with mpicc and OpenMPI successfully, did the MXNet allreduce kvstore work fine?

@congxie1108

@threeleafzerg
I'm using OpenMPI because I want to experiment with a new feature that is only provided by OpenMPI. I think for this PR you can leave the compilers as gcc/g++. I will start my own PR once I get it done.
OpenMPI seems fine when I use it, but I have not fully tested the performance or compared it with MPICH yet. I will report a follow-up once your PR is completed and my work is done.

@threeleafzerg
Author

@xcgoner
OK. If you have any questions, don't hesitate to post them.

@lupesko
Contributor

lupesko commented Aug 6, 2018

Thanks for the great contribution @threeleafzerg !
@anirudh2290 @eric-haibin-lin - can you guys check this out? (design doc as a reference)

@threeleafzerg
Author

Since MXNet will integrate Horovod into the framework, I have decided to close this PR.

@pengzhao-intel
Contributor

Many thanks for @threeleafzerg's efforts to enable the first version of an AllReduce solution for MXNet 🥇
Actually, the current implementation is partially merged into the Horovod solution, and the new solution will take the advantages of both this PR and Horovod, so it will be very nice for the community 👍

@NanYan1119

I have a problem when testing the allreduce version.

  1. I cloned the master code from https://github.com/threeleafzerg/incubator-mxnet.
  2. I used "make -j $(nproc) USE_OPENCV=1 USE_BLAS=openblas USE_CUDA=1 USE_CUDA_PATH=/usr/local/cuda USE_CUDNN=1 USE_PROFILER=1 USE_DIST_KVSTORE=1 USE_ALLREDUCE_DIST_KVSTORE=1 MPI_ROOT=/usr/local/openmpi" to compile.
  3. But when I run the script ~/incubator-mxnet/tests/nightly/dist_allreduce_sync_kvstore.py, it ends with an error:

pure virtual method called
terminate called without an active exception
*** Process received signal ***
Signal: Aborted (6)
Signal code: (-6)

It seems the error occurs while destructing the kvstore. Can anyone help me? @threeleafzerg

@threeleafzerg
Author

@NanYan1119 I also noticed this bug. I will debug it this weekend.

@threeleafzerg
Author

@NanYan1119 Since Horovod will be the official allreduce solution in MXNet, can you switch to mxnet.horovod? Thanks!

@threeleafzerg
Author

@NanYan1119 For mxnet.horovod details, please consult eric-haibin-lin or ctcyang

@NanYan1119

@threeleafzerg Sorry for the late reply! It's a good idea. Do you know when Horovod for MXNet will be released?

@threeleafzerg
Author

@NanYan1119 You can consult with ctcyang. Their proposal has already been posted on the design wiki:
https://cwiki.apache.org/confluence/display/MXNET/Horovod-MXNet+Integration

@NanYan1119

@threeleafzerg Thank you! I will keep an eye on that.
