
Multi GPU code update #710

Open
wants to merge 11 commits into base: master

Conversation


@ericj974 ericj974 commented Jun 23, 2018

  • Fixed the slow multi-GPU training by instantiating the base model on the CPU.
  • Fixed an input-split issue in multi-GPU mode by reusing the slicing method that Keras proposes by default (see the sketch below).
  • Some slight updates to parallel_model.py to cater for Keras 2.2.0.
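
The overall pattern is the one recommended for keras.utils.multi_gpu_model: build the template model on the CPU, then replicate it across GPUs. A minimal sketch with a toy Keras model (not the actual parallel_model.py code, just an illustration of the idea):

import tensorflow as tf
from keras.layers import Dense, Input
from keras.models import Model
from keras.utils import multi_gpu_model

# Build the template model on the CPU so its weights live in host memory;
# each GPU replica then shares these weights. (Toy model as a stand-in for
# the actual Mask R-CNN graph.)
with tf.device('/cpu:0'):
    inputs = Input(shape=(100,))
    outputs = Dense(10)(inputs)
    base_model = Model(inputs, outputs)

# multi_gpu_model splits each incoming batch into per-GPU sub-batches,
# runs one sub-batch on each GPU, and concatenates the results on the CPU.
parallel_model = multi_gpu_model(base_model, gpus=2)
parallel_model.compile(optimizer='sgd', loss='mse')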

# duplication.
with tf.device(x.device):
    input_shape = K.int_shape(x)[1:]
    slice_i = KL.Lambda(get_slice,
Collaborator

Since x is an input to this Lambda, and the Lambda ops are on one of the GPUs while x is on the CPU (it's an input to the base model), this would require TF to move all of x to the GPU, slice it there, and then discard all but the one slice that's needed on that GPU. So this seems slower to me because it has to move more data to the GPU. What am I missing?

Author

Yes, the idea is to have the split done on the CPU, the data moved to each respective GPU, the processing done on that GPU, and the CPU in charge of concatenating the results. I actually slightly adapted the Keras code (see https://github.com/keras-team/keras/blob/master/keras/utils/multi_gpu_utils.py), for which the comments on multi_gpu_model are useful — roughly the pattern sketched below.
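
For reference, a simplified sketch of that pattern, closely following the Keras multi_gpu_utils code linked above (make_parallel is just an illustrative name here, not the actual parallel_model.py implementation; it assumes num_gpus >= 2):

import tensorflow as tf
from keras import backend as K
from keras.layers import Lambda, concatenate
from keras.models import Model

def get_slice(data, i, parts):
    # Return the i-th of `parts` slices of `data` along the batch axis.
    shape = K.shape(data)
    batch_size, rest = shape[:1], shape[1:]
    step = batch_size // parts
    size = batch_size - step * i if i == parts - 1 else step
    size = K.concatenate([size, rest], axis=0)
    start = K.concatenate([step * i, rest * 0], axis=0)
    return tf.slice(data, start, size)

def make_parallel(model, num_gpus):
    all_outputs = [[] for _ in model.outputs]
    for gpu_id in range(num_gpus):
        with tf.device('/gpu:%d' % gpu_id):
            replica_inputs = []
            for x in model.inputs:
                # Do the slicing on the device where the input lives (the
                # CPU, since the base model was built there), so only the
                # sub-batch is copied to this GPU.
                with tf.device(x.device):
                    out_shape = K.int_shape(x)[1:]
                    slice_i = Lambda(get_slice, output_shape=out_shape,
                                     arguments={'i': gpu_id,
                                                'parts': num_gpus})(x)
                replica_inputs.append(slice_i)
            outputs = model(replica_inputs)
            if not isinstance(outputs, list):
                outputs = [outputs]
            for collected, out in zip(all_outputs, outputs):
                collected.append(out)
    # Merge the per-GPU results back into full batches on the CPU.
    with tf.device('/cpu:0'):
        merged = [concatenate(outs, axis=0) for outs in all_outputs]
    return Model(model.inputs, merged)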

Collaborator

What I was trying to say in my previous comment is that the splitting is already being done on the CPU. The new changes make it happen on the GPU instead, and that's slower. I could be missing something, but it seems to me that the new changes are slower than the current state of the code. I'll read the links you posted below (thanks!)

@waleedka
Collaborator

@ericj974 Thanks! I'm starting to review this, but it's a big PR so it might take me a while. It would help me a lot if you could point me to any resources or discussions that explain the reasoning behind the changes you're proposing.

@ericj974
Author

@waleedka Yes, sorry, I noticed too late that the PR includes more than just the multi-GPU related commits...
Regarding multi-GPU and the reason for initializing the model on the CPU, I was actually following the comments in https://github.com/keras-team/keras/blob/master/keras/utils/multi_gpu_utils.py along with https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/

@maxfrei750
Contributor

TL;DR:
This pull request doesn't seem to work. However, at least two people claim that it does: @pieterbl86 (see #708) and @ericj974 (the author of this PR). Perhaps it comes down to dependencies, so it would be great to examine whether it truly works. If it does work for some, what can we do to make it work for everyone?
------------------------------------------------------

First of all, I'd like to thank @waleedka for this incredibly helpful repo and all the work that has been put into this project already. Secondly, I'd like to thank @ericj974 for the efforts to solve the problems concerning the training on multiple GPUs, because I think that it is the major issue of this repo at the moment.

I'd really like to help solve the problem, but so far I have not even been able to run the code of @ericj974 without making changes to the code or the dependencies.

In order to work on this and provide a reproducible environment, I created a Docker image which should match the setup of @ericj974. It can be pulled via the following command:
docker pull maxfrei750/mrcnn:ericj974

and can be run via the following command:

nvidia-docker run --network=host -e PASSWORD=desiredPassword --rm -it -v /path/to/maskrcnn:/notebooks maxfrei750/mrcnn:ericj974

where you should replace desiredPassword with a password of your choice for the jupyter notebook and replace /path/to/maskrcnn with the path you would like jupyter to use as its base directory.

After the container has started, jupyter can be reached via http://localhost:8888.

The Docker image is based on this Dockerfile (I had to change the file extension to .txt in order to upload it), which in turn is based on this file from the repo of @ericj974. Unfortunately, I had to guess some things like the CUDA version (9) and the OS (Ubuntu 16.04). The GPU driver (in my case 384.130) is passed through to the container from the host system.

@ericj974: It would be great if you could specify the dependencies as precisely as possible, so that I can update the Docker image accordingly.

The following tests were performed on the train_shapes.ipynb file:

As I mentioned before, I can't run the code of @ericj974 without changes.

  1. Using the original code of @ericj974, I encounter issue load_weights_from_hdf5_group_by_name unknown attribute in latest keras #566 when running train_shapes.ipynb. To fix this, either the Keras version has to be downgraded or the pull request Fix Keras engine module topology to saving #662 (sketched below) has to be merged into the repo of @ericj974.

@ericj974: How did you solve this issue?
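
(For reference, as far as I can tell, the change in #662 boils down to an import fallback along these lines in model.py:)

# Newer Keras moved the HDF5 weight-loading helpers from
# keras.engine.topology to keras.engine.saving; this fallback covers both.
try:
    from keras.engine import saving
except ImportError:
    # Keras before 2.2.0
    from keras.engine import topology as saving

# model.py then calls e.g. saving.load_weights_from_hdf5_group_by_name(...)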

  2. After having fixed problem 1, the code does run on a single GPU but is very slow (~12 s/step vs. ~0.5 s/step for the latest version of @waleedka).

  3. When run on 2 GPUs, the behaviour of the code depends on how I fixed problem 1. If I merge Fix Keras engine module topology to saving #662, then Python just dies with the following error stack:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-123>", line 2, in initialize
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 456, in initialize
    self.init_sockets()
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

If I solve problem 1 by downgrading Keras to 2.1.3, I get the following error:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor of shape [7,7,256,1024] and type float
[[Node: training/SGD/zeros_30 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,256,1024] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

I'm not sure whether the OOM error really depends on my hardware (2x GTX 1080 with 8 GB VRAM each). However, even after decreasing the number of images per GPU to 1, the OOM error remains (see the config sketch below).
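
For reference, these are the relevant settings I used in the notebook's ShapesConfig (a sketch only; exact values and the import path depend on the repo version):

from mrcnn.config import Config  # in older checkouts: from config import Config

class ShapesConfig(Config):
    NAME = "shapes"
    # Effective batch size is GPU_COUNT * IMAGES_PER_GPU.
    GPU_COUNT = 2        # train on both GTX 1080 cards
    IMAGES_PER_GPU = 1   # reduced to 1 to rule out a plain out-of-memory case
    # Small images, as in the original shapes example.
    IMAGE_MIN_DIM = 128
    IMAGE_MAX_DIM = 128
    NUM_CLASSES = 1 + 3  # background + 3 shape classes

config = ShapesConfig()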

@ericj974: Are you able to run train_shapes.ipynb with the code you supply in your repo?

I was, however, able to run the code with the configuration given by @pieterbl86 in issue #708:

OS: Ubuntu 16.04
GPU driver: 384.130
tensorflow-gpu: 1.8
Keras: 2.1.6

Therefore I created a docker container for this config as well:
docker pull maxfrei750/mrcnn:pieterbl86

Dockerfile.pieterbl86.txt

Unfortunately, the training was now again slower on 2 GPUs than on a single GPU (~1 s/step vs. ~0.5 s/step), which is basically the same behaviour as the current state of the code of @waleedka.

I tried some other things to make the code work, but so far nothing has resulted in training being faster on 2 GPUs than on a single GPU. I'd really appreciate any help with solving this issue.

Kind regards.

@maxfrei750
Contributor

OK, I was confused about what the time per step actually means and didn't normalize it to the batch size, i.e. GPU_COUNT * IMAGES_PER_GPU (see the sketch below).
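
A quick sketch of that normalization, using the numbers from my previous comment and assuming IMAGES_PER_GPU was identical in both runs:

def time_per_image(time_per_step, gpu_count, images_per_gpu):
    # One Keras "step" processes gpu_count * images_per_gpu images.
    return time_per_step / (gpu_count * images_per_gpu)

# With the earlier measurements and the same IMAGES_PER_GPU (call it n):
#   2 GPUs:  time_per_image(1.0, 2, n)  -> 0.5 / n seconds per image
#   1 GPU:   time_per_image(0.5, 1, n)  -> 0.5 / n seconds per image
# i.e. the same per-image throughput, not an actual slowdown.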

However, #875 (comment) and #875 (comment) show that @ericj974's code is actually faster, but unfortunately only when using multiple GPUs. On a single GPU it is approx. 6 times slower than the original code of @waleedka.

@acv-anvt

acv-anvt commented Oct 9, 2018

@ericj974, I think you should resolve the conflicts so that this can be merged into the master branch.
