
Multi GPU code update #710

Open
wants to merge 11 commits into base: master

Conversation


@ericj974 ericj974 commented Jun 23, 2018

  • Fixed the slow multi-GPU training by instantiating the base model on the CPU.
  • Fixed an input-split issue in multi-GPU mode by reusing the slicing method that Keras proposes by default (see the sketch below).
  • Some slight updates to parallel_model.py to cater for Keras 2.2.0.
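
The overall pattern is the one recommended for keras.utils.multi_gpu_model: build the template model on the CPU, then replicate it across GPUs. A minimal sketch with a toy Keras model (not the actual parallel_model.py code, just an illustration of the idea):

import tensorflow as tf
from keras.layers import Dense, Input
from keras.models import Model
from keras.utils import multi_gpu_model

# Build the template model on the CPU so its weights live in host memory;
# each GPU replica then shares these weights. (Toy model as a stand-in for
# the actual Mask R-CNN graph.)
with tf.device('/cpu:0'):
    inputs = Input(shape=(100,))
    outputs = Dense(10)(inputs)
    base_model = Model(inputs, outputs)

# multi_gpu_model splits each incoming batch into per-GPU sub-batches,
# runs one sub-batch on each GPU, and concatenates the results on the CPU.
parallel_model = multi_gpu_model(base_model, gpus=2)
parallel_model.compile(optimizer='sgd', loss='mse')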

# duplication.
with tf.device(x.device):
    input_shape = K.int_shape(x)[1:]
    slice_i = KL.Lambda(get_slice,
Collaborator

Since x is an input to this Lambda, and the Lambda ops are on one of the GPUs while x is on the CPU (it's an input to the base model), this would require TF to move all of x to the GPU, slice it there, and then discard all but the one slice that's needed on that GPU. So this seems slower to me because it has to move more data to the GPU. What am I missing?

Author

Yes, the idea is to have the split done on the CPU, the data moved to each respective GPU, the processing done on that GPU, and the CPU in charge of concatenating the results. I actually slightly adapted the Keras code (see https://github.com/keras-team/keras/blob/master/keras/utils/multi_gpu_utils.py), for which the comments on multi_gpu_model are useful — roughly the pattern sketched below.
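
For reference, a simplified sketch of that pattern, closely following the Keras multi_gpu_utils code linked above (make_parallel is just an illustrative name here, not the actual parallel_model.py implementation; it assumes num_gpus >= 2):

import tensorflow as tf
from keras import backend as K
from keras.layers import Lambda, concatenate
from keras.models import Model

def get_slice(data, i, parts):
    # Return the i-th of `parts` slices of `data` along the batch axis.
    shape = K.shape(data)
    batch_size, rest = shape[:1], shape[1:]
    step = batch_size // parts
    size = batch_size - step * i if i == parts - 1 else step
    size = K.concatenate([size, rest], axis=0)
    start = K.concatenate([step * i, rest * 0], axis=0)
    return tf.slice(data, start, size)

def make_parallel(model, num_gpus):
    all_outputs = [[] for _ in model.outputs]
    for gpu_id in range(num_gpus):
        with tf.device('/gpu:%d' % gpu_id):
            replica_inputs = []
            for x in model.inputs:
                # Do the slicing on the device where the input lives (the
                # CPU, since the base model was built there), so only the
                # sub-batch is copied to this GPU.
                with tf.device(x.device):
                    out_shape = K.int_shape(x)[1:]
                    slice_i = Lambda(get_slice, output_shape=out_shape,
                                     arguments={'i': gpu_id,
                                                'parts': num_gpus})(x)
                replica_inputs.append(slice_i)
            outputs = model(replica_inputs)
            if not isinstance(outputs, list):
                outputs = [outputs]
            for collected, out in zip(all_outputs, outputs):
                collected.append(out)
    # Merge the per-GPU results back into full batches on the CPU.
    with tf.device('/cpu:0'):
        merged = [concatenate(outs, axis=0) for outs in all_outputs]
    return Model(model.inputs, merged)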

Collaborator

What I was trying to say in my previous comment is that the splitting is already being done on the CPU. The new changes make it happen on the GPU instead, and that's slower. I could be missing something, but it seems to me that the new changes are slower than the current state of the code. I'll read the links you posted below (thanks!)

@waleedka
Collaborator

@ericj974 Thanks! I'm starting to review this, but it's a big PR so it might take me a while. It would help me a lot if you could point me to any resources or discussions that explain the reasoning behind the changes you're proposing.

@ericj974
Author

@waleedka Yes, sorry, I noticed too late that the PR includes more than just the multi-GPU related commits...
Regarding multi-GPU and the reason for initializing the model on the CPU, I was actually following the comments in https://github.com/keras-team/keras/blob/master/keras/utils/multi_gpu_utils.py along with https://www.pyimagesearch.com/2017/10/30/how-to-multi-gpu-training-with-keras-python-and-deep-learning/

@maxfrei750
Contributor

TL;DR:
This pull request doesn't seem to work. However, at least two people claim that it does: @pieterbl86 (see #708) and @ericj974 (the author of this PR). Perhaps it comes down to dependencies, so it would be great to examine whether it truly works. If it does work for some, what can we do to make it work for everyone?
------------------------------------------------------

First of all, I'd like to thank @waleedka for this incredibly helpful repo and all the work that has been put into this project already. Secondly, I'd like to thank @ericj974 for the efforts to solve the problems concerning the training on multiple GPUs, because I think that it is the major issue of this repo at the moment.

I'd really like to help solve the problem, but so far I have not even been able to run the code of @ericj974 without making changes to the code or the dependencies.

In order to work on this and provide a reproducible environment, I created a Docker image which should match the setup of @ericj974. It can be pulled via the following command:
docker pull maxfrei750/mrcnn:ericj974

and can be run via the following command:

nvidia-docker run --network=host -e PASSWORD=desiredPassword --rm -it -v /path/to/maskrcnn:/notebooks maxfrei750/mrcnn:ericj974

where you should replace desiredPassword with a password of your choice for the jupyter notebook and replace /path/to/maskrcnn with the path you would like jupyter to use as its base directory.

After the container has started, jupyter can be reached via http://localhost:8888.

The Docker image is based on this Dockerfile (I had to change the file extension to .txt in order to upload it), which in turn is based on this file from the repo of @ericj974. Unfortunately, I had to guess some things like the CUDA version (9) and the OS (Ubuntu 16.04). The GPU driver (in my case 384.130) is passed through to the container from the host system.

@ericj974: It would be great if you could specify the dependencies as precisely as possible, so that I can update the Docker image accordingly.

The following tests were performed on the train_shapes.ipynb file:

As I mentioned before, I can't run the code of @ericj974 without changes.

  1. Using the original code of @ericj974, I encounter issue load_weights_from_hdf5_group_by_name unknown attribute in latest keras #566 when running train_shapes.ipynb. To fix this, either the Keras version has to be downgraded or the pull request Fix Keras engine module topology to saving #662 (sketched below) has to be merged into the repo of @ericj974.

@ericj974: How did you solve this issue?
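
(For reference, as far as I can tell, the change in #662 boils down to an import fallback along these lines in model.py:)

# Newer Keras moved the HDF5 weight-loading helpers from
# keras.engine.topology to keras.engine.saving; this fallback covers both.
try:
    from keras.engine import saving
except ImportError:
    # Keras before 2.2.0
    from keras.engine import topology as saving

# model.py then calls e.g. saving.load_weights_from_hdf5_group_by_name(...)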

  2. After having fixed problem 1, the code does run on a single GPU but is very slow (~12 s/step vs. ~0.5 s/step for the latest version of @waleedka).

  3. When run on 2 GPUs, the behaviour of the code depends on how I fixed problem 1. If I merge Fix Keras engine module topology to saving #662, then Python just dies with the following error stack:

Traceback (most recent call last):
  File "/usr/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel_launcher.py", line 16, in <module>
    app.launch_new_instance()
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 657, in launch_instance
    app.initialize(argv)
  File "<decorator-gen-123>", line 2, in initialize
  File "/usr/local/lib/python3.5/dist-packages/traitlets/config/application.py", line 87, in catch_config_error
    return method(app, *args, **kwargs)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 456, in initialize
    self.init_sockets()
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 238, in init_sockets
    self.shell_port = self._bind_socket(self.shell_socket, self.shell_port)
  File "/usr/local/lib/python3.5/dist-packages/ipykernel/kernelapp.py", line 180, in _bind_socket
    s.bind("tcp://%s:%i" % (self.ip, port))
  File "zmq/backend/cython/socket.pyx", line 547, in zmq.backend.cython.socket.Socket.bind
  File "zmq/backend/cython/checkrc.pxd", line 25, in zmq.backend.cython.checkrc._check_rc
zmq.error.ZMQError: Address already in use

If I solve problem 1 by downgrading Keras to 2.1.3, I get the following error:

ResourceExhaustedError (see above for traceback): OOM when allocating tensor of shape [7,7,256,1024] and type float
[[Node: training/SGD/zeros_30 = Const[dtype=DT_FLOAT, value=Tensor<type: float shape: [7,7,256,1024] values: [[[0 0 0]]]...>, _device="/job:localhost/replica:0/task:0/device:GPU:0"]()]]

I'm not sure whether the OOM error really depends on my hardware (2x GTX 1080 with 8 GB VRAM each). However, even after decreasing the number of images per GPU to 1, the OOM error remains (see the config sketch below).
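
For reference, these are the relevant settings I used in the notebook's ShapesConfig (a sketch only; exact values and the import path depend on the repo version):

from mrcnn.config import Config  # in older checkouts: from config import Config

class ShapesConfig(Config):
    NAME = "shapes"
    # Effective batch size is GPU_COUNT * IMAGES_PER_GPU.
    GPU_COUNT = 2        # train on both GTX 1080 cards
    IMAGES_PER_GPU = 1   # reduced to 1 to rule out a plain out-of-memory case
    # Small images, as in the original shapes example.
    IMAGE_MIN_DIM = 128
    IMAGE_MAX_DIM = 128
    NUM_CLASSES = 1 + 3  # background + 3 shape classes

config = ShapesConfig()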

@ericj974: Are you able to run train_shapes.ipynb with the code you supply in your repo?

I was, however, able to run the code with the configuration given by @pieterbl86 in issue #708:

OS: Ubuntu 16.04
GPU driver: 384.130
tensorflow-gpu: 1.8
Keras: 2.1.6

Therefore I created a docker container for this config as well:
docker pull maxfrei750/mrcnn:pieterbl86

Dockerfile.pieterbl86.txt

Unfortunately, the training was now again slower on 2 GPUs than on a single GPU (~1 s/step vs. ~0.5 s/step), which is basically the same behaviour as the current state of the code of @waleedka.

I tried some other things to make the code work, but so far nothing has resulted in training being faster on 2 GPUs than on a single GPU. I'd really appreciate any help with solving this issue.

Kind regards.

@maxfrei750
Contributor

OK, I was confused about what the time per step actually means and didn't normalize it to the batch size, i.e. GPU_COUNT * IMAGES_PER_GPU (see the sketch below).
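
A quick sketch of that normalization, using the numbers from my previous comment and assuming IMAGES_PER_GPU was identical in both runs:

def time_per_image(time_per_step, gpu_count, images_per_gpu):
    # One Keras "step" processes gpu_count * images_per_gpu images.
    return time_per_step / (gpu_count * images_per_gpu)

# With the earlier measurements and the same IMAGES_PER_GPU (call it n):
#   2 GPUs:  time_per_image(1.0, 2, n)  -> 0.5 / n seconds per image
#   1 GPU:   time_per_image(0.5, 1, n)  -> 0.5 / n seconds per image
# i.e. the same per-image throughput, not an actual slowdown.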

However, #875 (comment) and #875 (comment) show that @ericj974's code is actually faster, but unfortunately only when using multiple GPUs. On a single GPU it is approx. 6 times slower than the original code of @waleedka.

@acv-anvt

acv-anvt commented Oct 9, 2018

@ericj974, I think you should resolve the conflicts so that this can be merged into the master branch.
