-
Notifications
You must be signed in to change notification settings - Fork 11.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Multi GPU code update #710
base: master
Are you sure you want to change the base?
Conversation
ericj974
commented
Jun 23, 2018
•
edited
Loading
edited
- Slow training issue fixed by instantiating the base model on cpu
- There was an input split issue when multi_gpu. This was fixed by reusing the slicing method proposed by default by keras
- Some slight code update for parallel_model.py to cater for keras 2.2.0
- Harmonization of the parameter names for describing shapes (for inputs / outputs)
# duplication. | ||
with tf.device(x.device): | ||
input_shape = K.int_shape(x)[1:] | ||
slice_i = KL.Lambda(get_slice, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since x
is an input to this Lambda, and the Lambda OPs are on one of the GPUs while x
is on the CPU since it's the input to the base model, then this would require TF to move x
to the GPU to slice it there and then discard all but the one slice that's needed on that GPU. So this seems slower to me because it'll have to move more data to the GPU. What am I missing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, idea is to have the split done on CPU, data moved to each resp. GPU, processing on the resp. GPU and the CPU is in charge on concatenating and else. I actually slightly modified keras code (see https://github.com/keras-team/keras/blob/master/keras/utils/multi_gpu_utils.py), for which the comments for multi_gpu_model is useful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I was trying to say in my previous comment is that the splitting is already being done on the CPU. The new changes make it happen on the GPU instead, and that's slower. I could be missing something, but it seems to me that the new changes are slower than the current state of the code. I'll read the links you posted below (thanks!)
@ericj974 Thanks! I'm starting to review this, but it's a big PR so it might take me a while. It would help me a lot if you point me to any resources or discussions that explain the reasoning behind the changes you're proposing. |
@waleedka yes sorry noticed too late that the PR would include more than multi-gpu related commits... |
TL;DR: First of all, I'd like to thank @waleedka for this incredibly helpful repo and all the work that has been put into this project already. Secondly, I'd like to thank @ericj974 for the efforts to solve the problems concerning the training on multiple GPUs, because I think that it is the major issue of this repo at the moment. I'd really like to help solving the problem, but so far, I have not even been able to run the code of @ericj974, without making changes to the code or the dependencies. In order to work on and provide a reproducible environment, I created a docker container, which should match the setup of @ericj974. It can be pulled via the following command: and can be run via the following command:
where you should replace desiredPassword with a password of your choice for the jupyter notebook and replace /path/to/maskrcnn with the path you would like jupyter to use as its base directory. After the container has started, jupyter can be reached via http://localhost:8888. The docker container is based on this Dockerfile (I had to change the filetype to txt, in order to upload it), which in turn is based on this file from the repo of @ericj974. Unfortunately, I had to guess some things like the cuda version (9) and the os (Ubuntu 16.04). The gpu driver (in my case 384.130) is passed through to the container from the host system @ericj974: It would be great, if you could specify the dependencies as precisely as possible, so that I can update the docker image accordingly. The following tests were performed on the train_shapes.ipynb file: As I mentioned before, I can't run the code of @ericj974 without changes.
@ericj974: How did you solve this issue?
If I solve problem 1 by downgrading Keras to 2.1.3 I get the following error:
I'm not shure, if the OOM error really depends on my hardware (2x GTX1080 with 8GB VRAM each). However, after having decreased the number of images per GPU to 1, the OOM error still remains. @ericj974: Are you able to run train_shapes.ipynb with the code you supply in your repo? I was however able to run the code with the configuration given by @pieterbl86 in issue #708.
Therefore I created a docker container for this config as well: Unfortunately, the training was now again slower on 2 GPUs than on a single GPU (~1 s/step vs ~0.5s/step), which is basically the same behaviour like the original state of the code of @waleedka at the moment. I tried some other things to make the code work, but so far nothing did result in the training being faster on 2 gpus than on a single gpu. I'd really appreciate any help concerning the solution of this issue. Kind regards. |
Ok, I was confused about what the time/step actually means and didn't normalize it to the batchsize, i.e. However, #875 (comment) and #875 (comment) show that @ericj974's code is actually faster, if, but unfortunately only if, we are using multiple GPUs. For a single GPU it is approx. 6 times slower than the original code of @waleedka. |
@ericj974 , I think you should resolved the conflicts to be complete merge it to the master branch. |