[object detection feature request]: use multiple gpu for training #1972
Comments
@jch1: Do tweaks need to be made for multi-GPU support? Is that on the cards?
Hi @chakpongchung, @asimshankar. Multi-GPU is already supported, but we don't have documentation for it (and currently don't have the cycles to work on it). It relies on slim's model_deploy package (which is also under tensorflow/models); to control it, you set the num_clones parameter in train.py. You may also have to tweak a few other things, such as queue sizes, to control memory usage.
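The flags described above can be sketched as a command line. This is only a sketch: the config and train-dir paths below are hypothetical placeholders, and ps_tasks is the extra flag that commenters later in this thread found necessary.

```python
# Sketch of a multi-GPU launch of the legacy train.py.
# Assumptions: one clone per physical GPU; paths are placeholders.
NUM_GPUS = 2

cmd = [
    "python", "object_detection/train.py",
    "--pipeline_config_path=path/to/pipeline.config",  # placeholder path
    "--train_dir=path/to/train_dir",                   # placeholder path
    f"--num_clones={NUM_GPUS}",  # one training clone per GPU
    "--ps_tasks=1",              # needed for single-machine multi-GPU
]
print(" ".join(cmd))
```

Running the printed command (with real paths) is what the later comments in this thread report doing.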
I have four GPUs and set num_clones=4, but it seems to use just one GPU, and the speed is the same as with one GPU. Should I set other parameters? @jch1
@YanLiang0813 Did you succeed in training on multiple GPUs? I also increased batch_size to 2 with num_clones=2, so each clone gets 1 image. However, there was an error in backpropagation. Which other parameters did you change?
By just changing num_clones=2 I have two GPUs running, with about a 50% speed increase (on Azure K80 GPUs).
@drorhilman I'm trying to configure usage of multiple GPUs on Azure N-series (specifically NC with the K80s). However, the GPUs seem not to have peer-to-peer access. How were you able to get them to communicate with one another in order to use multiple GPUs? I made a forum thread here: https://social.msdn.microsoft.com/Forums/en-US/c81a26b7-3770-4772-acc8-6ef5bd868108/training-neural-network-or-other-machine-learning-model-on-multiple-gpus-using-the-nseries?forum=MachineLearning
I have:
It only trained for one class, and the detection/eval results for that one class were very bad: no recognition of the desired objects at all, even after many iterations. I also did not notice any increase in speed; is it safe to assume that the effective number of steps doubles? I then switched back to just one GPU and everything trained fine. @drorhilman, did you end up getting good results?
I also just set num_clones=2 and changed batch_size to 2 in the config file (originally 1). I got about the same or a slightly slower number of steps per second. This meets my expectation: the per-step speed should be a little slower, since the batch size also doubled. I don't understand how @drorhilman got a 50% speed increase. If batch_size stays at 1, using two GPUs would be meaningless, right? And if it is set to 2, each step should be a little slower, right?
My way:
I had two 1080 Ti cards. PS: you have to add ps_tasks.
@wendal In your case, the
@chakpongchung I use two GPUs, so num_clones=2,
Which GPU are you using?
My train.tfrecord is about 5.5 GB. BTW, I don't run eval on the same machine; it slows down the training.
I got an error running the command @wendal offered under TF 1.2, but updating to 1.4 works fine for me.
@wendal Thanks for your comments. I solved the 'dual GPU' issue.
@woody-kwon Can you share how you solved it? I didn't find any 'dual gpu' issue. Thank you.
I solved the 'dual GPU' issue using the options --num_clones=2 --ps_tasks=1.
This is incredibly helpful, thank you @wendal and others! Has anyone successfully used multiple GPUs with the OD API running in the cloud on GCP ML Engine? If so, how do you set num_clones and ps_tasks in order to use the GPUs on the master and all of the worker machines in the cluster? (Do these arguments just affect an individual machine, or the total number of GPUs in a cluster?) Also, if you were successful, which TF runtime were you using, and did you have to make any other special mods to the code to fix other bugs (several of which have been reported)?
Can 2 GPUs for Faster R-CNN use a batch size of 4? @wendal
@a819721810 It depends on your GPU memory. If the batch size is too high, TF will throw an OOM error.
@wendal I can't run multi-GPU with faster_rcnn_resnet101_pets.config. Have you tried it with Faster R-CNN? Traceback (most recent call last):
@spk921, I've seen the "ValueError: need more than 0 values to unpack" when I've had mistakes in my config file that prevented the input dictionary from being populated with labelled images and loaded into GPU memory. Check that:
- all your file paths are correct (e.g. TF records, label maps);
- your num_clones argument equals your number of GPUs, and your ps_tasks argument is set to 1;
- the batch size in your config file is a multiple of num_clones;
- you have a valid training TF record containing labelled data, it really is at the path where you think it is, and TF can access that path;
- the labelled data in your training TF record are the same size as the parameters in your config file imply.
This list isn't exhaustive.
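The mechanical parts of the checklist above can be sketched as a small pre-flight helper. This is only a sketch under the assumptions in this comment; the function name and arguments are hypothetical, not part of the OD API.

```python
import os

def preflight(batch_size, num_clones, ps_tasks, record_path):
    """Raise if an obviously inconsistent multi-GPU setup is detected.

    Hypothetical helper mirroring the checklist in this thread:
    ps_tasks should be 1 on a single machine, batch_size should be a
    multiple of num_clones, and the training TF record must exist.
    """
    if ps_tasks != 1:
        raise ValueError("ps_tasks should be 1 for single-machine multi-GPU")
    if batch_size % num_clones != 0:
        raise ValueError("batch_size must be a multiple of num_clones")
    if not os.path.exists(record_path):
        raise FileNotFoundError(f"training TF record not found: {record_path}")
```

A check like this would not catch label-map mistakes or image-size mismatches, which still require inspecting the config and records by hand.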
@davidblumntcgeo Thank you for the info. I am using 4 GPUs; should ps_tasks still be 1? Also, could you give me more detail on "labelled data in your training TF record are the same size as the parameters"? I am running 3 classes.
The learning rate may be too high; try lowering it.
For other models
@wendal Thanks, I solved the problem by changing the target class number.
Are you using non-ASCII characters?
@wendal So does "sha256" make data fetching faster? The instructions for custom dataset creation did not mention sha256, but the sample TFRecord code has it. What is 'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')), for?
Hello everyone, Edits:
Hey everyone, I get the following error:
TypeError: cluster must be a dictionary mapping one or more job names to lists of network addresses, or a 'ClusterDef' protocol buffer
I am using TF for GPU 1.8. The training works fine if the above settings in train.py are not made. Batch size in faster_rcnn_inception_v2_pets.config is set to 2. Sorry for any mistakes; so far I have found an answer to all my issues, so this is my first post.
hi all, |
@davidblumntcgeo did you ever find a solution to running OD API using multiple GPU on GCP ML Engine? I am having out of memory issues training on faster_rcnn_resnet101 and I would like to solve this by using more than 1 GPU. |
@nathanaherne, yes, I did. I used runtime 1.6. The only code mods I had to make were those described in #2739 by @andersskog (I haven't cloned the OD API repo recently, and it's possible these fixes were incorporated in a recent commit). Otherwise, I followed the advice of @wendal in this issue and used a small batch size (num_GPUs x 1 for the Faster R-CNN models). Finally, I made other tweaks to the pipeline config file to reduce memory usage, as described in #1817 by @derekjchow and others. Eventually I shook the OOM error.
@davidblumntcgeo Thank you for responding to my question so soon. I will make the recommended changes and see how it goes running on multiple GPUs. I am using runtime version 1.9.
@wendal Hi, although you can use '--num_clones=2 --ps_tasks=1' to train with two GPUs, does it really speed up the training?
One GPU vs. two: the same batch size means the same images are used every step, i.e. the same number of steps for one epoch. So two GPUs don't speed up the training; that's confusing!
Hi @liangxiao05, I think you have to read this. Using multiple GPUs by itself will not speed up your training; to speed it up you need to recalibrate your steps or epochs based on how many GPUs you have.
@raudipra, I use the Object Detection API to train my models.
@liangxiao05 Where did you find information saying that the images per GPU are batch_size/gpu_numbers?
@raudipra I used to think the TensorFlow implementation worked the way you said; however, it doesn't, as you can see in the released code.
If you use 2 GPUs to train, set num_clones=2, and every GPU clone processes half of the batch_size images. Also, you can visualize the training on TensorBoard, and you will see that each clone's data input is half the batch size.
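The split described above can be written out as simple arithmetic. This is an illustrative sketch; per_clone_batch is a hypothetical helper, not part of the OD API.

```python
# Sketch of the clone batch split: with num_clones clones, each clone
# receives batch_size / num_clones images per training step.
def per_clone_batch(batch_size, num_clones):
    assert batch_size % num_clones == 0, "batch_size must divide evenly"
    return batch_size // num_clones

print(per_clone_batch(4, 2))  # prints 2: each of 2 GPUs sees 2 images per step
```

This is also why several commenters in this thread recommend making batch_size a multiple of num_clones.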
@liangxiao05 Interesting. I will try experimenting and do some exploration first; I will let you know if I get something.
Any progress with this feature? |
Faster R-CNN's default batch_size is 1; did you change something so that it worked?
Hello, is multi-gpu now supported when using model_main.py? |
Hi There, |
System information
('v1.2.0-5-g435cdfc', '1.2.1')
You can collect some of this information using our environment capture script:
https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh
You can obtain the TensorFlow version with
python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"
I am trying to use two GPUs to speed up the training with data parallelism; does the code have such a feature? Would model parallelism be faster?