
[object detection feature request]: use multiple gpu for training #1972

Closed
chakpongchung opened this issue Jul 17, 2017 · 43 comments

@chakpongchung

System information

  • What is the top-level directory of the model you are using:
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): from pip
  • TensorFlow version (use command below):

('v1.2.0-5-g435cdfc', '1.2.1')

  • Bazel version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory: 1080TI *2
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

I am trying to use two GPUs to speed up training with data parallelism. Does the code have such a feature? Would model parallelism be faster?

@asimshankar
Contributor

@jch1 : Do tweaks need to be made for multi-GPU support? Is that on the cards?

asimshankar added the type:feature and stat:awaiting model gardener (Waiting on input from TensorFlow model gardener) labels on Jul 18, 2017
@jch1
Contributor

jch1 commented Jul 18, 2017

Hi @chakpongchung, @asimshankar - Multi-GPU is already supported, but we don't have documentation for it (and currently don't have the cycles to work on this). It relies on slim's model_deploy package (which is also under tensorflow/models). To control it, you set the num_clones parameter in train.py, but you may have to tweak a few other things, such as queue sizes, to control memory usage.
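For illustration, a minimal sketch of what num_clones controls under the hood. It assumes TF 1.x and that tensorflow/models/research/slim is on PYTHONPATH; the toy model_fn, tensor shapes, and batch numbers are placeholders, not the OD API's actual model-building code.

import tensorflow as tf
from deployment import model_deploy  # slim's model_deploy package

num_clones = 2        # one clone (model replica) per GPU; what --num_clones sets
train_batch_size = 8  # placeholder for the batch_size in the pipeline config

# train.py divides the configured batch size across clones, so each GPU
# processes train_batch_size // num_clones examples per step.
per_clone_batch = train_batch_size // num_clones

deploy_config = model_deploy.DeploymentConfig(
    num_clones=num_clones,
    clone_on_cpu=False,
    num_ps_tasks=1)  # what --ps_tasks sets

def model_fn(batch):
    # Toy stand-in for the detection model: anything that adds a loss to the
    # tf.GraphKeys.LOSSES collection works with model_deploy.
    weights = tf.get_variable('weights', shape=[4, 1])
    predictions = tf.matmul(batch, weights)
    tf.losses.mean_squared_error(tf.zeros_like(predictions), predictions)

with tf.Graph().as_default():
    with tf.device(deploy_config.inputs_device()):
        batch = tf.random_normal([per_clone_batch, 4])
    # One clone (forward pass + loss) is built per GPU device.
    clones = model_deploy.create_clones(deploy_config, model_fn, [batch])
    print('Built %d clones' % len(clones))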

@YanLiang0813

YanLiang0813 commented Jul 20, 2017

I have four GPUs and set num_clones=4, but it seems to use just one GPU, and the speed is the same as with a single GPU. Do I need to set other parameters? @jch1

@insikk

insikk commented Jul 28, 2017

@YanLiang0813 Did you succeed in training on multiple GPUs? I also increased batch_size to 2 with num_clones=2, so each clone gets 1 image. However, there was an error in backpropagation. What other parameters did you change?

@drorhilman

By just changing num_clones=2 I have two GPUs running, with about a 50% speed increase (on Azure K80 GPUs).

@danvass

danvass commented Sep 24, 2017

@drorhilman I'm trying to configure usage of multiple GPUs on Azure N-series (specifically NC with the K80s). However, the GPUs do not seem to have peer-to-peer access. How were you able to get them to communicate with one another in order to use multiple GPUs? I made a forum thread here: https://social.msdn.microsoft.com/Forums/en-US/c81a26b7-3770-4772-acc8-6ef5bd868108/training-neural-network-or-other-machine-learning-model-on-multiple-gpus-using-the-nseries?forum=MachineLearning

@louisquinn

I have:

  • Two 1080Ti's
  • CUDA 8.0 and cuDNN 6
  • Changed num_clones=2 and increased batch_size to 2.

It only trained for one class, and the detection/eval results for that one class were very bad, as in no recognition of the desired objects at all, even after many iterations. I also did not notice any increase in speed; is it safe to assume that the effective number of steps doubles?

I then switched back to just using one GPU and everything trained fine.

@drorhilman, did you end up getting good results?

@ybsave

ybsave commented Nov 1, 2017

I also just set num_clones=2 and changed batch_size to 2 in the config file (originally 1). I got about the same or a slightly slower speed in steps per second. This meets my expectation; I suppose the speed should be slightly slower, as the batch size also doubled.

I also don't understand how @drorhilman got a 50% speed increase. If batch_size stays at 1, using two GPUs would be meaningless, right? And if it is set to 2, the speed should be slightly slower, right?

@wendal

wendal commented Nov 2, 2017

my way:

python3 train.py  --logtostderr  \
          --pipeline_config_path=model/ssd_inception_v2.config \
          --train_dir=model/train \
          --num_clones=2 --ps_tasks=1

I have two 1080 Ti's.

PS: you have to add ps_tasks.

(tensorflow) root@odte:/opt/swiper.model/labs/v43# nvidia-smi
Thu Nov  2 12:28:26 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 48%   68C    P2   218W / 250W |  10798MiB / 11171MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 38%   64C    P2   235W / 250W |  10796MiB / 11172MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1076      G   /usr/lib/xorg/Xorg                            60MiB |
|    0      1629      C   python3                                    10723MiB |
|    1      1629      C   python3                                    10783MiB |
+-----------------------------------------------------------------------------+

@chenyuZha

@wendal In your case, is batch_size 1 or 2? When I set batch_size=2 (my TF record is about 9 GB), I get an OOM error immediately, but it works well when I use just one GPU.

@wendal

wendal commented Nov 9, 2017

@chakpongchung I use two GPUs, so num_clones=2, with:

  • ssd_inception_v2 batch_size=16
  • faster_rcnn_resnet101 batch_size=8

Which GPU are you using?

@wendal

wendal commented Nov 9, 2017

My train.tfrecord is about 5.5 GB.

BTW, I don't run eval on the same machine; it slows down the training.

@zihuaweng

zihuaweng commented Dec 7, 2017

I got an error running the command that @wendal offered under version 1.2, but after updating to version 1.4 it works fine for me.

@woody-kwon

@wendal Thanks for your comments. I solved the 'dual GPU' issue using them.

@obendidi

@woody-kwon Can you share how you solved it? I couldn't find any 'dual GPU' issue. Thank you.

@woody-kwon

I solved the "dual GPU" issue using the options --num_clones=2 --ps_tasks=1.
Please refer to wendal's comment above.

@davidblumntcgeo

This is incredibly helpful, thank you @wendal and others!

Has anyone successfully used multiple GPUs with the OD API running in the cloud on GCP ML Engine? If so, how do you set num_clones and ps_tasks in order to use the GPUs on the master and all of the worker machines in the cluster? (Do these arguments just affect an individual machine, or the total number of GPUs in a cluster?)

Also, if you were successful, what TF runtime were you using, and did you have to make any other special mods to the code to fix other bugs (several of which have been reported)?

@a819721810

Can 2 GPUs for Faster R-CNN use a batch size of 4? @wendal

@wendal

wendal commented May 17, 2018

@a819721810 It depends on your GPU memory. If the batch size is too high, TF will throw an OOM error.

@spk921

spk921 commented Jul 13, 2018

@wendal I can't run multi-GPU training with faster_rcnn_resnet101_pets.config. Have you tried it with Faster R-CNN?

Traceback (most recent call last):
  File "object_detection/train.py", line 183, in <module>
    tf.app.run()
  File "/home/sangpilk/pyenv/py27/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 179, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/sangpilk/git/dash_net/research/object_detection/trainer.py", line 287, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/home/sangpilk/git/dash_net/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/home/sangpilk/git/dash_net/research/object_detection/trainer.py", line 179, in _create_losses
    train_config.use_multiclass_scores)
ValueError: need more than 0 values to unpack

@davidblumntcgeo

@spk921, I've seen "ValueError: need more than 0 values to unpack" when I've had mistakes in my config file that prevented the input dictionary from being populated with labelled images and loaded into GPU memory. A few things to check (this list isn't exhaustive):

  • All your file paths are correct (e.g. TF records, label maps).
  • Your num_clones argument equals your number of GPUs and ps_tasks is set to 1.
  • The batch size in your config file is a multiple of num_clones.
  • You have a valid training TF record containing labelled data, it really is at the path where you think it is, and TF can access that path.
  • The labelled data in your training TF record are the same size as the parameters in your config file imply.
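For illustration, one quick way to rule out an empty or mis-pathed record file is to count the examples in it and peek at the feature keys. A minimal sketch assuming TF 1.x; the record path is a hypothetical placeholder, not a path from this thread:

import tensorflow as tf

# Hypothetical path; use the input_path from your pipeline config's
# train_input_reader section.
record_path = '/data/train.record'

count = 0
for _ in tf.python_io.tf_record_iterator(record_path):
    count += 1
print('%d examples found in %s' % (count, record_path))

# Peek at the first example's feature keys to confirm the record was written
# with the expected fields (image/encoded, image/object/bbox/*, etc.).
for serialized in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example()
    example.ParseFromString(serialized)
    print(sorted(example.features.feature.keys()))
    break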

@spk921

spk921 commented Jul 13, 2018

@davidblumntcgeo Thank you for the info. I am using 4 GPUs; should ps_tasks still be 1? Also, could you give me more detail about "labelled data in your training TF record are the same size as the parameters"? I am running 3 classes.

@wendal

wendal commented Jul 15, 2018

Maybe the learning rate is too high; try lowering it.

@austinmw

For other models --num_clones=4 --ps_tasks=1 works well for me, but with the NASNet model this doesn't work. Does anyone know what parameters I need for this model on a single machine with 4 GPUs?

@spk921

spk921 commented Jul 17, 2018

@wendal Thanks, I solved the problem by changing the target class number.
However, what is sha256?
'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
Does this make fetching data faster?

@wendal

wendal commented Jul 17, 2018

Are you using non-ASCII characters?

@spk921

spk921 commented Jul 28, 2018

@wendal So does "sha256" make data fetching faster? The instructions for custom dataset creation don't mention sha256, but the sample TFRecord code has it. What is 'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')) for?
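For illustration: in the OD API's sample dataset-creation scripts, that key is simply a SHA-256 digest of the encoded image bytes, used as a unique identifier for the image; it does not speed up data fetching. A minimal sketch assuming TF 1.x, the object_detection package on PYTHONPATH, and a hypothetical image path:

import hashlib
import tensorflow as tf
from object_detection.utils import dataset_util

# Hypothetical image path; in the sample scripts this comes from the dataset.
with tf.gfile.GFile('/data/images/example.jpg', 'rb') as fid:
    encoded_jpg = fid.read()

# The key is just a SHA-256 hash of the raw image bytes.
key = hashlib.sha256(encoded_jpg).hexdigest()

feature = {
    'image/encoded': dataset_util.bytes_feature(encoded_jpg),
    'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))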

@karansomaiah

karansomaiah commented Aug 6, 2018

Hello everyone,
The num_clones and ps_tasks options are available for the older versions, where "train.py" is used for training the models. However, the newer versions use "model_main.py", and the flags defined in that file do not include num_clones and ps_tasks. Does anybody know how to specify these when running "model_main.py"?

Edits:
I did a bit of searching and found this link useful. I think it is still an open issue and will be addressed soon. For the time being, num_clones and ps_tasks with train.py is the only option. Please do mention if anyone has found a workaround for this.
Mentioned in this link

@densombs

Hey everyone,
I set num_clones and ps_tasks for my train.py run as described above, yet I get an error that I can't seem to get my head around.

TypeError: cluster must be a dictionary mapping one or more job names to lists of network adresses, or a 'ClusterDef' protocol buffer

I am using TF-GPU 1.8. The training works fine if the above settings are not used. Batch size in faster_rcnn_inception_v2_pets.config is set to 2.

Sorry for any mistakes; so far I have found an answer to all my issues, so this is my first post.
Thank you!

@iampj121

Hi all,
I am training an object detection model using the TensorFlow Object Detection API on my CPU. Though it takes more time, it gives appropriate results. How can I distribute the task so that computation time and load are reduced? Or can anyone tell me how to connect CPUs in a distributed way?
Thanks in advance

@nathanaherne

This is incredibly helpful, thank you @wendal and others!

Has anyone successfully used multiple GPUs with the OD API running in the cloud on GCP ML Engine? If so, how do you set num_clones and ps_tasks in order to use the GPUs on the master and all of the worker machines in the cluster? (Do these arguments just affect an individual machine, or the total number of GPUs in a cluster?)

Also, if you were successful, what TF runtime were you using, and did you have to make any other special mods to the code to fix other bugs (several of which have been reported)?

@davidblumntcgeo did you ever find a solution for running the OD API with multiple GPUs on GCP ML Engine? I am having out-of-memory issues training faster_rcnn_resnet101, and I would like to solve this by using more than one GPU.

@davidblumntcgeo

@nathanaherne, yes, I did. I used runtime 1.6. The only code mods I had to make were those described in #2739 by @andersskog (I haven't cloned the OD API repo recently, and it's possible these fixes have been incorporated in a recent commit). Otherwise, I followed the advice of @wendal in this issue and used a small batch size (num_GPUs x 1 for the Faster R-CNN models). Finally, I made other tweaks to the pipeline config file to reduce memory usage, as described in #1817 by @derekjchow and others. Eventually I shook the OOM error.

@nathanaherne

nathanaherne commented Nov 6, 2018

@davidblumntcgeo thank you for responding to my question so soon.

I will make the recommended changes and see how it goes running on multiple GPUs. I am using runtime version 1.9.

@liangxiao05
Contributor

@wendal Hi, although you can use '--num_clones=2 --ps_tasks=1' to train with two GPUs, does it really speed up the training?
In my tests, I got almost the same results as @gustavomr in #1428:

1. one GPU and batch size = 32
Result: ~ 0.11 (steps/sec)

2. one GPU and batch size = 64
Result: ~ 0.21 (steps/sec)

3. two GPUs and batch size = 32
Result: ~ 0.11 (steps/sec)

4. two GPUs and batch size = 64
Result: ~ 0.21 (steps/sec)

1 vs 3: the same batch size means the same images are used every step, i.e. the same training time for 1 epoch. Two GPUs don't speed up the training, which is confusing!
@jch1 @pkulzc @tfboyd can you share more guidance? Thanks!

@raudipra

Hi @liangxiao05, I think you have to read this.

Using multiple GPUs will not speed up your training by itself; to speed it up you need to recalibrate your steps or epochs based on how many GPUs you have.
Referring to your case, what really happens is:
Case 1 : will process 32 images / 0.11 sec
Case 3 : will process 32 images * 2 gpu / 0.11 sec

@liangxiao05
Contributor

liangxiao05 commented Jan 17, 2019

Referring to your case, what really happens is:
Case 1 : will process 32 images / 0.11 sec
Case 3 : will process 32 images * 2 gpu / 0.11 sec

@raudipra, I use the Object Detection API to train my models.
In this API, the number of images per GPU is batch_size / num_gpus.
Case 3: processes a total of 32 images (2 GPUs), that is 16 images (per GPU) / 0.11 sec.
So 2 GPUs seem to make no difference compared to one GPU?

@raudipra

raudipra commented Jan 17, 2019

@liangxiao05 Where did you find information that says "the images per GPU are batch_size/gpu_numbers"?
AFAIK, based on the replication and distributed TensorFlow implementation, the number of images per GPU is the same as the batch size, so using two GPUs gives you twice the number of images trained in the same period of time.

@liangxiao05
Contributor

@raudipra I used to think the TensorFlow implementation worked the way you said, however it doesn't. You can see it in the released code:

batch_size = train_config.batch_size // num_clones

If you use 2 GPUs to train and set num_clones=2, every GPU clone processes half of batch_size images. Also, you can visualize the training on TensorBoard, and you will see that the data input to each clone is half the batch size.
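To make the arithmetic concrete, a small back-of-the-envelope sketch using the per-clone split quoted above and the step rates reported earlier in this thread (illustrative numbers only, not new measurements):

# train.py splits the configured batch across clones:
#   per-clone batch = train_config.batch_size // num_clones
# so total throughput is just batch_size * steps_per_sec, independent of how
# many clones share that batch.

def images_per_sec(batch_size, steps_per_sec):
    return batch_size * steps_per_sec

print(images_per_sec(32, 0.11))  # case 1: one GPU,  batch 32          -> ~3.5 images/sec
print(images_per_sec(32, 0.11))  # case 3: two GPUs, batch 32 (16/GPU) -> identical

The identical result for cases 1 and 3 is exactly the observation above: with the same global batch size, adding a clone only shrinks the per-GPU batch, so throughput only changes if batch_size changes as well.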

@raudipra

@liangxiao05 Interesting. I will experiment and do some exploration first; I will let you know if I find something.

@wlongxiang

Any progress with this feature?

@lighTQ

lighTQ commented Mar 19, 2019

@chakpongchung I use two GPUs, so num_clones=2, with:

  • ssd_inception_v2 batch_size=16
  • faster_rcnn_resnet101 batch_size=8

Which GPU are you using?

The Faster R-CNN default batch_size is 1; did you change something so that it worked?

@skaldesh

skaldesh commented Jul 5, 2019

Hello, is multi-GPU now supported when using model_main.py?

@tensorflowbutler
Member

Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.
