
[object detection feature request]: use multiple gpu for training #1972

Closed
chakpongchung opened this issue Jul 17, 2017 · 43 comments

@chakpongchung

System information

  • What is the top-level directory of the model you are using:
  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow):
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04):Linux Ubuntu 16.04
  • TensorFlow installed from (source or binary): from pip
  • TensorFlow version (use command below):

('v1.2.0-5-g435cdfc', '1.2.1')

  • Bazel version (if compiling from source):
  • CUDA/cuDNN version:
  • GPU model and memory: 1080TI *2
  • Exact command to reproduce:

You can collect some of this information using our environment capture script:

https://github.com/tensorflow/tensorflow/tree/master/tools/tf_env_collect.sh

You can obtain the TensorFlow version with

python -c "import tensorflow as tf; print(tf.GIT_VERSION, tf.VERSION)"

I am trying to use two GPUs to speed up training with data parallelism. Does the code have such a feature? Would model parallelism be faster?

@asimshankar
Contributor

@jch1 : Do tweaks need to be made for multi-GPU support? Is that on the cards?

asimshankar added the type:feature and stat:awaiting model gardener (Waiting on input from TensorFlow model gardener) labels on Jul 18, 2017
@jch1
Contributor

jch1 commented Jul 18, 2017

Hi @chakpongchung, @asimshankar - Multi-GPU is already supported, but we don't have documentation for it (and currently don't have the cycles to work on this). It relies on slim's model_deploy package (which is also under tensorflow/models). To control it, you set the num_clones parameter in train.py, but you may have to tweak a few other things, such as queue sizes, to control memory usage.
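For illustration, a minimal sketch of what num_clones controls under the hood. It assumes TF 1.x and that tensorflow/models/research/slim is on PYTHONPATH; the toy model_fn, tensor shapes, and batch numbers are placeholders, not the OD API's actual model-building code.

import tensorflow as tf
from deployment import model_deploy  # slim's model_deploy package

num_clones = 2        # one clone (model replica) per GPU; what --num_clones sets
train_batch_size = 8  # placeholder for the batch_size in the pipeline config

# train.py divides the configured batch size across clones, so each GPU
# processes train_batch_size // num_clones examples per step.
per_clone_batch = train_batch_size // num_clones

deploy_config = model_deploy.DeploymentConfig(
    num_clones=num_clones,
    clone_on_cpu=False,
    num_ps_tasks=1)  # what --ps_tasks sets

def model_fn(batch):
    # Toy stand-in for the detection model: anything that adds a loss to the
    # tf.GraphKeys.LOSSES collection works with model_deploy.
    weights = tf.get_variable('weights', shape=[4, 1])
    predictions = tf.matmul(batch, weights)
    tf.losses.mean_squared_error(tf.zeros_like(predictions), predictions)

with tf.Graph().as_default():
    with tf.device(deploy_config.inputs_device()):
        batch = tf.random_normal([per_clone_batch, 4])
    # One clone (forward pass + loss) is built per GPU device.
    clones = model_deploy.create_clones(deploy_config, model_fn, [batch])
    print('Built %d clones' % len(clones))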

@YanLiang0813

YanLiang0813 commented Jul 20, 2017

I have four GPUs and set num_clones=4, but it seems to use just one GPU, and the speed is the same as with a single GPU. Do I need to set other parameters? @jch1

@insikk

insikk commented Jul 28, 2017

@YanLiang0813 Did you succeed in training on multiple GPUs? I also increased batch_size to 2 with num_clones=2, so each clone gets 1 image. However, there was an error in backpropagation. What other parameters did you change?

@drorhilman

By just changing num_clones=2 I have two GPUs running, with about a 50% speed increase (on Azure K80 GPUs).

@danvass

danvass commented Sep 24, 2017

@drorhilman I'm trying to configure usage of multiple GPUs on Azure N-series (specifically NC with the K80s). However, the GPUs do not seem to have peer-to-peer access. How were you able to get them to communicate with one another in order to use multiple GPUs? I made a forum thread here: https://social.msdn.microsoft.com/Forums/en-US/c81a26b7-3770-4772-acc8-6ef5bd868108/training-neural-network-or-other-machine-learning-model-on-multiple-gpus-using-the-nseries?forum=MachineLearning

@louisquinn

I have:

  • Two 1080Ti's
  • CUDA 8.0 and cuDNN 6
  • Changed num_clones=2 and increased batch_size to 2.

It only trained for one class, and the detection/eval results for that one class were very bad, as in no recognition of the desired objects at all, even after many iterations. I also did not notice any increase in speed; is it safe to assume that the effective number of steps doubles?

I then switched back to just using one GPU and everything trained fine.

@drorhilman, did you end up getting good results?

@ybsave

ybsave commented Nov 1, 2017

I also just set num_clones=2 and changed batch_size to 2 in the config file (originally 1). I got about the same or a slightly slower speed in steps per second. This meets my expectation; I suppose the speed should be slightly slower, as the batch size also doubled.

I also don't understand how @drorhilman got a 50% speed increase. If batch_size stays at 1, using two GPUs would be meaningless, right? And if it is set to 2, the speed should be slightly slower, right?

@wendal

wendal commented Nov 2, 2017

my way:

python3 train.py  --logtostderr  \
          --pipeline_config_path=model/ssd_inception_v2.config \
          --train_dir=model/train \
          --num_clones=2 --ps_tasks=1

I have two 1080 Ti's.

PS: you have to add ps_tasks.

(tensorflow) root@odte:/opt/swiper.model/labs/v43# nvidia-smi
Thu Nov  2 12:28:26 2017
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 384.90                 Driver Version: 384.90                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108...  Off  | 00000000:01:00.0  On |                  N/A |
| 48%   68C    P2   218W / 250W |  10798MiB / 11171MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+
|   1  GeForce GTX 108...  Off  | 00000000:02:00.0 Off |                  N/A |
| 38%   64C    P2   235W / 250W |  10796MiB / 11172MiB |     97%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      1076      G   /usr/lib/xorg/Xorg                            60MiB |
|    0      1629      C   python3                                    10723MiB |
|    1      1629      C   python3                                    10783MiB |
+-----------------------------------------------------------------------------+

@chenyuZha

@wendal In your case, is batch_size 1 or 2? When I set batch_size=2 (my TF record is about 9 GB), I get an OOM error immediately, but it works well when I use just one GPU.

@wendal

wendal commented Nov 9, 2017

@chakpongchung I use two GPUs, so num_clones=2, with:

  • ssd_inception_v2 batch_size=16
  • faster_rcnn_resnet101 batch_size=8

Which GPU are you using?

@wendal

wendal commented Nov 9, 2017

My train.tfrecord is about 5.5 GB.

BTW, I don't run eval on the same machine; it slows down the training.

@zihuaweng

zihuaweng commented Dec 7, 2017

I got an error running the command that @wendal offered under version 1.2, but after updating to version 1.4 it works fine for me.

@woody-kwon

@wendal Thanks for your comments. I solved the 'dual GPU' issue using them.

@obendidi

@woody-kwon Can you share how you solved it? I couldn't find any 'dual GPU' issue. Thank you.

@woody-kwon

I solved the "dual GPU" issue using the options --num_clones=2 --ps_tasks=1.
Please refer to wendal's comment above.

@davidblumntcgeo

This is incredibly helpful, thank you @wendal and others!

Has anyone successfully used multiple GPUs with the OD API running in the cloud on GCP ML Engine? If so, how do you set num_clones and ps_tasks in order to use the GPUs on the master and all of the worker machines in the cluster? (Do these arguments just affect an individual machine, or the total number of GPUs in a cluster?)

Also, if you were successful, what TF runtime were you using, and did you have to make any other special mods to the code to fix other bugs (several of which have been reported)?

@a819721810

Can 2 GPUs for Faster R-CNN use a batch size of 4? @wendal

@wendal

wendal commented May 17, 2018

@a819721810 It depends on your GPU memory. If the batch size is too high, TF will throw an OOM error.

@spk921

spk921 commented Jul 13, 2018

@wendal I can't run multi-GPU training with faster_rcnn_resnet101_pets.config. Have you tried it with Faster R-CNN?

Traceback (most recent call last):
  File "object_detection/train.py", line 183, in <module>
    tf.app.run()
  File "/home/sangpilk/pyenv/py27/local/lib/python2.7/site-packages/tensorflow/python/platform/app.py", line 126, in run
    _sys.exit(main(argv))
  File "object_detection/train.py", line 179, in main
    graph_hook_fn=graph_rewriter_fn)
  File "/home/sangpilk/git/dash_net/research/object_detection/trainer.py", line 287, in train
    clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  File "/home/sangpilk/git/dash_net/research/slim/deployment/model_deploy.py", line 193, in create_clones
    outputs = model_fn(*args, **kwargs)
  File "/home/sangpilk/git/dash_net/research/object_detection/trainer.py", line 179, in _create_losses
    train_config.use_multiclass_scores)
ValueError: need more than 0 values to unpack

@davidblumntcgeo

@spk921, I've seen "ValueError: need more than 0 values to unpack" when I've had mistakes in my config file that prevented the input dictionary from being populated with labelled images and loaded into GPU memory. A few things to check (this list isn't exhaustive):

  • All your file paths are correct (e.g. TF records, label maps).
  • Your num_clones argument equals your number of GPUs and ps_tasks is set to 1.
  • The batch size in your config file is a multiple of num_clones.
  • You have a valid training TF record containing labelled data, it really is at the path where you think it is, and TF can access that path.
  • The labelled data in your training TF record are the same size as the parameters in your config file imply.
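For illustration, one quick way to rule out an empty or mis-pathed record file is to count the examples in it and peek at the feature keys. A minimal sketch assuming TF 1.x; the record path is a hypothetical placeholder, not a path from this thread:

import tensorflow as tf

# Hypothetical path; use the input_path from your pipeline config's
# train_input_reader section.
record_path = '/data/train.record'

count = 0
for _ in tf.python_io.tf_record_iterator(record_path):
    count += 1
print('%d examples found in %s' % (count, record_path))

# Peek at the first example's feature keys to confirm the record was written
# with the expected fields (image/encoded, image/object/bbox/*, etc.).
for serialized in tf.python_io.tf_record_iterator(record_path):
    example = tf.train.Example()
    example.ParseFromString(serialized)
    print(sorted(example.features.feature.keys()))
    break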

@spk921

spk921 commented Jul 13, 2018

@davidblumntcgeo Thank you for the info. I am using 4 GPUs; should ps_tasks still be 1? Also, could you give me more detail about "labelled data in your training TF record are the same size as the parameters"? I am running 3 classes.

@wendal

wendal commented Jul 15, 2018

Maybe the learning rate is too high; try lowering it.

@austinmw

For other models --num_clones=4 --ps_tasks=1 works well for me, but with the NASNet model this doesn't work. Does anyone know what parameters I need for this model on a single machine with 4 GPUs?

@spk921

spk921 commented Jul 17, 2018

@wendal Thanks, I solved the problem by changing the target class number.
However, what is sha256?
'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
Does this make fetching data faster?

@wendal

wendal commented Jul 17, 2018

Are you using non-ASCII characters?

@spk921

spk921 commented Jul 28, 2018

@wendal So does "sha256" make data fetching faster? The instructions for custom dataset creation don't mention sha256, but the sample TFRecord code has it. What is 'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')) for?
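For illustration: in the OD API's sample dataset-creation scripts, that key is simply a SHA-256 digest of the encoded image bytes, used as a unique identifier for the image; it does not speed up data fetching. A minimal sketch assuming TF 1.x, the object_detection package on PYTHONPATH, and a hypothetical image path:

import hashlib
import tensorflow as tf
from object_detection.utils import dataset_util

# Hypothetical image path; in the sample scripts this comes from the dataset.
with tf.gfile.GFile('/data/images/example.jpg', 'rb') as fid:
    encoded_jpg = fid.read()

# The key is just a SHA-256 hash of the raw image bytes.
key = hashlib.sha256(encoded_jpg).hexdigest()

feature = {
    'image/encoded': dataset_util.bytes_feature(encoded_jpg),
    'image/key/sha256': dataset_util.bytes_feature(key.encode('utf8')),
}
example = tf.train.Example(features=tf.train.Features(feature=feature))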

@karansomaiah

karansomaiah commented Aug 6, 2018

Hello everyone,
The num_clones and ps_tasks options are available for the older versions, where "train.py" is used for training the models. However, the newer versions use "model_main.py", and the flags defined in that file do not include num_clones and ps_tasks. Does anybody know how to specify these when running "model_main.py"?

Edits:
I did a bit of searching and found this link useful. I think it is still an open issue and will be addressed soon. For the time being, num_clones and ps_tasks with train.py is the only option. Please do mention if anyone has found a workaround for this.
Mentioned in this link

@densombs

Hey everyone,
I set num_clones and ps_tasks for my train.py run as described above, yet I get an error that I can't seem to get my head around.

TypeError: cluster must be a dictionary mapping one or more job names to lists of network adresses, or a 'ClusterDef' protocol buffer

I am using TF-GPU 1.8. The training works fine if the above settings are not used. Batch size in faster_rcnn_inception_v2_pets.config is set to 2.

Sorry for any mistakes; so far I have found an answer to all my issues, so this is my first post.
Thank you!

@iampj121

Hi all,
I am training an object detection model using the TensorFlow Object Detection API on my CPU. Though it takes more time, it gives appropriate results. How can I distribute the task so that computation time and load are reduced? Or can anyone tell me how to connect CPUs in a distributed way?
Thanks in advance

@nathanaherne

This is incredibly helpful, thank you @wendal and others!

Has anyone successfully used multiple GPUs with the OD API running in the cloud on GCP ML Engine? If so, how do you set num_clones and ps_tasks in order to use the GPUs on the master and all of the worker machines in the cluster? (Do these arguments just affect an individual machine, or the total number of GPUs in a cluster?)

Also, if you were successful, what TF runtime were you using, and did you have to make any other special mods to the code to fix other bugs (several of which have been reported)?

@davidblumntcgeo did you ever find a solution for running the OD API with multiple GPUs on GCP ML Engine? I am having out-of-memory issues training faster_rcnn_resnet101, and I would like to solve this by using more than one GPU.

@davidblumntcgeo

@nathanaherne, yes, I did. I used runtime 1.6. The only code mods I had to make were those described in #2739 by @andersskog (I haven't cloned the OD API repo recently, and it's possible these fixes have been incorporated in a recent commit). Otherwise, I followed the advice of @wendal in this issue and used a small batch size (num_GPUs x 1 for the Faster R-CNN models). Finally, I made other tweaks to the pipeline config file to reduce memory usage, as described in #1817 by @derekjchow and others. Eventually I shook the OOM error.

@nathanaherne

nathanaherne commented Nov 6, 2018

@davidblumntcgeo thank you for responding to my question so soon.

I will make the recommended changes and see how it goes running on multiple GPUs. I am using runtime version 1.9.

@liangxiao05
Contributor

@wendal Hi, although you can use '--num_clones=2 --ps_tasks=1' to train with two GPUs, does it really speed up the training?
In my tests, I got almost the same results as @gustavomr in #1428:

1. one GPU and batch size = 32
Result: ~ 0.11 (steps/sec)

2. one GPU and batch size = 64
Result: ~ 0.21 (steps/sec)

3. two GPUs and batch size = 32
Result: ~ 0.11 (steps/sec)

4. two GPUs and batch size = 64
Result: ~ 0.21 (steps/sec)

1 vs 3: the same batch size means the same images are used every step, i.e. the same training time for 1 epoch. Two GPUs don't speed up the training, which is confusing!
@jch1 @pkulzc @tfboyd can you share more guidance? Thanks!

@raudipra

Hi @liangxiao05, I think you have to read this.

Using multiple GPUs will not speed up your training by itself; to speed it up you need to recalibrate your steps or epochs based on how many GPUs you have.
Referring to your case, what really happens is:
Case 1 : will process 32 images / 0.11 sec
Case 3 : will process 32 images * 2 gpu / 0.11 sec

@liangxiao05
Contributor

liangxiao05 commented Jan 17, 2019

Referring to your case, what really happens is:
Case 1 : will process 32 images / 0.11 sec
Case 3 : will process 32 images * 2 gpu / 0.11 sec

@raudipra, I use the Object Detection API to train my models.
In this API, the number of images per GPU is batch_size / num_gpus.
Case 3: processes a total of 32 images (2 GPUs), that is 16 images (per GPU) / 0.11 sec.
So 2 GPUs seem to make no difference compared to one GPU?

@raudipra

raudipra commented Jan 17, 2019

@liangxiao05 Where did you find information that says "the images per GPU are batch_size/gpu_numbers"?
AFAIK, based on the replication and distributed TensorFlow implementation, the number of images per GPU is the same as the batch size, so using two GPUs gives you twice the number of images trained in the same period of time.

@liangxiao05
Contributor

@raudipra I used to think the TensorFlow implementation worked the way you said, however it doesn't. You can see it in the released code:

batch_size = train_config.batch_size // num_clones

If you use 2 GPUs to train and set num_clones=2, every GPU clone processes half of batch_size images. Also, you can visualize the training on TensorBoard, and you will see that the data input to each clone is half the batch size.
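To make the arithmetic concrete, a small back-of-the-envelope sketch using the per-clone split quoted above and the step rates reported earlier in this thread (illustrative numbers only, not new measurements):

# train.py splits the configured batch across clones:
#   per-clone batch = train_config.batch_size // num_clones
# so total throughput is just batch_size * steps_per_sec, independent of how
# many clones share that batch.

def images_per_sec(batch_size, steps_per_sec):
    return batch_size * steps_per_sec

print(images_per_sec(32, 0.11))  # case 1: one GPU,  batch 32          -> ~3.5 images/sec
print(images_per_sec(32, 0.11))  # case 3: two GPUs, batch 32 (16/GPU) -> identical

The identical result for cases 1 and 3 is exactly the observation above: with the same global batch size, adding a clone only shrinks the per-GPU batch, so throughput only changes if batch_size changes as well.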

@raudipra

@liangxiao05 Interesting. I will experiment and do some exploration first; I will let you know if I find something.

@wlongxiang

Any progress with this feature?

@lighTQ

lighTQ commented Mar 19, 2019

@chakpongchung I use two GPUs, so num_clones=2, with:

  • ssd_inception_v2 batch_size=16
  • faster_rcnn_resnet101 batch_size=8

Which GPU are you using?

The Faster R-CNN default batch_size is 1; did you change something so that it worked?

@skaldesh

skaldesh commented Jul 5, 2019

Hello, is multi-GPU now supported when using model_main.py?

@tensorflowbutler
Member

Hi there,
We are checking to see if you still need help on this, as this seems to be a considerably old issue. Please update this issue with the latest information, a code snippet to reproduce your issue, and the error you are seeing.
If we don't hear from you in the next 7 days, this issue will be closed automatically. If you don't need help on this issue any more, please consider closing it.
