[Slim] Imagenet training not utilizing multiple GPUs efficiently #1428

Closed
redserpent7 opened this issue May 1, 2017 · 24 comments
Assignees
Labels
type:bug Bug in the code

Comments

redserpent7 commented May 1, 2017

Hi,

I am running some tests on Slim's ImageNet training using Inception ResNet V2. The training is done on AWS EC2 instances (p2.xlarge and p2.8xlarge). Here are the specs for both:

  1. p2.xlarge: GPUs (1), vCPUs (4), Ram (61GB)
  2. p2.8xlarge: GPUs (8), vCPUs (32), Ram (488GB)

The GPUs are all NVIDIA Tesla K80s.

TensorFlow seems to detect all the GPUs and load the training onto them, according to both the training output and nvidia-smi. However, there does not seem to be much difference in execution times.

On the p2.xlarge instance, TF/Slim reported an average of 3.05 sec/step.
On the p2.8xlarge instance, it reported an average of 2.96 sec/step.

I was expecting the time to drop significantly, but given the above results I do not see a huge benefit from running the training on multiple GPUs.

Both instances have a copy of the same exact training datasets and scripts. I am running the training using this command:

DATASET_DIR=/imagenet2
TRAIN_DIR=/imagenet2/train_logs
python train_image_classifier.py \
	--train_dir=${TRAIN_DIR} \
	--dataset_name=imagenet \
	--dataset_split_name=train \
	--dataset_dir=${DATASET_DIR} \
	--max_number_of_steps=20000 \
	--model_name=inception_resnet_v2

Both instances are running TensorFlow 1.0.1 installed from binary, as VMs.
Both instances are running Ubuntu 14.04 x64

Regards

@redserpent7 redserpent7 changed the title [Slim] Imagenet training not utilizing multiple gpus efficiently [Slim] Imagenet training not utilizing multiple GPUs efficiently May 2, 2017

tobiajo commented May 3, 2017

Have you tried to set num_clones to the number of GPUs?

@drpngx drpngx added the type:bug Bug in the code label May 6, 2017

drpngx commented May 6, 2017

@tfboyd for performance.


tfboyd commented May 6, 2017

That script does not detect the number of GPUs automatically; you have to set it. To double-check, run nvidia-smi or watch -n 3 nvidia-smi (I think that is correct) and make sure you see 8 Python processes and that all 8 GPUs are active. I have not run that model on Slim, but I have trained some of the other models and you should see a speedup.


redserpent7 commented May 8, 2017

OK, I have set num_clones to 8, which is the number of GPUs I have. I can confirm with nvidia-smi that all GPUs are being fully utilized: Volatile GPU-Util shows each GPU at ~88%-98%, and I can see the training process on all GPUs using 10937MiB.

However, the situation became worse. On a single GPU, training reports an average of 3.1 sec/step. On 8 GPUs it is now reporting an average of 3.4 sec/step.

So I am really confused now, as it seems that not using num_clones gave better results.


tfboyd commented May 8, 2017

I will try to reproduce it today locally and go from there. I have run a few models via slim but not this one.

@tfboyd tfboyd self-assigned this May 8, 2017

tfboyd commented May 10, 2017

I have not forgotten you. I had some personal items and some cleanup to do on another task.


tfboyd commented May 10, 2017

Ahh, I forgot until I set up the example and ran it. What you are seeing is 100% correct and makes sense. Slim is reporting the time per step. You are running synchronized training, so your images/sec would be:

  • (32 / 3.1) × 1 GPU ≈ 10.3 images/sec
  • (32 / 3.4) × 8 GPUs ≈ 75 images/sec

@tfboyd tfboyd closed this as completed May 10, 2017
@redserpent7

@tfboyd interesting. So can you explain this equation? Namely what does the number 32 represent?


tfboyd commented May 11, 2017

@redserpent7 Apologies, and I mean that with sincerity. It makes me crazy when people do not explain things. And I apologize again if I am too detailed, but that is the best way I can think of to explain it.

32 is the batch size, which is the default and very common, so I assume that is what you used.

With one GPU you did 32 images in 3.1 seconds, which would be about 10.3 images per second.
With 8 GPUs you did 8 GPUs * 32 images per GPU, for a total of 256 images in 3.4 seconds, which is about 75 images per second.

What happens when you add more GPUs is that you are processing a larger effective batch size. So instead of 32 images in 3.1 seconds, you are processing 256 images in 3.4 seconds. You could add this formula to the script to get images per second and account for the number of clones; a small sketch of the calculation is below.
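
For illustration only, here is a minimal Python sketch of that throughput calculation (the variable names and example values are mine, not part of the Slim script):

# Rough throughput estimate for synchronous multi-GPU training in Slim.
# Example values taken from the numbers discussed above; adjust to your run.
batch_size_per_clone = 32   # --batch_size (per GPU/clone)
num_clones = 8              # --num_clones
sec_per_step = 3.4          # the "sec/step" value Slim logs

effective_batch = batch_size_per_clone * num_clones     # images per step
images_per_sec = effective_batch / float(sec_per_step)

print("%d images/step, %.1f images/sec" % (effective_batch, images_per_sec))
# -> 256 images/step, 75.3 images/sec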

I definitely understand how this is confusing. I set up a VM and set up the code before I realized the answer to your question. I know I marked this closed, but please, as I said, feel free to ask more questions. I am happy to help. Best of luck.

@redserpent7

@tfboyd thanks so much for the info, much appreciated. It all makes sense now. So I am now trying to calculate the number of steps that I will require. I first started on 1 GPU for 10000 steps, which gave me a top-1 accuracy of 0.0022. I then tried running 100000 steps, which, assuming linear progress, should yield 0.022 accuracy, and using 8 GPUs should give me 8 times that.

Am I correct in my assumption?


tfboyd commented May 15, 2017

@redserpent7 The curves are not usually linear and you may have to adjust your learning rate. I am not familiar with the inception_resnet_v2 model. I calculate based on epochs (the number of times through the entire dataset). I will use ResNet as an example. For ResNet-50 it is common (but there are other approaches) to train for 30 epochs, then reduce the learning rate from .1 to .01, and then after another 30 epochs (60 total) reduce the learning rate from .01 to .001. So if you have 8 GPUs and are using a batch size of 32 per GPU, for a total of 256 images per step, then 30 epochs would be:

1,281,167 [total training images in ImageNet] / 256 [images per step] * 30 [epochs] = 150,137 steps.
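
As a small illustration (my own sketch, not part of the Slim code), the same epoch-to-steps arithmetic in Python:

import math

# Convert a target number of epochs into Slim training steps.
# Values match the ImageNet / 8-GPU example above.
TRAIN_IMAGES = 1281167        # ImageNet training set size
images_per_step = 32 * 8      # batch size per GPU * number of GPUs = 256
epochs = 30

steps = int(math.ceil(TRAIN_IMAGES * epochs / float(images_per_step)))
print(steps)   # 150137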


Zehaos commented May 19, 2017

How can we test it in async mode? Then we could verify whether we are fully using every one of the GPUs.


tfboyd commented May 19, 2017

I do not believe there is an async mode built into the Slim models for multi-GPU. For Inception and ResNet, async is not going to gain much on a single machine with multiple GPUs. Our benchmark scripts have an async mode for distributed training (across servers) but not within a single machine. The concept used for distributed should be applicable to local GPUs, or you could run 8 local worker instances and one ps-server, but that is not likely a great idea. I am not remotely versed in all models, but for the limited models I tested (VGG, AlexNet, ResNet, and Inception) I do not think much is gained from async locally. VGG and AlexNet drop off a little on scaling, but I am not sure that is enough to make async worth it, though I am not saying it is not interesting.


haamoon commented Aug 16, 2017

@tfboyd shouldn't decay_steps here be also divided by FLAGS.num_clones then?


mp7777 commented Oct 11, 2017

Hi Guys,
I can see how the training can be performed on multiple GPUs using num_clones. However, it does not seem to work for eval_image_classifier when running evaluation. Could you kindly explain how to enable multi-GPU evaluation?
thank you

@gustavomr

It was closed, but what is the final conclusion about it?

I used retrain.py (from TF for Poets) with batch size = 100, Inception V3 and learning rate = 0.01; retrain.py only uses the GPU to create the bottlenecks. I got ~0.2 img/sec with that.

Using train_image_classifier.py (from TF-Slim) with batch_size = 32 (because I set it to use GPUs with num_clones=3), Inception V3 and learning rate = 0.01, I also got ~0.2 img/sec.

I find it very confusing, because using retrain with only the CPU for training I got almost the same result (img/sec) as with 3 GPUs. My strategy was to split the batch size of 100 across the 3 GPUs by using 32 per clone on train_image_classifier.


tfboyd commented Feb 7, 2018

TF-Slim is a little slow, but 0.2 is not even close. What do you get with batch-size=32 on one GPU, i.e. num_clones=1? Looking at the internal regression test for TF-Slim, I am seeing 400 ms per step on a P100, which, assuming batch size 32, is 80 images/sec. It even got faster recently, down to 275 ms, which is ~116 images/sec. The benchmark code gets ~130 images/sec on a DGX-1.

Also, what are your command line and GPU? I have the data locally, so it is pretty easy for me to reproduce, and I am willing to give it a try to give you a baseline.

@gustavomr

@tfboyd I'm using a P100 GPU. I will test on the flowers dataset using Inception V3, learning rate 0.01 and 1000 steps. So I will run 4 tests:

  1. CPUs and batch size = 100
  2. one GPU and batch size = 100
  3. three GPUs and batch size = 100
  4. three GPUs and batch size = 32

Which one should be fastest, in your opinion?


tfboyd commented Feb 8, 2018 via email

@gustavomr

OK, so I will set up my training on Slim as you said:

  1. one GPU and batch size = 64
  2. two GPUs and batch size = 64
  3. two GPUs and batch size = 32

For each run I will send you the log and command. As I said, my results are not faster than using retrain (using only the CPU).


gustavomr commented Feb 8, 2018

@tfboyd here are my tests:

TF slim (train_image_classifier.py) using this cmd:

python train_image_classifier.py \
    --train_dir=${TRAIN_DIR} \
    --dataset_name=flowers \
    --dataset_split_name=train \
    --dataset_dir=${DATASET_DIR} \
    --model_name=inception_v3 \
    --checkpoint_path=${PRETRAINED_CHECKPOINT_DIR}/inception_v3.ckpt \
    --checkpoint_exclude_scopes=InceptionV3/Logits,InceptionV3/AuxLogits \
    --trainable_scopes=InceptionV3/Logits,InceptionV3/AuxLogits \
    --max_number_of_steps=1000 \
    --batch_size=[32 or 64] \
    --num_clones=[1 or 2] \
    --learning_rate=0.01 \
    --learning_rate_decay_type=fixed \
    --save_interval_secs=60 \
    --save_summaries_secs=60 \
    --log_every_n_steps=100 \
    --optimizer=rmsprop \
    --weight_decay=0.00004

  1. one GPU and batch size = 32
    Result: ~ 0.11 (steps/sec)

  2. one GPU and batch size = 64
    Result: ~ 0.21 (steps/sec)

  3. two GPU and batch size = 32
    Result: ~ 0.11 (steps/sec)

  4. two GPU and batch size = 64
    Result: ~ 0.21 (steps/sec)

TF for poets (retrain.py) using this cmd:

ARCHITECTURE="inception_v3"

python -m scripts.retrain \
    --bottleneck_dir=tf_files/bottlenecks \
    --how_many_training_steps=1000 \
    --model_dir=tf_files/models/ \
    --summaries_dir=tf_files/training_summaries/"${ARCHITECTURE}" \
    --output_graph=tf_files/retrained_graph.pb \
    --output_labels=tf_files/retrained_labels.txt \
    --architecture="${ARCHITECTURE}" \
    --image_dir=tf_files/flower_photos

  1. one CPU and batch size = 100
    Result: ~ 0.11 (steps/sec)

So, as we can see, the retrain script (the CPU-only test) has better performance than TF-Slim with the equivalent batch size (test 4) using 2 GPUs. What do you think? I attached all the logs to this thread. Thanks again!
flower_logs.zip


tfboyd commented Feb 8, 2018

So you are mixing up the stats. The log reports seconds per step in one place and global steps per second in another.

#slim_1gpu_32batch_size.log
INFO:tensorflow:global step 700: loss = 0.7442 (0.115 sec/step)
INFO:tensorflow:global_step/sec: 8.0496

#slim_1gpu_64batch_size.log
INFO:tensorflow:global step 600: loss = 0.8462 (0.210 sec/step)
INFO:tensorflow:global step 700: loss = 0.7854 (0.210 sec/step)
INFO:tensorflow:Saving checkpoint to path /tmp/flowers-models/inception_v3/model.ckpt
INFO:tensorflow:global_step/sec: 4.51951

#slim_2_gpu_64batch_size  (which would be 32 per GPU)
INFO:tensorflow:global_step/sec: 4.03413

At batch size 32 this would be 8.0496 * 32 ≈ 258 images/sec.
At batch size 64 this would be 4.5 * 64 ≈ 288 images/sec.
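
If it helps, here is a rough Python sketch (my own, assuming the log format shown above) that pulls the global_step/sec values out of a Slim log and converts them to images/sec:

import re

# Effective batch = per-clone batch size * num_clones (assumed 64 * 1 here).
effective_batch = 64 * 1

pattern = re.compile(r"global_step/sec: ([0-9.]+)")
with open("slim_1gpu_64batch_size.log") as f:
    for line in f:
        match = pattern.search(line)
        if match:
            steps_per_sec = float(match.group(1))
            print("%.2f steps/sec -> %.0f images/sec"
                  % (steps_per_sec, steps_per_sec * effective_batch))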

Your examples did not include num_clones, which would be num_clones=2 if you want to use 2 GPUs. I do not have a good guess at the scaling from 1 to 2 GPUs; there are a lot of factors, and Slim has only a single mode for doing it. It can range from quasi-linear to really not good. Other than that, this looks decent to me. I have not run the flowers fine-tuning example before, but this is actually faster than I expected, as training ImageNet on Inception V3 would be closer to 130 images a second on SXM2 P100s, and you are running PCIe P100s that are clocked a little slower.

This may seem like a dig, but it is not: if you had posted the logs the first time, I would have seen the issue instantly. I also should have guessed that you mixed up steps/sec with seconds/step; sorry for not realizing it sooner. I have not looked at this script in a long time.


gustavomr commented Feb 8, 2018

@tfboyd I edited my post to include --num_clones. I did use it when running my script.

I got what you said.

But why do I get better performance using retrain (with only one CPU) compared with TF-Slim? Do you have any idea?

In my opinion, because TF-Slim uses the GPU (and you have the option of how many to use), it should have better performance than retrain.


tfboyd commented Feb 8, 2018

edit 08-FEB-2018
Changed to 99% sure, as I am never 100% positive about much of anything, and I assume the flowers images are similar in size to the ImageNet training data. I am also not positive how fast retrain should be compared to train, but I think my assumptions are decent. CIFAR-10 images could get into the 100s of images/sec on CPU, but anything similar in size to ImageNet with Inception V3 or ResNet is unlikely on current hardware.

I am 99% sure that you are not getting 256 images/sec on the CPU. It likely ended up on the GPU anyway; you can try again with CUDA_VISIBLE_DEVICES='' python blah.py.
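
One quick sanity check (my own sketch, using the TF 1.x device_lib API) to confirm which devices TensorFlow actually sees after hiding the GPUs:

# Run with CUDA_VISIBLE_DEVICES='' to confirm no GPUs are visible.
from tensorflow.python.client import device_lib

for device in device_lib.list_local_devices():
    print(device.device_type, device.name)
# With the GPUs hidden you should only see CPU devices listed.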

I am almost never 100% sure of anything. I do not run this code but I did all of the performance guides and run perf tests all the time and know many of the numbers by heart.

Training ResNet-50 (easier than Inception V3) on CPU is about 6.2 images a second, and that is if you compiled with AVX2 and are using dual Broadwells with 36 physical / 72 logical cores.

Your logs for poets seem to show the GPUs were used, not the CPU as you are stating. The poets script does not print out the step time, but I assume you are using the timestamps to make an educated guess.

On to scaling: Slim has a very simple multi-GPU setup where the variables are placed on gpu:0. I noticed the following in your logs:

2018-02-08 08:58:07.012990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Device peer to peer matrix
2018-02-08 08:58:07.013280: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1126] DMA: 0 1 2 
2018-02-08 08:58:07.013292: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 0:   Y N N 
2018-02-08 08:58:07.013301: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 1:   N Y Y 
2018-02-08 08:58:07.013309: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1136] 2:   N Y Y 

Based on that matrix, it seems your PCIe P100s are not set up to talk to each other with GPUDirect peer-to-peer: GPUs 1 and 2 are, but 0 seems to be hanging out alone. While I cannot be sure, that could create problems if all of the parameters are on GPU:0, and I have seen that in my testing.

You could always try to isolate the two GPUs that are connected with CUDA_VISIBLE_DEVICES; see the sketch below.
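
For example (my own sketch; the environment variable must be set before TensorFlow initializes CUDA, so either export it in the shell before launching train_image_classifier.py or set it at the very top of the script):

import os

# Hide GPU 0 so only the peer-connected pair (1 and 2) is visible to TensorFlow.
os.environ["CUDA_VISIBLE_DEVICES"] = "1,2"

import tensorflow as tf  # imported only after the variable is set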

Finally, while the Slim code is not ideal and is not well supported at this point, I know it scales in at least one instance: ResNet-50 at 1xK80 = 40.5 images/sec (batch size 32 or 64, I forget) and at 8xK80 = 293 images/sec. Not great, but faster. The total batch would be num_gpus * 32 or 64. We are working on revamping the example with the latest APIs.

So that I do not just go away without a conclusion: for me this issue is closed, as there is not much else I can do.
