Description
Hi,
I am running some tests with Slim's ImageNet training using Inception-ResNet-v2. The training is done on AWS EC2 instances (p2.xlarge and p2.8xlarge); here are the specs for both:
- p2.xlarge: GPUs (1), vCPUs (4), RAM (61 GB)
- p2.8xlarge: GPUs (8), vCPUs (32), RAM (488 GB)
The GPUs are all NVIDIA Tesla K80s.
TensorFlow appears to detect all the GPUs and place the training on them, according to both the training output and nvidia-smi. However, there is not much difference in execution time:
- On the p2.xlarge instance, TF/Slim reported an average of 3.05 sec/step.
- On the p2.8xlarge instance, it reported an average of 2.96 sec/step.
I was expecting the step time to drop significantly, but given the above results I do not see much benefit from running the training on multiple GPUs.
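For reference, this is the kind of quick check I can run to confirm which devices TensorFlow itself sees on each instance (just a sketch; device_lib is an internal helper, so I am assuming it behaves the same way in 1.0.1):

python -c "from tensorflow.python.client import device_lib; print([d.name for d in device_lib.list_local_devices()])"

On the p2.8xlarge I would expect this to list /gpu:0 through /gpu:7 in addition to /cpu:0.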
Both instances have a copy of the exact same training dataset and scripts. I am running the training with this command:
DATASET_DIR=/imagenet2
TRAIN_DIR=/imagenet2/train_logs
python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--max_number_of_steps=20000 \
--model_name=inception_resnet_v2
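
For completeness, this is the multi-GPU variant I would expect to need if the script only deploys one model clone by default (I am assuming --num_clones is the relevant flag of train_image_classifier.py and that 8 is the right value for the eight K80s; please correct me if that is not the intended way to use multiple GPUs):

python train_image_classifier.py \
--train_dir=${TRAIN_DIR} \
--dataset_name=imagenet \
--dataset_split_name=train \
--dataset_dir=${DATASET_DIR} \
--max_number_of_steps=20000 \
--model_name=inception_resnet_v2 \
--num_clones=8

My understanding (which may be wrong) is that each clone processes its own batch, so the sec/step number would not necessarily drop; instead, each step would cover roughly 8x as many images.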
Both instances are VMs running TensorFlow 1.0.1 installed from binary, on Ubuntu 14.04 x64.
Regards