
[Slim] Imagenet training not utilizing multiple GPUs efficiently #1428

Closed
@redserpent7

Description


Hi,

I am running some tests of Slim's ImageNet training using Inception ResNet V2. The training is done on AWS EC2 instances (p2.xlarge and p2.8xlarge); here are the specs for both:

  1. p2.xlarge: GPUs (1), vCPUs (4), RAM (61 GB)
  2. p2.8xlarge: GPUs (8), vCPUs (32), RAM (488 GB)

The GPUs are all NVIDIA Tesla K80s.

TensorFlow seems to detect all of the GPUs and place work on them, according to both the training output and nvidia-smi. However, there does not seem to be much difference in execution times.
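For anyone trying to reproduce this, GPU utilization can be watched while the job runs with, for example (the 1-second refresh interval is arbitrary):

watch -n 1 nvidia-smi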

On the p2.xlarge instance, TF/Slim reported an average of 3.05 sec/step.
On the p2.8xlarge instance, it reported an average of 2.96 sec/step.

I was expecting the step time to drop significantly on 8 GPUs (ideal scaling would be closer to 3.05 / 8 ≈ 0.38 sec/step), but 2.96 sec/step is only about a 3% improvement, so I do not see much benefit from running the training on multiple GPUs.

Both instances have copies of the exact same training dataset and scripts. I am running the training with the command below; see also the multi-GPU variant sketched right after it.

DATASET_DIR=/imagenet2
TRAIN_DIR=/imagenet2/train_logs
python train_image_classifier.py \
	--train_dir=${TRAIN_DIR} \
	--dataset_name=imagenet \
	--dataset_split_name=train \
	--dataset_dir=${DATASET_DIR} \
	--max_number_of_steps=20000 \
	--model_name=inception_resnet_v2
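
Note that I am not passing any multi-GPU flags. Judging by the flags defined in train_image_classifier.py, I assume multiple GPUs have to be requested explicitly through --num_clones, so on the p2.8xlarge I would expect to need something like the following (untested, so please correct me if that flag is not what controls this):

python train_image_classifier.py \
	--train_dir=${TRAIN_DIR} \
	--dataset_name=imagenet \
	--dataset_split_name=train \
	--dataset_dir=${DATASET_DIR} \
	--max_number_of_steps=20000 \
	--model_name=inception_resnet_v2 \
	--num_clones=8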

Both instances run TensorFlow 1.0.1, installed from the binary release (the EC2 instances themselves are VMs).
Both instances are running Ubuntu 14.04 x64.
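
In case it helps, the installed TensorFlow version on each instance can be double-checked with:

python -c "import tensorflow as tf; print(tf.__version__)"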

Regards
