Parallelizing ImageNet Training with Metaflow

Repository contains code to parallise training of Imagenet using Metaflow with Kubernetes GPU Support. To train ImageNet Pytorch's example code NN are directly used here.

Metaflow On Kubernetes Setup

Instructions of Setup of Kubernetes GPU based cluster for Metaflow :
- https://github.com/valayDave/metaflow-on-kubernetes-docs
Using @kube(cpu=4,memory=40000,gpu=4,image='anibali/pytorch:cuda-10.1') will enable GPU based training on cluster.
This code works for CPU and GPU setups. GPU jobs can fail If GPU Memory is not enough. This can be solved by increasing number of GPUS in decorator vs Decreasing Batchsize.Specially noticed with Resnet when Running this same flow with only 2 GPUs.
Download Dataset From https://tiny-imagenet.herokuapp.com/
S3 Required for Parallelised GPU training on AWS.
Service based metadata provider required with
Models Trained from script :
- resnet101
- resnet152
- resnet34
- resnet18
- resnet50
- vgg11
- vgg11_bn
- vgg13
- vgg13_bn
- vgg16
- vgg16_bn
- vgg19
- vgg19_bn
- wide_resnet101_2
- wide_resnet50_2

Running Instructions

Upload dataset to S3 and change parameter dataset_s3_path in the imagenet_metaflow.py or add it into your run parameters.
To Run on Local, Comment out @kube decorator and use the below command:
```
python imagenet_metaflow.py --metadata local run --max-workers 2 
```
To Directly deploy the runtime to on Kubernetes cluster provide a service based metadata. Set Max workers in command on basis of available GPUS. Direct deploy command :
```
python imagenet_metaflow.py kube-deploy run --max-workers 2 --dont-exit
```
Analytics Can be see in analytics.ipynb. Some models are yielding nan as loss output so they are not plotted in the results.
If deploying a cluster, Ensure beefy kubernetes master instances because runtime keeps polling kubernetes.

Training Specs

Training Specs :
- 2 p2.8xlarge GPU Instances.
- 20 Epochs per Model
- 24 Hours of training
- Hyper Params in imagenet_metaflow.py

Results

TODO

Create Notebook to Show Results.
Document the Models that are trained using this script.
Run Model training in Parallel for 10 models Over More than 20 Epochs.

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
Plot_Images		Plot_Images
.gitignore		.gitignore
Readme.md		Readme.md
analytics.ipynb		analytics.ipynb
constants.py		constants.py
imagenet_metaflow.py		imagenet_metaflow.py
imagenet_pytorch.py		imagenet_pytorch.py
reporting_data.py		reporting_data.py
reporting_util.py		reporting_util.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Parallelizing ImageNet Training with Metaflow

Metaflow On Kubernetes Setup

Running Instructions

Training Specs

Results

TODO

About

Releases

Packages

Languages

valayDave/imagenet-with-metaflow

Folders and files

Latest commit

History

Repository files navigation

Parallelizing ImageNet Training with Metaflow

Metaflow On Kubernetes Setup

Running Instructions

Training Specs

Results

TODO

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages