Commit f3c6bdf

Merge pull request NVIDIA#764 from NVIDIA/gh/release
[UNet medical/TF2] Fix
2 parents: d17b10e + 94a8f28

20 files changed: 139 additions, 114 deletions

TensorFlow2/Segmentation/UNet_Medical/README.md

Lines changed: 44 additions & 38 deletions
@@ -231,20 +231,20 @@ For the specifics concerning training and inference, see the [Advanced](#advance
 
 This script will launch a training on a single fold and store the model’s checkpoint in the <path/to/checkpoint> directory.
 
-The script can be run directly by modifying flags if necessary, especially the number of GPUs, which is defined after the `-np` flag. Since the test volume does not have labels, 20% of the training data is used for validation in 5-fold cross-validation manner. The number of fold can be changed using `--crossvalidation_idx` with an integer in range 0-4. For example, to run with 4 GPUs using fold 1 use:
+The script can be run directly by modifying flags if necessary, especially the number of GPUs, which is defined after the `-np` flag. Since the test volume does not have labels, 20% of the training data is used for validation in a 5-fold cross-validation manner. The fold number can be changed using `--fold` with an integer in the range 0-4. For example, to run with 4 GPUs on fold 1, use:
 
 ```bash
-horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --crossvalidation_idx 1 --xla --amp
+horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train --fold 1 --xla --amp
 ```
 
 Training will result in a checkpoint file being written to `./results` on the host machine.
 
 6. Start validation/evaluation.
 
-The trained model can be evaluated by passing the `--exec_mode evaluate` flag. Since evaluation is carried out on a validation dataset, the `--crossvalidation_idx` parameter should be filled. For example:
+The trained model can be evaluated by passing the `--exec_mode evaluate` flag. Since evaluation is carried out on the validation dataset, the `--fold` parameter must be set. For example:
 
 ```bash
-python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --crossvalidation_idx 0 --xla --amp
+python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode evaluate --fold 0 --xla --amp
 ```
 
 Evaluation can also be triggered jointly after training by passing the `--exec_mode train_and_evaluate` flag.
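A combined run under the renamed flag would then look like the following sketch (mirroring the `evaluate` example above; the paths are placeholders):

```bash
horovodrun -np 4 python main.py --data_dir /data --model_dir /results --batch_size 1 --exec_mode train_and_evaluate --fold 0 --xla --amp
```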
@@ -291,19 +291,20 @@ Other folders included in the root directory are:
 The complete list of the available parameters for the `main.py` script contains:
 * `--exec_mode`: Select the execution mode to run the model (default: `train`). Modes available:
   * `train` - trains model from scratch.
-  * `evaluate` - loads checkpoint (if available) and performs evaluation on validation subset (requires `--crossvalidation_idx` other than `None`).
-  * `train_and_evaluate` - trains model from scratch and performs validation at the end (requires `--crossvalidation_idx` other than `None`).
+  * `evaluate` - loads checkpoint (if available) and performs evaluation on the validation subset (requires `--fold` other than `None`).
+  * `train_and_evaluate` - trains model from scratch and performs validation at the end (requires `--fold` other than `None`).
   * `predict` - loads checkpoint (if available) and runs inference on the test set. Stores the results in `--model_dir` directory.
   * `train_and_predict` - trains model from scratch and performs inference.
 * `--model_dir`: Set the output directory for information related to the model (default: `/results`).
 * `--log_dir`: Set the output directory for logs (default: None).
 * `--data_dir`: Set the input directory containing the dataset (default: `None`).
 * `--batch_size`: Size of each minibatch per GPU (default: `1`).
-* `--crossvalidation_idx`: Selected fold for cross-validation (default: `None`).
+* `--fold`: Selected fold for cross-validation (default: `None`).
 * `--max_steps`: Maximum number of steps (batches) for training (default: `1000`).
 * `--seed`: Set random seed for reproducibility (default: `0`).
 * `--weight_decay`: Weight decay coefficient (default: `0.0005`).
 * `--log_every`: Log performance every n steps (default: `100`).
+* `--evaluate_every`: Evaluate every n steps (default: `0` - evaluate once at the end).
 * `--learning_rate`: Model’s learning rate (default: `0.0001`).
 * `--augment`: Enable data augmentation (default: `False`).
 * `--benchmark`: Enable performance benchmarking (default: `False`). If the flag is set, the script runs in a benchmark mode - each iteration is timed and the performance result (in images per second) is printed at the end. Works for both `train` and `predict` execution modes.
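The newly added `--evaluate_every` pairs naturally with `train_and_evaluate`, validating periodically instead of only once at the end. A hypothetical invocation (the step counts are illustrative, not taken from this commit):

```bash
python main.py --data_dir /data --model_dir /results --batch_size 8 --exec_mode train_and_evaluate --fold 0 --max_steps 6400 --evaluate_every 1600 --xla --amp
```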
@@ -324,43 +325,48 @@ usage: main.py [-h]
                [--exec_mode {train,train_and_predict,predict,evaluate,train_and_evaluate}]
                [--model_dir MODEL_DIR] --data_dir DATA_DIR [--log_dir LOG_DIR]
                [--batch_size BATCH_SIZE] [--learning_rate LEARNING_RATE]
-               [--crossvalidation_idx CROSSVALIDATION_IDX]
-               [--max_steps MAX_STEPS] [--weight_decay WEIGHT_DECAY]
+               [--fold FOLD] [--max_steps MAX_STEPS]
+               [--evaluate_every EVALUATE_EVERY] [--weight_decay WEIGHT_DECAY]
                [--log_every LOG_EVERY] [--warmup_steps WARMUP_STEPS]
                [--seed SEED] [--augment] [--benchmark]
                [--amp] [--xla]
 
 UNet-medical
 
 optional arguments:
-  -h, --help            show this help message and exit
-  --exec_mode {train,train_and_predict,predict,evaluate,train_and_evaluate}
-                        Execution mode of running the model
-  --model_dir MODEL_DIR
-                        Output directory for information related to the model
-  --data_dir DATA_DIR   Input directory containing the dataset for training
-                        the model
-  --log_dir LOG_DIR     Output directory for training logs
-  --batch_size BATCH_SIZE
-                        Size of each minibatch per GPU
-  --learning_rate LEARNING_RATE
-                        Learning rate coefficient for AdamOptimizer
-  --crossvalidation_idx CROSSVALIDATION_IDX
-                        Chosen fold for cross-validation. Use None to disable
-                        cross-validation
-  --max_steps MAX_STEPS
-                        Maximum number of steps (batches) used for training
-  --weight_decay WEIGHT_DECAY
-                        Weight decay coefficient
-  --log_every LOG_EVERY
-                        Log performance every n steps
-  --warmup_steps WARMUP_STEPS
-                        Number of warmup steps
-  --seed SEED           Random seed
-  --augment             Perform data augmentation during training
-  --benchmark           Collect performance metrics during training
-  --amp                 Train using TF-AMP
-  --xla                 Train using XLA
+  -h, --help            show this help message and exit
+  --exec_mode {train,train_and_predict,predict,evaluate,train_and_evaluate}
+                        Execution mode of running the model
+  --model_dir MODEL_DIR
+                        Output directory for information related to the model
+  --data_dir DATA_DIR   Input directory containing the dataset for training
+                        the model
+  --log_dir LOG_DIR     Output directory for training logs
+  --batch_size BATCH_SIZE
+                        Size of each minibatch per GPU
+  --learning_rate LEARNING_RATE
+                        Learning rate coefficient for AdamOptimizer
+  --fold FOLD           Chosen fold for cross-validation. Use None to disable
+                        cross-validation
+  --max_steps MAX_STEPS
+                        Maximum number of steps (batches) used for training
+  --weight_decay WEIGHT_DECAY
+                        Weight decay coefficient
+  --log_every LOG_EVERY
+                        Log performance every n steps
+  --evaluate_every EVALUATE_EVERY
+                        Evaluate every n steps
+  --warmup_steps WARMUP_STEPS
+                        Number of warmup steps
+  --seed SEED           Random seed
+  --augment             Perform data augmentation during training
+  --no-augment
+  --benchmark           Collect performance metrics during training
+  --no-benchmark
+  --use_amp, --amp      Train using TF-AMP
+  --use_xla, --xla      Train using XLA
+  --use_trt             Use TF-TRT
+  --resume_training     Resume training from a checkpoint
 ```
 
@@ -420,7 +426,7 @@ horovodrun -np <number/of/gpus> python main.py --data_dir /data [other parameter
 The main result of the training are checkpoints stored by default in `./results/` on the host machine, and in the `/results` in the container. This location can be controlled
 by the `--model_dir` command-line argument, if a different location was mounted while starting the container. In the case when the training is run in `train_and_predict` mode, the inference will take place after the training is finished, and inference results will be stored to the `/results` directory.
 
-If the `--exec_mode train_and_evaluate` parameter was used, and if `--crossvalidation_idx` parameter is set to an integer value of {0, 1, 2, 3, 4}, the evaluation of the validation set takes place after the training is completed. The results of the evaluation will be printed to the console.
+If the `--exec_mode train_and_evaluate` parameter was used, and if the `--fold` parameter is set to an integer value in {0, 1, 2, 3, 4}, the evaluation of the validation set takes place after the training is completed. The results of the evaluation will be printed to the console.
 
 ### Inference process
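Beyond the fold rename, the updated help text above documents `--use_amp`/`--use_xla` as the new canonical spellings and adds `--use_trt` and `--resume_training`. A resumed training run might look like the following sketch (assuming, per the flag descriptions above, that the checkpoint is picked up from `--model_dir`):

```bash
python main.py --data_dir /data --model_dir /results --batch_size 8 --exec_mode train --fold 0 --max_steps 6400 --resume_training --use_xla --use_amp
```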

TensorFlow2/Segmentation/UNet_Medical/examples/unet_1GPU.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP32 on 1 GPU and trains for 6400 iterations with batch_size 8. Usage:
 # bash unet_FP32_1GPU.sh <path to dataset> <path to results directory>
 
-horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --log_dir $2/log.json
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --fold 0 --augment --xla --log_dir $2/log.json
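The same one-token rename recurs in every example script below; any local wrapper scripts still pinned to the old flag can be migrated mechanically with a one-liner like this (a hypothetical helper; the file name is a placeholder):

```bash
# Hypothetical migration helper: rename the removed flag in a local run script.
sed -i 's/--crossvalidation_idx/--fold/g' my_unet_runs.sh
```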

TensorFlow2/Segmentation/UNet_Medical/examples/unet_8GPU.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP32 on 8 GPUs and trains for 6400 iterations with batch_size 8. Usage:
 # bash unet_FP32_8GPU.sh <path to dataset> <path to results directory>
 
-horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --log_dir $2/log.json
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --fold 0 --augment --xla --log_dir $2/log.json

TensorFlow2/Segmentation/UNet_Medical/examples/unet_INFER.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP32 on 1 GPU for inference batch_size 1. Usage:
 # bash unet_INFER_FP32.sh <path to this repository> <path to dataset> <path to results directory>
 
-horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla --fold 0

TensorFlow2/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP32 on 1 GPU for inference benchmarking. Usage:
 # bash unet_INFER_BENCHMARK_FP32.sh <path to dataset> <path to results directory> <batch size>
 
-horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla --fold 0

TensorFlow2/Segmentation/UNet_Medical/examples/unet_INFER_BENCHMARK_TF-AMP.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP16 on 1 GPU for inference benchmarking. Usage:
 # bash unet_INFER_BENCHMARK_TF-AMP.sh <path to dataset> <path to results directory> <batch size>
 
-horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla --amp
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size $3 --exec_mode predict --benchmark --warmup_steps 200 --max_steps 600 --xla --amp --fold 0

TensorFlow2/Segmentation/UNet_Medical/examples/unet_INFER_TF-AMP.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP16 on 1 GPU for inference batch_size 1. Usage:
 # bash unet_INFER_TF-AMP.sh <path to dataset> <path to results directory>
 
-horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla --amp
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --batch_size 1 --exec_mode predict --xla --amp --fold 0

TensorFlow2/Segmentation/UNet_Medical/examples/unet_TF-AMP_1GPU.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP16 on 1 GPU and trains for 6400 iterations batch_size 8. Usage:
 # bash unet_TF-AMP_1GPU.sh <path to dataset> <path to results directory>
 
-horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp --log_dir $2/log.json
+horovodrun -np 1 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --fold 0 --augment --xla --amp --log_dir $2/log.json

TensorFlow2/Segmentation/UNet_Medical/examples/unet_TF-AMP_8GPU.sh

Lines changed: 1 addition & 1 deletion
@@ -15,4 +15,4 @@
 # This script launches U-Net run in FP16 on 8 GPUs and trains for 6400 iterations batch_size 8. Usage:
 # bash unet_TF-AMP_8GPU.sh <path to dataset> <path to results directory>
 
-horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --crossvalidation_idx 0 --augment --xla --amp --log_dir $2/log.json
+horovodrun -np 8 python main.py --data_dir $1 --model_dir $2 --log_every 100 --max_steps 6400 --batch_size 8 --exec_mode train_and_evaluate --fold 0 --augment --xla --amp --log_dir $2/log.json
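After applying the commit, a quick repository-wide check confirms the rename left no stale references (a sketch, run from the repository root):

```bash
# List any remaining references to the removed flag; no output means the rename is complete.
grep -rn "crossvalidation_idx" TensorFlow2/Segmentation/UNet_Medical/
```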
