Description
Running Auto3d with instance22 works with all networks. However, when I duplicated the entries in the dataset JSON to simulate a larger dataset, every network still worked except SwinUNETR.
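For reference, the duplication itself was just repeating the entries in the dataset JSON. A minimal sketch of that step (assuming an MSD-style dataset.json with a "training" list; the file names and repeat count below are placeholders):

import json

# Placeholder file names; assumes an MSD-style dataset.json with a "training" list.
with open("dataset.json") as f:
    data = json.load(f)

# Repeat every training entry N times to simulate a larger dataset.
N = 4
data["training"] = [dict(entry) for entry in data["training"] for _ in range(N)]

with open("dataset_duplicated.json", "w") as f:
    json.dump(data, f, indent=2)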
To Reproduce
1 - Use Auto3d with instance22 and the attached dataset.json (I changed the extension to .txt because .json is not supported for uploading).
2 - Run the script below to trigger only swinunetr:
train_1_node(){
  # WORK_DIR, MODEL, FOLD, SCRIPT, and EXTRA_PARAMS are set by the caller.
  FOLDER="/workspace/${WORK_DIR}/${MODEL}_${FOLD}"
  rm -rf "${FOLDER}/model_fold${FOLD}"
  CONF_FOLDER="${FOLDER}/configs"
  rm -f "${FOLDER}/${MODEL}.log"
  (time \
  torchrun --nnodes=1 --nproc_per_node=8 \
  ${SCRIPT} run \
  --config_file "['${CONF_FOLDER}/hyper_parameters.yaml','${CONF_FOLDER}/network.yaml','${CONF_FOLDER}/transforms_train.yaml','${CONF_FOLDER}/transforms_validate.yaml']" \
  $EXTRA_PARAMS ) 2>&1 | tee -i -p "${FOLDER}/${MODEL}.log"
}
swinunetr(){
  MODEL="swinunetr"
  SCRIPT="-m ${WORK_DIR}.${MODEL}_${FOLD}.scripts.train"
  ## the new default parameters make it run for 20,000 epochs !! force it down to 1,500 iterations
  EXTRA_PARAMS=" --num_images_per_batch 16"
  EXTRA_PARAMS=$EXTRA_PARAMS" --num_patches_per_image 1"
  EXTRA_PARAMS=$EXTRA_PARAMS" --num_iterations 1500"
  EXTRA_PARAMS=$EXTRA_PARAMS" --num_iterations_per_validation 100"
  EXTRA_PARAMS=$EXTRA_PARAMS" --num_sw_batch_size 36"
  train_1_node
}
swinunetr
Error
epoch 8/210
learning rate is set to 0.0001
[2022-11-29 21:44:18] 1/7, train_loss: 0.4237
[2022-11-29 21:44:19] 2/7, train_loss: 0.4575
2022-11-29 21:44:25,647 - > collate dict key "image" out of 4 keys
2022-11-29 21:44:25,701 - >> collate/stack a list of tensors
2022-11-29 21:44:25,705 - >> E: stack expects each tensor to be equal size, but got [1, 96, 96, 64] at entry 0 and [1, 96, 95, 64] at entry 10, shape [(1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 95, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64)] in collate([tensor([[[[0.0000, ...]]]]) ... ])
[tensor dump truncated; each MetaTensor in the dump reports orig_size: (96, 96, 64) and Is batch?: False]
2022-12-06 20:32:04,170 - > collate dict key "label" out of 4 keys
2022-12-06 20:32:04,219 - >> collate/stack a list of tensors
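The failure is the default collate calling torch.stack on patches of unequal spatial size. A minimal sketch that reproduces the same error outside Auto3d (shapes copied from the log above):

import torch

# 16 patches as in the log above; entry 10 is one voxel short along the second spatial dim.
batch = [torch.zeros(1, 96, 96, 64) for _ in range(16)]
batch[10] = torch.zeros(1, 96, 95, 64)

# RuntimeError: stack expects each tensor to be equal size,
# but got [1, 96, 96, 64] at entry 0 and [1, 96, 95, 64] at entry 10
stacked = torch.stack(batch)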
Expected behavior
As you can see from the error log, training actually runs for one, and sometimes up to ten, epochs before erroring out. The expected behavior is for it to continue running to completion.
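A possible workaround (not a fix for whichever transform produces the off-by-one patch) is to pad mismatched items before stacking with MONAI's pad_list_data_collate. A minimal sketch, assuming the training DataLoader's collate_fn can be overridden (the stand-in data below is hypothetical):

import torch
from monai.data import DataLoader, Dataset, pad_list_data_collate

# Hypothetical stand-in batch: one item is a voxel short, as in the log.
data = [{"image": torch.zeros(1, 96, 96, 64)} for _ in range(15)]
data.append({"image": torch.zeros(1, 96, 95, 64)})

# pad_list_data_collate pads each item to the largest shape in the batch before stacking.
loader = DataLoader(Dataset(data=data), batch_size=16, collate_fn=pad_list_data_collate)
batch = next(iter(loader))
print(batch["image"].shape)  # torch.Size([16, 1, 96, 96, 64])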