
Auto3D Swinunet fails with Instance22 dataset #5742

Closed
@AHarouni

Description


Running Auto3D with the Instance22 dataset works with all networks. However, when I duplicated the entries in the dataset JSON to simulate a larger dataset, every network still worked except SwinUNETR.

To Reproduce
1 - use Auto3D with Instance22 and the attached dataset.json (I changed the extension to .txt because .json is not supported for uploading)
2 - run the script below to trigger only the swinunetr config
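For reference, the duplication in step 1 was done along these lines (a sketch only: the "training" key assumes an MSD-style dataset.json layout, and duplicate_training is an illustrative helper, not part of Auto3D):

```python
import json

def duplicate_training(path_in, path_out, factor=2):
    """Repeat the "training" entries of an MSD-style dataset.json
    `factor` times to simulate a larger dataset (illustrative helper)."""
    with open(path_in) as f:
        data = json.load(f)
    # Same image/label pairs, listed multiple times.
    data["training"] = data["training"] * factor
    with open(path_out, "w") as f:
        json.dump(data, f, indent=2)
```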

train_1_node(){
    FOLDER="/workspace/${WORK_DIR}/${MODEL}_${FOLD}"
    rm -r "${FOLDER}/model_fold${FOLD}"
    CONF_FOLDER="${FOLDER}/configs"
    rm "${FOLDER}/${MODEL}.log"

    (time \
    torchrun --nnodes=1 --nproc_per_node=8 \
        ${SCRIPT} run \
        --config_file "['${CONF_FOLDER}/hyper_parameters.yaml','${CONF_FOLDER}/network.yaml','${CONF_FOLDER}/transforms_train.yaml','${CONF_FOLDER}/transforms_validate.yaml']" \
        $EXTRA_PARAMS ) 2>&1 | tee -i -p "${FOLDER}/${MODEL}.log"
}

swinunetr(){
    MODEL="swinunetr"
    SCRIPT="-m ${WORK_DIR}.${MODEL}_${FOLD}.scripts.train"
    ## the new parameters make it run for 20,000 epochs !! force it to 1,500
    EXTRA_PARAMS=" --num_images_per_batch 16"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_patches_per_image 1"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_iterations 1500"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_iterations_per_validation 100"
    EXTRA_PARAMS="${EXTRA_PARAMS} --num_sw_batch_size 36"
    train_1_node
}

swinunetr

Error

epoch 8/210
learning rate is set to 0.0001
[2022-11-29 21:44:18] 1/7, train_loss: 0.4237
[2022-11-29 21:44:19] 2/7, train_loss: 0.4575
2022-11-29 21:44:25,647 - > collate dict key "image" out of 4 keys
2022-11-29 21:44:25,701 - >> collate/stack a list of tensors
2022-11-29 21:44:25,705 - >> E: stack expects each tensor to be equal size, but got [1, 96, 96, 64] at entry 0 and [1, 96, 95, 64] at entry 10, shape [(1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 95, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64), (1, 96, 96, 64)] in collate([tensor([[[[0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],
          [0.0000, 0.0000, 0.0000,  ..., 0.0000, 0.0000, 0.0000],

[  0.16601867,   0.11132774,   0.97981832, -12.53823159],
       [  0.        ,   0.        ,   0.        ,   1.        ]])},
                                id: 140606314046512,
                                orig_size: (96, 96, 64)},
                  id: 140604144127376,
                  orig_size: (96, 96, 64)},
    id: 140604144127184,
    orig_size: (96, 96, 64)}]
Is batch?: False] ... )
2022-12-06 20:32:04,170 - > collate dict key "label" out of 4 keys
2022-12-06 20:32:04,219 - >> collate/stack a list of tensors
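The failing check in the log is the standard stacking rule: a batch can only be collated if every patch tensor has an identical shape, and one patch here came out (1, 96, 95, 64). A minimal, torch-free sketch of that check (collate_stack is a hypothetical stand-in; the shapes are copied from the log above):

```python
def collate_stack(shapes):
    """Mimic the shape check torch.stack performs when collating a
    batch: every entry must match the first, else raise."""
    first = shapes[0]
    for i, s in enumerate(shapes):
        if s != first:
            raise RuntimeError(
                f"stack expects each tensor to be equal size, "
                f"but got {list(first)} at entry 0 and {list(s)} at entry {i}"
            )
    # Stacking N tensors of shape `first` yields shape (N, *first).
    return (len(shapes), *first)

# 16 patch shapes from the log: entry 10 is one voxel short in dim 2.
batch = [(1, 96, 96, 64)] * 10 + [(1, 96, 95, 64)] + [(1, 96, 96, 64)] * 5
try:
    collate_stack(batch)
except RuntimeError as e:
    print(e)
```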

Expected behavior
As you can see from the error log, training actually runs for 1 and sometimes up to 10 epochs before it errors out. Expected behavior is for it to keep running to completion.
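If it helps with triage: the usual fix for this class of error is to pad every cropped patch up to the configured ROI before batching, which is what MONAI's SpatialPad transform is for. A shape-only stand-in to show the intent (padded_shape is a hypothetical helper, not a MONAI API):

```python
def padded_shape(patch_shape, roi=(96, 96, 64)):
    """Shape a patch would have after symmetric padding up to `roi`:
    each spatial dim is raised to at least the ROI size."""
    ch, *spatial = patch_shape
    return (ch, *(max(s, r) for s, r in zip(spatial, roi)))

print(padded_shape((1, 96, 95, 64)))  # -> (1, 96, 96, 64)
```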

Labels

question: Further information is requested
