
Training: Less intense GPU usage, longer run time in v0.3.18 versus v0.2.5 #255

@i7878

Description


Hello:

We observed, on identical inputs and parameters, that GPU usage during training is less intense and the run time in v0.3.18 is more than twice as long as in v0.2.5.

topaz train --train-images /path/to/image_list_train.txt --train-targets /path/to/topaz_particles_processed_train.txt \
    -s 0 -p 0 --test-images /path/to/image_list_test.txt --test-targets /path/to/topaz_particles_processed_test.txt \
    --num-particles 500 --learning-rate 0.0002 --minibatch-size 128 --num-epochs 10 --method GE-binomial \
    --slack -1.0 --autoencoder 0.0 --l2 0.0 --minibatch-balance 0.0625 --epoch-size 5000 --model resnet8 \
    --units 32 --dropout 0.0 --bn on --unit-scaling 2 --ngf 32 --num-workers 1 \
    --cross-validation-seed 1039026690 --radius 3 --device 0 --no-pretrained \
    --save-prefix=/path/to/models/model -o /path/to/train_test_curve.txt

With this command, v0.3 printed the notification:

When using GPU to load data, we only load in this process. Setting num_workers = 0.

(in case this is related.)
In the netdata trace of GPU load (image above), the narrower, higher plateau corresponds to a training run with v0.2.5. The subsequent wider, shallower plateau corresponds to the equivalent v0.3.18 run. Is there a combination of parameters that would allow us to reproduce the speed and approximate results of a v0.2.5 run in v0.3.18?
