
why is srun needed? #41

Open
hvgazula opened this issue Feb 16, 2024 · 7 comments

@hvgazula
Collaborator

Just curious: why is srun needed here?

```bash
srun python -u scripts/commands/main.py train --logdir='20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0' --num_epochs=1000 --batch_size=512 --model_name='segformer' --nr_of_classes=51 --lr=5e-5 --data_size='med'
```

@sabeenlohawala
Owner

We looked into this in December, I think; srun is used to run a parallel job: https://slurm.schedmd.com/srun.html.
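
For reference, a minimal sketch of how srun typically appears inside an sbatch script for a multi-GPU job; the #SBATCH values here are illustrative, not the repo's actual settings:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # one task (process) per GPU
#SBATCH --gres=gpu:4

# srun launches --ntasks-per-node copies of the command in parallel,
# each with its own SLURM_PROCID / SLURM_LOCALID, so every GPU gets
# its own training process.
srun python -u scripts/commands/main.py train
```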

@hvgazula
Collaborator Author

But there's only one job. My submit.sh doesn't have srun, does it?

@hvgazula
Collaborator Author

Also, is there any reason why the arguments are placed inside the submission script rather than passed in from outside? https://github.com/sabeenlohawala/tissue_labeling/blob/dev/submit_requeue.sh

@hvgazula
Collaborator Author

If requeue is for resumption, those parameters can be read from the config file, correct? Or do you expect requeue for train as well?

@sabeenlohawala
Owner

> But there's only one job. My submit.sh doesn't have srun, does it?

Yes, there's only one job, but because the job uses multiple GPUs, the srun command is used to create the multiple processes (from what I understand). Your submit.sh doesn't have it, but Matthias's did. In December, we decided that I should modify the submit.sh to include the srun command.
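
A sketch of the difference, assuming an allocation with four tasks (illustrative, not the exact scripts):

```bash
# Inside a batch script submitted with, e.g., #SBATCH --ntasks=4:

# without srun: the training script runs once, as a single process,
# on the first node of the allocation
python -u scripts/commands/main.py train

# with srun: SLURM launches one process per task (four here); each
# process sees its own SLURM_PROCID / SLURM_LOCALID, which
# distributed-training code can map to a rank and a GPU device
srun python -u scripts/commands/main.py train
```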

> Also, is there any reason why the arguments are placed inside the submission script rather than passed in from outside? https://github.com/sabeenlohawala/tissue_labeling/blob/dev/submit_requeue.sh

I did this to make sure that if the job was requeued, the correct arguments would still be passed, so the correct logdir and its checkpoints could be found when the job resumed. If the arguments are passed to the .sh script through the command line / Makefile, are they remembered when the job is requeued?
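
For what it's worth, my understanding of SLURM (worth double-checking) is that a requeued batch job re-executes the same script with the same submission arguments, so both styles should survive a requeue. A sketch of the two, with generic script contents that may not match the repo's:

```bash
# Style 1: arguments baked into the script, as submit_requeue.sh does;
# a requeue simply re-runs the script, so nothing can be lost.
sbatch --requeue submit_requeue.sh

# Style 2: arguments passed at submission time; sbatch stores them with
# the job, so a requeued job should receive the same values again.
sbatch --requeue submit.sh --model_name='segformer' --batch_size=512
# ...where submit.sh forwards them:
#   srun python -u scripts/commands/main.py train "$@"
```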

> If requeue is for resumption, those parameters can be read from the config file, correct? Or do you expect requeue for train as well?

Yes, if the job is requeued, it just finds the latest checkpoint and reads the arguments back in from the config, but the naming convention of the logdir depends on the values of these arguments.
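
To make that dependency concrete, here is a hypothetical reading of the logdir name from this thread; the M/S/C/B fields line up with the training command's arguments above, while L and A are my guesses:

```bash
# Hypothetical reconstruction of the naming convention, inferred from
#   20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0
# M = model_name, S = data_size, C = nr_of_classes, B = batch_size
# (these match the training command above); L and A presumably encode
# the loss and an augmentation flag, but that's a guess.
DATE=20240204
MODEL_NAME=segformer
DATA_SIZE=med
LOSS=dice
NR_OF_CLASSES=51
BATCH_SIZE=512
AUGMENT=0
LOGDIR="${DATE}-multi-4gpu-M${MODEL_NAME}\\S${DATA_SIZE}\\L${LOSS}\\C${NR_OF_CLASSES}\\B${BATCH_SIZE}\\A${AUGMENT}"
echo "$LOGDIR"   # -> 20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0
```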

@hvgazula
Collaborator Author

Now I understand why I have train and resume-train. If only I had known about --requeue, I would have combined the two flags by modifying the update_config function to key off the existence of a logdir instead of the flag (train|resume-train). Anyway, good to know. Thanks.
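
For the record, a sketch of that idea in the submission script itself; the results/ path is an assumption, and the matching change would still be needed in update_config on the Python side:

```bash
LOGDIR='20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0'

# pick the subcommand from the presence of the logdir rather than
# from a separate train/resume-train flag (hypothetical layout)
if [ -d "results/$LOGDIR" ]; then
    CMD='resume-train'
else
    CMD='train'
fi

srun python -u scripts/commands/main.py "$CMD" --logdir="$LOGDIR" \
    --num_epochs=1000 --batch_size=512 --model_name='segformer' \
    --nr_of_classes=51 --lr=5e-5 --data_size='med'
```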
