
why is srun needed? #41

Open
hvgazula opened this issue Feb 16, 2024 · 7 comments

@hvgazula
Collaborator

Just curious: why is srun needed here?

```bash
srun python -u scripts/commands/main.py train --logdir='20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0' --num_epochs=1000 --batch_size=512 --model_name='segformer' --nr_of_classes=51 --lr=5e-5 --data_size='med'
```

@sabeenlohawala
Owner

We looked into this in December, I think; srun is used to run a parallel job: https://slurm.schedmd.com/srun.html.
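
For reference, a minimal sketch of how srun typically appears inside an sbatch script for a multi-GPU job; the #SBATCH values here are illustrative, not the repo's actual settings:

```bash
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4   # one task (process) per GPU
#SBATCH --gres=gpu:4

# srun launches --ntasks-per-node copies of the command in parallel,
# each with its own SLURM_PROCID / SLURM_LOCALID, so every GPU gets
# its own training process.
srun python -u scripts/commands/main.py train
```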

@hvgazula
Collaborator Author

But there's only one job. My submit.sh doesn't have srun, does it?

@hvgazula
Collaborator Author

Also, is there any reason why the arguments are placed inside the submission script rather than passed in from outside? https://github.com/sabeenlohawala/tissue_labeling/blob/dev/submit_requeue.sh

@hvgazula
Collaborator Author

If requeue is for resumption, those parameters can be read from the config file, correct? Or do you expect requeue for train as well?

@sabeenlohawala
Owner

> But there's only one job. My submit.sh doesn't have srun, does it?

Yes, there's only one job, but because the job uses multiple GPUs, the srun command is used to create the multiple processes (from what I understand). Your submit.sh doesn't have it, but Matthias's did. In December, we decided that I should modify the submit.sh to include the srun command.
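
A sketch of the difference, assuming an allocation with four tasks (illustrative, not the exact scripts):

```bash
# Inside a batch script submitted with, e.g., #SBATCH --ntasks=4:

# without srun: the training script runs once, as a single process,
# on the first node of the allocation
python -u scripts/commands/main.py train

# with srun: SLURM launches one process per task (four here); each
# process sees its own SLURM_PROCID / SLURM_LOCALID, which
# distributed-training code can map to a rank and a GPU device
srun python -u scripts/commands/main.py train
```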

> Also, is there any reason why the arguments are placed inside the submission script rather than passed in from outside? https://github.com/sabeenlohawala/tissue_labeling/blob/dev/submit_requeue.sh

I did this to make sure that if the job was requeued, the correct arguments would still be passed, so the correct logdir and its checkpoints could be found when the job resumed. If the arguments are passed to the .sh script through the command line / Makefile, are they remembered when the job is requeued?
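
For what it's worth, my understanding of SLURM (worth double-checking) is that a requeued batch job re-executes the same script with the same submission arguments, so both styles should survive a requeue. A sketch of the two, with generic script contents that may not match the repo's:

```bash
# Style 1: arguments baked into the script, as submit_requeue.sh does;
# a requeue simply re-runs the script, so nothing can be lost.
sbatch --requeue submit_requeue.sh

# Style 2: arguments passed at submission time; sbatch stores them with
# the job, so a requeued job should receive the same values again.
sbatch --requeue submit.sh --model_name='segformer' --batch_size=512
# ...where submit.sh forwards them:
#   srun python -u scripts/commands/main.py train "$@"
```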

> If requeue is for resumption, those parameters can be read from the config file, correct? Or do you expect requeue for train as well?

Yes, if the job is requeued, it just finds the latest checkpoint and reads the arguments back in from the config, but the naming convention of the logdir depends on the values of these arguments.
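
To make that dependency concrete, here is a hypothetical reading of the logdir name from this thread; the M/S/C/B fields line up with the training command's arguments above, while L and A are my guesses:

```bash
# Hypothetical reconstruction of the naming convention, inferred from
#   20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0
# M = model_name, S = data_size, C = nr_of_classes, B = batch_size
# (these match the training command above); L and A presumably encode
# the loss and an augmentation flag, but that's a guess.
DATE=20240204
MODEL_NAME=segformer
DATA_SIZE=med
LOSS=dice
NR_OF_CLASSES=51
BATCH_SIZE=512
AUGMENT=0
LOGDIR="${DATE}-multi-4gpu-M${MODEL_NAME}\\S${DATA_SIZE}\\L${LOSS}\\C${NR_OF_CLASSES}\\B${BATCH_SIZE}\\A${AUGMENT}"
echo "$LOGDIR"   # -> 20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0
```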

@hvgazula
Collaborator Author

Now I understand why I have train and resume-train. If only I had known about --requeue, I would have combined the two flags by modifying the update_config function to key off the existence of a logdir instead of the flag (train|resume-train). Anyway, good to know. Thanks.
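
For the record, a sketch of that idea in the submission script itself; the results/ path is an assumption, and the matching change would still be needed in update_config on the Python side:

```bash
LOGDIR='20240204-multi-4gpu-Msegformer\Smed\Ldice\C51\B512\A0'

# pick the subcommand from the presence of the logdir rather than
# from a separate train/resume-train flag (hypothetical layout)
if [ -d "results/$LOGDIR" ]; then
    CMD='resume-train'
else
    CMD='train'
fi

srun python -u scripts/commands/main.py "$CMD" --logdir="$LOGDIR" \
    --num_epochs=1000 --batch_size=512 --model_name='segformer' \
    --nr_of_classes=51 --lr=5e-5 --data_size='med'
```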
