why is srun needed? #41
Comments
I think we looked into this in December. srun is used to run a parallel job: https://slurm.schedmd.com/srun.html
But there's only one job. My submit.sh doesn't have srun, does it?
Also, is there any reason why the arguments are set inside the submission script rather than passed in from outside? https://github.com/sabeenlohawala/tissue_labeling/blob/dev/submit_requeue.sh
If requeue is for resumption, those parameters can be read from the config file, correct? Or is it that you expect requeue for train as well?
Yes, there's only one job, but because the job uses multiple GPUs, the srun command is used to create the multiple processes (from what I understand). Your submit.sh doesn't have it, but Matthias's did. In December, we said that I should modify submit.sh to use the srun command.
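To illustrate the "multiple processes" point: as far as I understand, each task that srun launches gets its own copy of the program plus environment variables such as SLURM_PROCID, SLURM_NTASKS and SLURM_LOCALID, which a training script can use to work out its rank. Below is a minimal sketch of reading them. The function name slurm_task_info is made up for illustration and is not taken from the tissue_labeling code.

```python
import os


def slurm_task_info(env=None):
    """Read the rank information Slurm exposes to each launched task.

    Hypothetical helper for illustration only. Falls back to a
    single-process layout when the variables are absent, e.g. when the
    script is run directly without srun.
    """
    env = os.environ if env is None else env
    return {
        "rank": int(env.get("SLURM_PROCID", 0)),        # global task index
        "world_size": int(env.get("SLURM_NTASKS", 1)),  # total number of tasks
        "local_rank": int(env.get("SLURM_LOCALID", 0)), # task index on this node
    }


if __name__ == "__main__":
    info = slurm_task_info()
    print(f"task {info['rank']} of {info['world_size']} "
          f"(local rank {info['local_rank']})")
```

Run without srun, this reports a single task; launched through srun with several tasks, each process would report a different rank.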
I did this to make sure that if the job was requeued, the correct arguments would still be passed, so the correct logdir and its checkpoints could be found when the job is resumed. If the arguments are passed through the command line / makefile to the .sh script, are they remembered when the job is requeued?
Yes, if the job is requeued it just finds the latest checkpoint and reads in the arguments from the config, but the naming convention of the logdir depends on the values of these arguments.
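A rough sketch of why the arguments matter on requeue: if the logdir name is derived from the argument values, the resumed job has to see the same values to find its checkpoints. Everything here is hypothetical (the key-value naming scheme, the epoch_N.ckpt file pattern, and the function names); the actual convention in the repo may differ.

```python
from pathlib import Path


def logdir_name(args):
    """Build a logdir name from argument values (hypothetical scheme)."""
    return "_".join(f"{key}-{args[key]}" for key in sorted(args))


def latest_checkpoint(logdir):
    """Return the checkpoint with the highest epoch number, or None.

    Assumes files named like epoch_12.ckpt, which is an illustrative
    pattern and not necessarily the one the project uses.
    """
    logdir = Path(logdir)
    if not logdir.is_dir():
        return None
    checkpoints = list(logdir.glob("epoch_*.ckpt"))
    if not checkpoints:
        return None
    # Compare numerically so that epoch_10 ranks above epoch_2.
    return max(checkpoints, key=lambda p: int(p.stem.split("_")[1]))
```

With this kind of scheme, a requeued job that received different argument values would compute a different logdir name and start from scratch instead of resuming.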
Now I understand why I have
Just curious, why is srun needed here? tissue_labeling/submit_multi_gpu.sh (line 20 in f88c8a0)