Description
Hello pyslurm developers,
I work on an HPC performance tool for my university. We want to enable the tool to dispatch measurement executions of a target code to our cluster, which uses SLURM. Ideally, we want to use pyslurm for this.
What we need is a way to:
- Dispatch jobs to the cluster. This is already possible with `job.submit_batch_job`.
- Wait for a job to finish, so that we can examine the results. Ideally this would be a blocking method such as `job.wait(job_id)`, which you could call to wait for the job referenced by `job_id` to finish.
I'm a pyslurm newbie, but as far as I understand, there is no such thing in pyslurm at the moment. As far as I can tell, there are several ways to build this behavior by combining the `find`, `find_id` and `get` methods of the job class.
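A minimal sketch of such a polling-based wait, under stated assumptions: `wait_for_job`, `get_state`, `poll_interval`, `timeout` and `sleep` are hypothetical names and not part of pyslurm. The state lookup is passed in as a callable so that the sketch does not depend on the exact return format of `find_id` or `get`. The set of final states follows Slurm's documented job state names.

```python
import time

# Slurm job state names that this sketch treats as final.
# Note: depending on cluster configuration, a PREEMPTED or NODE_FAIL
# job may be requeued, so treating them as final is an assumption.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "NODE_FAIL",
    "PREEMPTED", "BOOT_FAIL", "DEADLINE", "OUT_OF_MEMORY",
}


def wait_for_job(job_id, get_state, poll_interval=5.0, timeout=None,
                 sleep=time.sleep):
    """Block until get_state(job_id) returns a final state, then return it.

    get_state is a caller-supplied (hypothetical) callable that maps a
    job id to a Slurm state string such as "RUNNING". With pyslurm it
    could be built on find_id, for example (untested assumption about
    the return format):
        lambda jid: pyslurm.job().find_id(str(jid))[0]["job_state"]
    """
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        state = get_state(job_id)
        if state in TERMINAL_STATES:
            return state
        # Give up once the optional timeout (in seconds) has elapsed.
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError(
                f"job {job_id} still in state {state!r} after {timeout} s"
            )
        # Sleep between polls to avoid hammering the Slurm controller.
        sleep(poll_interval)
```

The caller can inspect the returned state to decide whether the results are worth examining, for instance by only reading output files when the state is `"COMPLETED"`. Polling is the simplest option here; the interval trades responsiveness against load on the Slurm controller.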
What do you think would be the best approach here? Would it make sense to build this behavior into pyslurm, or is it something our tool should take care of itself?
I still have to dive deeper into the code, but if there is anything on this topic I can help with, I would be happy to do so. In general, we would like to contribute back whatever knowledge we gain in the process, whether as code or otherwise. Another option would be to see how it turns out on our side first and then contribute back the code or interface we developed, or just some notes for others on how we did it.
Thanks for this great project, I'm excited to hear your thoughts!
Best,
Jonathan