@gjgetzinger

This PR detects OOM-killed jobs by scanning the Slurm output files, gets the array indices of those jobs, rewrites the job array script with new memory settings, and resubmits only the affected array jobs. This makes it possible to place slurm_apply jobs in a while loop to handle jobs with unpredictable memory usage.
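As a rough illustration of the detection step, here is a minimal sketch in R. It is not the code in this PR: `find_oom_killed()` is a hypothetical helper, and it assumes that the output files are named `slurm_<array index>.out` and that Slurm's out-of-memory message contains `oom-kill` or `oom_kill` (the exact wording varies between Slurm versions).

```r
# Hypothetical helper, NOT the PR's get_oomKill_arrayindex().
# Assumptions: output files are named slurm_<array index>.out, and an
# OOM-killed task leaves a line containing "oom-kill" or "oom_kill".
find_oom_killed <- function(out_dir) {
  files <- list.files(out_dir, pattern = "^slurm_[0-9]+\\.out$",
                      full.names = TRUE)
  # TRUE for each file that mentions an OOM kill anywhere in its text
  killed <- vapply(files, function(path) {
    any(grepl("oom[-_]kill", readLines(path, warn = FALSE),
              ignore.case = TRUE))
  }, logical(1))
  # Recover the array index from the file name
  idx <- as.integer(sub("^slurm_([0-9]+)\\.out$", "\\1",
                        basename(files[killed])))
  sort(idx)
}
```

Calling `find_oom_killed()` on the job's working directory would return an integer vector of array indices, or an empty vector when no output file mentions an OOM kill.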

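# `input` and `sum_function` are assumed to be defined by the user;
# `mem`, `mem_step`, and `max_mem` must be set before the loop.
sjob <- slurm_apply(
  f = function(i) sum_function(input[i]),  # f must be a function of the params columns
  params = data.frame(i = seq_along(input))
)
# Array indices of jobs whose output files report an OOM kill
array_index <- get_oomKill_arrayindex(slr_job = sjob)
# Raise the memory request and resubmit until no job is OOM-killed
# or the memory limit is reached
while (length(array_index) > 0 && mem <= max_mem) {
  mem <- mem + mem_step
  update_sbatch_script(
    slr_job = sjob,
    array_index = array_index,
    mem = mem,
    write_script = TRUE,
    submit_job = TRUE
  )
  array_index <- get_oomKill_arrayindex(slr_job = sjob)
  if (is.null(array_index)) break
}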
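For readers wondering what the script update amounts to, the sketch below shows one way to rewrite the `#SBATCH` directives of a submission script held as a character vector. It is an illustration under stated assumptions, not the implementation of `update_sbatch_script()`: the helper name `set_sbatch_directives()` is hypothetical, and it assumes the first line of the script is the shebang.

```r
# Hypothetical helper, NOT the PR's update_sbatch_script().
# Assumption: script_lines[1] is the shebang line.
set_sbatch_directives <- function(script_lines, array_index, mem) {
  # sbatch accepts a comma-separated list of indices for --array
  array_line <- paste0("#SBATCH --array=",
                       paste(array_index, collapse = ","))
  mem_line <- paste0("#SBATCH --mem=", mem)
  # Drop any existing --array and --mem directives
  keep <- !grepl("^#SBATCH[[:space:]]+--(array|mem)=", script_lines)
  lines <- script_lines[keep]
  # Put the new directives right after the shebang, before any command
  c(lines[1], array_line, mem_line, lines[-1])
}
```

The result could then be written back with `writeLines()` and resubmitted with `sbatch`; how the PR itself handles writing and submission is controlled by its `write_script` and `submit_job` arguments.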

@qdread
Contributor

qdread commented Nov 12, 2019

Hi @gjgetzinger, thanks so much for your PR. Sorry for the very late response! We are going to hold off on merging this for the moment. I think it's a good idea to enable resubmission of the subset of jobs that failed, but we want to think about exactly how to implement it. I will leave it open for now pending further discussion.
