When checking the job status crashes, do not force empty status, keep… #178

Closed
wants to merge 1 commit into from

ftorradeflot

The call that checks the job status may occasionally crash, and the reason is generally unrelated to the state of the job itself. When the status check crashes, it makes more sense to keep the previous status than to force an empty one.

Our HTCondor batch system becomes overloaded for a few minutes from time to time. During these periods the condor_q command always fails, and with the current implementation this leads to a submit-kill loop.
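As a minimal sketch of the idea (not the actual diff in this PR), the status query could fall back to the last known status whenever the command itself fails. The function name `query_status`, the 30-second timeout, and the use of `subprocess.run` are assumptions made for illustration and do not necessarily match batchspawner's internals.

```python
import subprocess
import sys


def query_status(cmd, previous_status):
    """Run a status command and return its output.

    If the command crashes, exits non-zero, times out, or cannot be
    started, return previous_status instead of an empty status.
    """
    try:
        result = subprocess.run(
            cmd, capture_output=True, text=True, check=True, timeout=30
        )
    except (subprocess.SubprocessError, OSError):
        # The query failed; this says nothing about the job, so keep
        # whatever status was known before.
        return previous_status
    return result.stdout.strip()


if __name__ == "__main__":
    # A command that exits non-zero stands in for condor_q during an overload.
    failing = [sys.executable, "-c", "import sys; sys.exit(1)"]
    print(query_status(failing, previous_status="RUNNING"))
```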

@cmd-ntrf
Contributor

This issue was also reported in #171.

The solution is a bit more complex than what you are currently proposing. If we simply keep job_state unchanged whenever the state command fails, and the state command then fails every time, batchspawner will keep the spawner alive forever.

This can happen, for example, with Slurm when a job runs out of time. The squeue exit code is 1 when the job id is not in the queue, but it is also 1 when squeue fails to communicate with the resource manager.

We would probably need a regex to determine whether the state command is failing because it cannot communicate with the resource manager.
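A rough sketch of what such a check might look like follows. The patterns and the `classify_query` helper are hypothetical: the exact error text depends on the scheduler and its version, so the strings below are assumptions that would need to be verified against real squeue or condor_q output.

```python
import re

# Hypothetical patterns for "could not reach the resource manager".
# The real messages vary by scheduler and version; these are assumptions.
COMMUNICATION_ERROR = re.compile(
    r"unable to contact slurm controller|socket timed out|connection refused",
    re.IGNORECASE,
)


def classify_query(exit_code, stderr):
    """Classify the outcome of a state command.

    Returns "ok" when the command succeeded, "unknown" when it failed
    because the resource manager could not be reached, and "not_found"
    for any other failure (treated here as the job having left the queue).
    """
    if exit_code == 0:
        return "ok"
    if COMMUNICATION_ERROR.search(stderr):
        return "unknown"
    return "not_found"
```

With a classification like this, only the "unknown" case would keep the previous job_state, while "not_found" would still let the spawner stop.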

@consideRatio
Member

Closed by #187!
