When the Scheduler/RM fails? #171

Closed
jbaksta opened this issue Dec 31, 2019 · 5 comments

Comments


jbaksta commented Dec 31, 2019

Many schedulers and their sibling daemons are designed so that the controller can be down for a set amount of time while the compute node daemons continue running the job, even though the controller cannot be reached. Is there any native ability within batchspawner to retry "failed" (i.e., non-zero exit status) commands?

There are at least two edge cases that I can think of:

  1. When the controller can't be reached, the proxy information gets removed and the user loses access to their notebook, even though their job may very well continue running while the hub thinks the job doesn't exist.

  2. When the JupyterHub process tries to cancel a job but the cancellation cannot complete, the job may very well continue to run; with the state information removed from the database, the user would lose access to it.

Both are interesting cases, and I want to configure the environment to be a little more resilient to scheduler failures.

The quick approach is to override the query/submit/cancel commands as part of the configuration (see the sketch below), but I'm also curious whether anybody else has hit these issues or thought about them at this point.
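As a rough illustration of that quick approach, the snippet below points batchspawner at site-local wrapper scripts that retry the underlying Slurm commands. This is a minimal sketch for jupyterhub_config.py, assuming SlurmSpawner exposes the batch_query_cmd / batch_cancel_cmd traits with {job_id} templating; the wrapper script paths are hypothetical and not part of batchspawner.

```python
# Sketch for jupyterhub_config.py (assumptions: trait names and {job_id}
# templating follow batchspawner's SlurmSpawner; verify against your version).
c.JupyterHub.spawner_class = "batchspawner.SlurmSpawner"

# retry_squeue.sh / retry_scancel.sh are hypothetical site-local wrappers that
# re-run the underlying command a few times before giving up, so a briefly
# unresponsive slurmctld does not look like a dead job to the hub.
c.SlurmSpawner.batch_query_cmd = "/opt/jupyterhub/bin/retry_squeue.sh {job_id}"
c.SlurmSpawner.batch_cancel_cmd = "/opt/jupyterhub/bin/retry_scancel.sh {job_id}"
```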

@joschaschmiedt

I agree this would be very useful. Under heavy load, our SLURM controller sometimes doesn't respond well, and users lose connection to their servers.

Maybe a simple while loop that retries a query/cancel command a number of times with increasing delay before throwing an error would already help; see the sketch below.
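Something along these lines, as a standalone sketch (not batchspawner code): a helper that retries a scheduler command with exponentially increasing delay before raising. A real integration would live inside batchspawner's command-execution path and use its async machinery rather than a blocking subprocess call.

```python
import subprocess
import time

def run_with_backoff(cmd, attempts=5, base_delay=1.0):
    """Run a scheduler command, retrying with increasing delay on failure.

    Illustrative only; `attempts` and `base_delay` are arbitrary choices.
    """
    for attempt in range(attempts):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result.stdout
        if attempt < attempts - 1:
            # e.g. 1s, 2s, 4s, 8s between attempts
            time.sleep(base_delay * 2 ** attempt)
    raise RuntimeError(
        f"{cmd!r} failed after {attempts} attempts: {result.stderr.strip()}"
    )

# Example: query a Slurm job's state, tolerating a briefly unresponsive controller.
# state = run_with_backoff(["squeue", "-h", "-j", "12345", "-o", "%T"])
```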


Hoeze commented Mar 30, 2020

I think I ran into this issue as well.
How about querying the job state only when jupyterhub-singleuser has stopped notifying JupyterHub for some time?

jbaksta commented Apr 8, 2020

I don't mind the polling scheme, but relying on a response from the single-user server when you do get one seems like a nice method. I'm not sure how much I'll dive into the code base for this.

I also like the approach of retrying a query with a back-off delay before throwing an error.

This issue is getting a bit more painful for our users, so a quick and dirty wrapper around squeue may be in order soon to alleviate (not fix) the experience a bit; something like the sketch below.
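For illustration, such a quick-and-dirty wrapper could be a small script on PATH that batch_query_cmd points at; everything here (script name, attempt count, delay) is a hypothetical sketch, not anything from batchspawner or the thread.

```python
#!/usr/bin/env python3
"""Illustrative squeue wrapper: re-run squeue a few times before reporting
failure, so a momentary slurmctld hiccup does not make the hub think the job
has vanished."""
import subprocess
import sys
import time

ATTEMPTS = 4
DELAY = 2  # seconds between attempts

def main():
    cmd = ["squeue"] + sys.argv[1:]
    for attempt in range(ATTEMPTS):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        if proc.returncode == 0:
            sys.stdout.write(proc.stdout)
            return 0
        if attempt < ATTEMPTS - 1:
            time.sleep(DELAY)
    sys.stderr.write(proc.stderr)
    return proc.returncode

if __name__ == "__main__":
    sys.exit(main())
```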

@cmd-ntrf

I have written a draft PR #179 to address the problem.
Feedback welcomed.

@consideRatio

Closed by #187!
