When the Scheduler/RM fails? #171
Comments
I agree this would be very useful. Under heavy load, our SLURM controller sometimes responds so slowly that users lose the connection to their servers. A simple while loop might already help: retry a query/cancel command a number of times with increasing delay before throwing an error.
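The retry loop suggested above could look roughly like the following. This is only a sketch, not batchspawner code: the function name `run_with_retries` and its parameters are hypothetical, and it assumes the scheduler command signals failure with a non-zero exit status.

```python
import subprocess
import time


def run_with_retries(cmd, attempts=4, first_delay=1.0, factor=2.0, sleep=time.sleep):
    """Run a shell command, retrying on non-zero exit with a growing delay.

    Hypothetical helper for illustration; none of these names or defaults
    are batchspawner options.
    """
    delay = first_delay
    for attempt in range(1, attempts + 1):
        proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if proc.returncode == 0:
            return proc.stdout
        if attempt == attempts:
            # Out of attempts: only now surface the failure to the caller.
            raise RuntimeError(
                f"{cmd!r} failed {attempts} times; last stderr: {proc.stderr.strip()}"
            )
        sleep(delay)     # wait before the next attempt
        delay *= factor  # increase the delay each time
```

With the defaults, a command that keeps failing is tried four times with waits of 1, 2, and 4 seconds in between, so a short controller hiccup is absorbed before any error is raised.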
I think I ran into this issue as well.
I don't mind the polling scheme, but if you do get a response from the single-user server, it seems that would be a nice method. I'm not sure how much I'll dive into the code base for this. I also like the method of retrying a query with a backoff time before throwing an error. This issue is getting a bit more painful for our users, so a quick and dirty wrap of
I have written a draft PR #179 to address the problem.
Closed by #187!
Many schedulers and their sibling daemons are designed so that the controller can be down for a set amount of time while the compute node daemons continue to run the job, even though the controller cannot be reached. Is there any native ability within batchspawner to retry "failed" (i.e., non-zero exit status) commands?
There are at least two edge cases that I can think of:
When the controller can't be reached, the proxy information gets removed and the user loses access to their notebook, even though their job may very well continue to run while the hub thinks the job doesn't exist.
When the JupyterHub process tries to cancel a job but the cancel cannot complete, the job may very well continue to run; with the state information removed from the database, the user loses access to the job.
Both are interesting cases, and I want to configure the environment to be a little more resilient to scheduler failures.
The quick approach is to override the query/submit/cancel commands as part of the configuration, but I'm also curious whether anybody else has run into these issues or thought about them at this point.
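To make the two edge cases concrete, here is a rough sketch of the distinction I have in mind: a failed query is treated as "unknown" rather than "job gone", so the hub would keep its state. Everything here is hypothetical (`JobState`, `classify_query`, `should_forget_job`), and it assumes the query command prints a state word such as `RUNNING` or `PENDING`. It also assumes a non-zero exit means the controller did not answer; a real scheduler may exit non-zero for an unknown job ID too, so stderr would likely need to be inspected as well.

```python
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    PENDING = "pending"
    GONE = "gone"        # scheduler answered: job finished or not found
    UNKNOWN = "unknown"  # scheduler did not answer; job may still be running


def classify_query(returncode, stdout):
    """Map a query command's exit status and output to a JobState.

    Assumption: non-zero exit means the controller was unreachable.
    """
    if returncode != 0:
        return JobState.UNKNOWN
    words = stdout.split()
    if words and words[0] == "RUNNING":
        return JobState.RUNNING
    if words and words[0] == "PENDING":
        return JobState.PENDING
    return JobState.GONE


def should_forget_job(state):
    """Only drop proxy/database state when the scheduler confirmed the job is gone."""
    return state is JobState.GONE
```

The point of the extra `UNKNOWN` state is that the hub would not remove the proxy entry or the stored job ID just because one query or cancel command failed.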