Description
Many schedulers and their sibling daemons are designed so that the controller can be down for a set amount of time while the compute-node daemons continue to run the job, even though the controller cannot be reached. Is there any native ability within batchspawner to retry "failed" (i.e., non-zero exit status) commands?
There are at least two edge cases that I can think of:
- When the controller can't be reached, the proxy information gets removed and the user loses access to their notebook, even though their job may very well continue on despite the hub thinking the job doesn't exist.
- When the JupyterHub process tries to cancel a job but the cancel cannot complete, the job may very well continue to run; with the state information removed from the database, the user would lose access to it.
Both are interesting cases, and I want to configure the environment to be a bit more resilient to scheduler outages.
The quick approach is to override the query/submit/cancel commands as part of the configuration, but I'm also curious whether anybody else has run into these issues or thought about them at this point?
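To illustrate the idea behind the override approach, here is a minimal sketch of a retry wrapper around a scheduler command. This is not batchspawner's own mechanism: `run_with_retries`, the attempt counts, and the backoff values are all hypothetical names and assumptions of mine, and a real integration would need to hook into however batchspawner actually invokes its query/submit/cancel commands.

```python
import subprocess
import time

def run_with_retries(cmd, attempts=3, delay=1.0, backoff=2.0):
    """Run a shell command, retrying on non-zero exit status.

    A non-zero status from a query command may only mean the scheduler
    controller is temporarily unreachable, not that the job is gone, so
    a few retries with backoff can ride out a short controller outage.
    (Hypothetical helper, not part of batchspawner.)
    """
    last = None
    for attempt in range(attempts):
        last = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if last.returncode == 0:
            return last
        if attempt < attempts - 1:
            # Wait before the next attempt, growing the delay each time.
            time.sleep(delay)
            delay *= backoff
    return last  # exhausted retries; caller decides how to treat failure
```

The same effect could be had without touching Python at all, by pointing the configured query command at a small shell script that loops over the real scheduler command a few times before giving up.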