
When the Scheduler/RM fails? #171

Closed
@jbaksta

Description


Many schedulers and their sibling daemons are designed so that the controller can be down for a period of time while the compute-node daemons continue running jobs, even though the controller cannot be reached. Is there any native ability within batchspawner to retry commands that "fail" (i.e., return a non-zero exit status)?

There are at least two edge cases that I can think of:

  1. When the controller can't be reached, the proxy information gets removed and the user loses access to their notebook, even though their job may very well continue running despite the hub thinking the job doesn't exist.

  2. When the JupyterHub process tries to cancel a job but the cancellation cannot complete, the job may very well continue to run; with the state information removed from the database, the user loses access to it.

Both are interesting cases, and I want to configure the environment to be a little more resilient to scheduler failures.

The quick approach is to override the query/submit/cancel commands as part of the configuration, but I'm also curious whether anybody else has run into these issues or thought about them at this point.
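For reference, a minimal sketch of what that override might look like in jupyterhub_config.py, assuming SlurmSpawner. The retry wrapper script (/opt/site/bin/retry.sh, which would re-run its command a few times with a pause between attempts) is hypothetical and site-specific, and the exact squeue/scancel invocations are illustrative rather than batchspawner's stock templates:

```python
# jupyterhub_config.py -- illustrative sketch, not a drop-in config.
# batch_query_cmd and batch_cancel_cmd are batchspawner traits being
# overridden; retry.sh is a hypothetical site-local wrapper that
# re-runs its command with a delay whenever it exits non-zero.

c.JupyterHub.spawner_class = 'batchspawner.SlurmSpawner'

RETRY = '/opt/site/bin/retry.sh --attempts 5 --delay 10 -- '

# Retry status queries so a brief controller outage isn't treated as
# "job gone" (which would tear down the proxy route for a live notebook).
c.SlurmSpawner.batch_query_cmd = RETRY + "squeue -h -j {job_id} -o '%T %B'"

# Likewise retry cancellation so a failed scancel doesn't leave an
# orphaned job running after the hub has dropped its state.
c.SlurmSpawner.batch_cancel_cmd = RETRY + 'scancel {job_id}'
```

This only papers over transient outages at the command level; it doesn't address the hub removing job state from its database once a query ultimately fails.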
