
batch jobs that started but failed before launching the server leave the hub on unrecoverable state #253

@lexming

Bug description

Once the job scheduler reports that the job is running, batchspawner waits indefinitely to get the server port from the single-user server.

while self.port == 0:
    await gen.sleep(self.startup_poll_interval)
    # Test framework: For testing, mock_port is set because we
    # don't actually run the single-user server yet.
    if hasattr(self, "mock_port"):
        self.port = self.mock_port

If the job fails to launch the server for any reason and stops, batchspawner will continue to wait until start_timeout is reached.
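One way to avoid the hang would be to poll the job state inside the same loop and abort as soon as the job leaves the queue. The sketch below is only an illustration of that idea, not batchspawner's actual code: FailedJobError, wait_for_port, and job_is_alive are hypothetical names, and DummySpawner simulates a job that dies immediately.

```python
import asyncio


class FailedJobError(RuntimeError):
    """Raised when the batch job stops before the server reports its port."""


async def wait_for_port(spawner, poll_interval=0.01):
    # Instead of waiting only on self.port, also check whether the batch
    # job is still alive, so a job that crashed after reaching RUNNING
    # aborts start() instead of hanging until start_timeout.
    while spawner.port == 0:
        await asyncio.sleep(poll_interval)
        if not await spawner.job_is_alive():
            raise FailedJobError(
                "batch job stopped before the single-user server started"
            )


class DummySpawner:
    """Minimal stand-in: the job exits before ever reporting a port."""

    port = 0

    async def job_is_alive(self):
        return False


async def main():
    try:
        await wait_for_port(DummySpawner())
        return None
    except FailedJobError as exc:
        return str(exc)


result = asyncio.run(main())
print(result)
```

With such a check, Spawner.start() would fail fast and JupyterHub could clean up the pending spawn instead of waiting out the full start_timeout.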

Expected behaviour

Batchspawner should detect failed jobs even if they have already started, and abort Spawner.start().

Actual behaviour

JupyterHub waits until start_timeout and prints the usual timeout error. However, at this point it is impossible to start another server. Stopping the server does not work (as it is not running) and the hub will show a popup saying that a new server cannot be started because one is already pending. JupyterHub has to be restarted.

How to reproduce

  1. Create a test batch_script with exit 1 as its first command.
  2. Launch the server as usual with batchspawner.
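The steps above can be reproduced with a jupyterhub_config.py fragment along these lines (a sketch, assuming the SlurmSpawner backend; the SBATCH options are illustrative, and {cmd} is batchspawner's template placeholder for the single-user server command, which here is never reached):

```python
# jupyterhub_config.py (fragment)
# A batch script that exits immediately after the scheduler marks the
# job as RUNNING, so the single-user server never starts.
c.SlurmSpawner.batch_script = """#!/bin/bash
#SBATCH --job-name=spawner-jupyterhub
#SBATCH --output=/dev/null
exit 1
{cmd}
"""
```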

Your personal set up

  • Version(s):
    • Python v3.8.10
    • Jupyterhub v2.3.1
    • Batchspawner v1.2.0
Full environment
alembic @ file:///tmp/wheelhouse/alembic-1.8.0-py3-none-any.whl
async-generator @ file:///tmp/wheelhouse/async_generator-1.10-py3-none-any.whl
attrs @ file:///tmp/wheelhouse/attrs-21.4.0-py2.py3-none-any.whl
batchspawner==1.2.0
certifi @ file:///tmp/wheelhouse/certifi-2022.5.18.1-py3-none-any.whl
certipy @ file:///tmp/wheelhouse/certipy-0.1.3-py3-none-any.whl
cffi @ file:///tmp/wheelhouse/cffi-1.15.0-cp38-cp38-manylinux_2_12_x86_64.manylinux2010_x86_64.whl
charset-normalizer @ file:///tmp/wheelhouse/charset_normalizer-2.0.12-py3-none-any.whl
cryptography @ file:///tmp/wheelhouse/cryptography-37.0.2-cp36-abi3-manylinux_2_24_x86_64.whl
entrypoints @ file:///tmp/wheelhouse/entrypoints-0.4-py3-none-any.whl
future==0.18.2
greenlet @ file:///tmp/wheelhouse/greenlet-1.1.2-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
idna @ file:///tmp/wheelhouse/idna-3.3-py3-none-any.whl
importlib-metadata @ file:///tmp/wheelhouse/importlib_metadata-4.11.4-py3-none-any.whl
importlib-resources @ file:///tmp/wheelhouse/importlib_resources-5.7.1-py3-none-any.whl
Jinja2 @ file:///tmp/wheelhouse/Jinja2-3.1.2-py3-none-any.whl
jsonschema @ file:///tmp/wheelhouse/jsonschema-4.6.0-py3-none-any.whl
jupyter-telemetry @ file:///tmp/wheelhouse/jupyter_telemetry-0.1.0-py3-none-any.whl
jupyterhub @ file:///tmp/wheelhouse/jupyterhub-2.3.1-py3-none-any.whl
jupyterhub-moss @ https://github.com/vub-hpc/jupyterhub_moss/archive/refs/tags/v5.5.1.tar.gz
Mako @ file:///tmp/wheelhouse/Mako-1.2.0-py3-none-any.whl
MarkupSafe @ file:///tmp/wheelhouse/MarkupSafe-2.1.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
mock==4.0.3
oauthenticator==15.1.0
oauthlib @ file:///tmp/wheelhouse/oauthlib-3.2.0-py3-none-any.whl
packaging @ file:///tmp/wheelhouse/packaging-21.3-py3-none-any.whl
pamela @ file:///tmp/wheelhouse/pamela-1.0.0-py2.py3-none-any.whl
prometheus-client @ file:///tmp/wheelhouse/prometheus_client-0.14.1-py3-none-any.whl
pycparser @ file:///tmp/wheelhouse/pycparser-2.21-py2.py3-none-any.whl
pycurl==7.43.0.2
pyOpenSSL @ file:///tmp/wheelhouse/pyOpenSSL-22.0.0-py2.py3-none-any.whl
pyparsing @ file:///tmp/wheelhouse/pyparsing-3.0.9-py3-none-any.whl
pyrsistent @ file:///tmp/wheelhouse/pyrsistent-0.18.1-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl
python-dateutil @ file:///tmp/wheelhouse/python_dateutil-2.8.2-py2.py3-none-any.whl
python-json-logger @ file:///tmp/wheelhouse/python_json_logger-2.0.2-py3-none-any.whl
requests @ file:///tmp/wheelhouse/requests-2.27.1-py2.py3-none-any.whl
ruamel.yaml @ file:///tmp/wheelhouse/ruamel.yaml-0.17.21-py3-none-any.whl
ruamel.yaml.clib @ file:///tmp/wheelhouse/ruamel.yaml.clib-0.2.6-cp38-cp38-manylinux1_x86_64.whl
six @ file:///tmp/wheelhouse/six-1.16.0-py2.py3-none-any.whl
SQLAlchemy @ file:///tmp/wheelhouse/SQLAlchemy-1.4.37-cp38-cp38-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl
tornado @ file:///tmp/wheelhouse/tornado-6.1-cp38-cp38-manylinux2010_x86_64.whl
traitlets @ file:///tmp/wheelhouse/traitlets-5.2.2.post1-py3-none-any.whl
urllib3 @ file:///tmp/wheelhouse/urllib3-1.26.9-py2.py3-none-any.whl
vsc-base==3.4.9
vsc-config @ file:///usr/local/src/vsc-config-master.tar.gz
vsc-install==0.17.26
zipp @ file:///tmp/wheelhouse/zipp-3.8.0-py3-none-any.whl
Configuration

This issue is independent of any configuration settings.
Logs
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.019 JupyterHub user:728] Calling Spawner.start for vsc10122
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.023 JupyterHub spawner:313] Used environment:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.023 JupyterHub spawner:314] Used default URL: /lab
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.048 JupyterHub batchspawner:291] Spawner submitting job using  sbatch --parsable
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.049 JupyterHub batchspawner:292] Spawner submitted script:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #!/bin/bash
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --job-name=spawner-jupyterhub
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --chdir=/user/brussel/101/vsc10122
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --output=/dev/null
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --export=PATH,LANG,LC_ALL,JUPYTERHUB_API_TOKEN,JPY_API_TOKEN,JUPYTERHUB_CLIENT_ID,JUPYTERHUB_HOST,JUPYTERHUB_OAUTH_CALLBACK_URL,JUPYTERHUB_OAUTH_SCOPES,JUPYTERHUB_USER,JUPYTERHUB_SERVER_NAME,JUPYTERHUB_API_URL,JUPYTERHUB_ACTIVITY_URL,JUPYTERHUB_BASE_URL,JUPYTERHUB_SERVICE_PREFIX,JUPYTERHUB_SERVICE_URL,JUPYTERHUB_DEFAULT_URL,USER,HOME,SHELL
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --get-user-env=L
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --partition=broadwell
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --time=1:00:00
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     #SBATCH --cpus-per-task=1
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     set -euo pipefail
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     trap 'echo SIGTERM received' TERM
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     exit 1
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     module load JupyterHub/2.3.1-GCCcore-10.3.0
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     srun --export=ALL batchspawner-singleuser jupyterhub-singleuser
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     echo "JupyterLab server ended gracefully"
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:     EOF
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.639 JupyterHub batchspawner:295] Job submitted. cmd: sbatch --parsable  output: 7837730;hydra
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.648 JupyterHub batchspawner:322] Spawner querying job: squeue -h -j 7837730 -o \'%T %B\'
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [W 2022-12-14 14:05:26.796 JupyterHub base:188] Rolling back dirty objects IdentitySet([<Server(:0)>])
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.819 JupyterHub log:189] 302 POST /hub/spawn -> /hub/spawn-pending/vsc10122 (vsc10122@134.184.143.131) 1032.47ms
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.849 JupyterHub scopes:491] Checking access via scope servers
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:26.850 JupyterHub scopes:402] Argument-based access to /hub/spawn-pending/vsc10122 via servers
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.856 JupyterHub pages:401] vsc10122 is pending spawn
Dec 14 15:05:26 jupyterhub01.phoenix.os conmon[17742]: [I 2022-12-14 14:05:26.870 JupyterHub log:189] 200 GET /hub/spawn-pending/vsc10122 (vsc10122@134.184.143.131) 30.26ms
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.199 JupyterHub batchspawner:437] Job 7837730 still pending
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.339 JupyterHub scopes:491] Checking access via scope read:servers
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.339 JupyterHub scopes:402] Argument-based access to /hub/api/users/vsc10122/server/progress via read:servers
Dec 14 15:05:27 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:05:27.706 JupyterHub batchspawner:322] Spawner querying job: squeue -h -j 7837730 -o \'%T %B\'
Dec 14 15:05:35 jupyterhub01.phoenix.os conmon[17742]: [W 2022-12-14 14:05:35.798 JupyterHub base:1063] User vsc10122 is slow to start (timeout=10)
Dec 14 15:10:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:10:09.386 JupyterHub proxy:821] Proxy: Fetching GET http://127.0.0.1:8001/api/routes
Dec 14 15:10:09 jupyterhub01.phoenix.os conmon[17742]: 14:10:09.400 [ConfigProxy] info: 200 GET /api/routes
Dec 14 15:10:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:10:09.409 JupyterHub proxy:346] Checking routes
Dec 14 15:15:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:15:09.386 JupyterHub proxy:821] Proxy: Fetching GET http://127.0.0.1:8001/api/routes
Dec 14 15:15:09 jupyterhub01.phoenix.os conmon[17742]: 14:15:09.397 [ConfigProxy] info: 200 GET /api/routes
Dec 14 15:15:09 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:15:09.401 JupyterHub proxy:346] Checking routes
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: [W 2022-12-14 14:15:26.023 JupyterHub user:807] vsc10122's server failed to start in 600 seconds, giving up.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:     Common causes of this timeout, and debugging tips:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:     1. Everything is working, but it took too long.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:        To fix: increase `Spawner.start_timeout` configuration
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:        to a number of seconds that is enough for spawners to finish starting.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:     2. The server didn't finish starting,
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:        or it crashed due to a configuration issue.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:        Check the single-user server's logs for hints at what needs fixing.
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]:
Dec 14 15:15:26 jupyterhub01.phoenix.os conmon[17742]: [D 2022-12-14 14:15:26.033 JupyterHub user:913] Stopping vsc10122
