Internal Server Error due to attempt record with start_time/end_time both unset #14768

Open
@jmarshall

Description

What happened?

After a job has been killed manually (via kill -TERM on the worker VM or similar), clicking its job ID on the batch UI page returns a 500 Internal Server Error instead of opening the job page.

The server logs (see below) indicate that the problem is a record in the attempts database table with both start_time and end_time set to NULL. Looking in the database shows that this record belongs to the next, yet-to-be-started attempt, not the attempt that was just killed (which, in our observations, has had at least end_time filled in).

This could be addressed by ensuring that at least one of these fields is always non-NULL, or more likely by making the attempts.sort(…) invocation more robust, e.g., via

attempts.sort(key=lambda x: x['start_time'] or x['end_time'] or MAXINT)

(where MAXINT is a suitable value to make these entries sort last)
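
To make the suggestion concrete, here is a minimal sketch of a sort key that tolerates both fields being unset. It is not the actual front_end.py code: the helper name attempt_sort_key, the attempt_id field, and the sample timestamps are made up for illustration, and math.inf stands in for MAXINT. It checks for None explicitly rather than relying on `or`.

```python
import math


def attempt_sort_key(attempt):
    """Return start_time, else end_time, else +inf so unstarted attempts sort last.

    Hypothetical helper for illustration; not part of the batch code base.
    """
    for field in ('start_time', 'end_time'):
        value = attempt.get(field)
        if value is not None:
            return value
    return math.inf


# Sample records with assumed integer timestamps; the last-created attempt
# has neither field set, like the record described above.
attempts = [
    {'attempt_id': 'a2', 'start_time': None, 'end_time': None},
    {'attempt_id': 'a1', 'start_time': None, 'end_time': 1733352600000},
    {'attempt_id': 'a0', 'start_time': 1733352000000, 'end_time': 1733352300000},
]

attempts.sort(key=attempt_sort_key)
sorted_ids = [a['attempt_id'] for a in attempts]
```

With this key, the yet-to-be-started attempt sorts after every attempt that has a timestamp, while the original key raises TypeError on the same data.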

Version

0.2.133

Relevant log output

{"severity":"ERROR","levelname":"ERROR","asctime":"2024-12-04 22:51:26,136","filename":"web_protocol.py","funcNameAndLine":"log_exception:421","message":"Error handling request","exc_info":"Traceback (most recent call last):
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_protocol.py\", line 452, in _handle_request
    resp = await request_handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_app.py\", line 543, in _handle
    resp = await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_middlewares.py\", line 114, in impl
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/csrf.py\", line 27, in check_csrf_token
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/batch/utils.py\", line 19, in unavailable_if_frozen
    return await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/metrics.py\", line 28, in monitor_endpoints_middleware
    response = await prom_async_time(REQUEST_TIME.labels(endpoint=endpoint, verb=verb), handler(request))  # type: ignore
  File \"/usr/local/lib/python3.9/dist-packages/prometheus_async/aio/_decorators.py\", line 55, in measure
    rv = await future
  File \"/usr/local/lib/python3.9/dist-packages/aiohttp_session/__init__.py\", line 199, in factory
    response = await handler(request)
  File \"/usr/local/lib/python3.9/dist-packages/gear/auth.py\", line 68, in wrapped
    return await fun(request, userdata)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 202, in wrapped
    return await fun(request, userdata, batch_id)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 163, in wrapped
    return await fun(request, userdata, *args, **kwargs)
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2940, in ui_get_job
    job, attempts, job_log_bytes, resource_usage = await asyncio.gather(
  File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2640, in _get_attempts
    attempts.sort(key=lambda x: x['start_time'] or x['end_time'])
TypeError: '<' not supported between instances of 'NoneType' and 'int'","hail_log":1}
