Description
What happened?
In scenarios where we have recently manually killed a job (via kill -TERM on the worker VM or similar), clicking on the job ID on the batch UI page to go to the particular job page instead results in a 500 Internal Server Error.
The server logs (see below) indicate that the problem is a record in the attempts database table with both start_time and end_time being NULL. Looking in the database shows that the record in question is the one for the next, yet-to-be-started attempt, not the attempt that has just been killed (which, in our observations, has had at least end_time filled in).
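The failure can be reproduced outside the service with a minimal sketch. The rows below are hypothetical: the attempt_id field and all timestamp values are made up for illustration, and only the sort key is copied from the traceback. Integer timestamps are assumed because the error message mentions 'int'.

```python
# Hypothetical rows mimicking the shape described above; values are invented.
attempts = [
    # The killed attempt: start_time and end_time are both filled in.
    {"attempt_id": "killed", "start_time": 1733352000000, "end_time": 1733352600000},
    # The next, not-yet-started attempt: both fields are NULL (None in Python).
    {"attempt_id": "pending", "start_time": None, "end_time": None},
]


def sort_attempts_as_in_traceback(rows):
    # Same key expression as the final frame in the log; it yields None
    # for the pending row, which cannot be compared with an int.
    rows.sort(key=lambda x: x["start_time"] or x["end_time"])


try:
    sort_attempts_as_in_traceback(attempts)
    error = None
except TypeError as exc:
    error = str(exc)

print(error)
```

With one row keyed by an int and the other by None, the comparison inside sort raises a TypeError, which matches the exception type seen in the log.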
This could be addressed by ensuring that at least one of these fields is always non-NULL, or more likely by making the attempts.sort(…) invocation more robust, e.g., via
attempts.sort(key=lambda x: x['start_time'] or x['end_time'] or MAXINT)
(where MAXINT is a suitable value to make these entries sort last).
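One possible shape for such a key function is sketched below. This is not the project's code: attempt_sort_key, sort_attempts and SORT_LAST are hypothetical names, sys.maxsize stands in for MAXINT, and integer timestamps are assumed. It also tests for None explicitly instead of chaining `or`, which is a small deviation from the one-liner above: a timestamp of 0 would be kept rather than skipped.

```python
import sys

# Assumption: sys.maxsize plays the role of MAXINT, so rows without any
# timestamp sort after every row that has one.
SORT_LAST = sys.maxsize


def attempt_sort_key(attempt):
    """Return start_time if set, else end_time if set, else SORT_LAST."""
    for field in ("start_time", "end_time"):
        value = attempt.get(field)
        if value is not None:
            return value
    return SORT_LAST


def sort_attempts(attempts):
    """Return a new list ordered by attempt_sort_key; the input is not modified."""
    return sorted(attempts, key=attempt_sort_key)


# Hypothetical rows: a pending attempt, a normal one, and one with only end_time.
rows = [
    {"id": "pending", "start_time": None, "end_time": None},
    {"id": "ran", "start_time": 20, "end_time": 30},
    {"id": "killed", "start_time": None, "end_time": 10},
]
ordered_ids = [row["id"] for row in sort_attempts(rows)]
print(ordered_ids)
```

Because every key is now an int, the comparison that fails in the traceback cannot occur, and the yet-to-be-started attempt ends up last in the list.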
Version
0.2.133
Relevant log output
{"severity":"ERROR","levelname":"ERROR","asctime":"2024-12-04 22:51:26,136","filename":"web_protocol.py","funcNameAndLine":"log_exception:421","message":"Error handling request","exc_info":"Traceback (most recent call last):
File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_protocol.py\", line 452, in _handle_request
resp = await request_handler(request)
File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_app.py\", line 543, in _handle
resp = await handler(request)
File \"/usr/local/lib/python3.9/dist-packages/aiohttp/web_middlewares.py\", line 114, in impl
return await handler(request)
File \"/usr/local/lib/python3.9/dist-packages/gear/csrf.py\", line 27, in check_csrf_token
return await handler(request)
File \"/usr/local/lib/python3.9/dist-packages/batch/utils.py\", line 19, in unavailable_if_frozen
return await handler(request)
File \"/usr/local/lib/python3.9/dist-packages/gear/metrics.py\", line 28, in monitor_endpoints_middleware
response = await prom_async_time(REQUEST_TIME.labels(endpoint=endpoint, verb=verb), handler(request)) # type: ignore
File \"/usr/local/lib/python3.9/dist-packages/prometheus_async/aio/_decorators.py\", line 55, in measure
rv = await future
File \"/usr/local/lib/python3.9/dist-packages/aiohttp_session/__init__.py\", line 199, in factory
response = await handler(request)
File \"/usr/local/lib/python3.9/dist-packages/gear/auth.py\", line 68, in wrapped
return await fun(request, userdata)
File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 202, in wrapped
return await fun(request, userdata, batch_id)
File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 163, in wrapped
return await fun(request, userdata, *args, **kwargs)
File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2940, in ui_get_job
job, attempts, job_log_bytes, resource_usage = await asyncio.gather(
File \"/usr/local/lib/python3.9/dist-packages/batch/front_end/front_end.py\", line 2640, in _get_attempts
attempts.sort(key=lambda x: x['start_time'] or x['end_time'])
TypeError: '<' not supported between instances of 'NoneType' and 'int'","hail_log":1}