Skip to content

Crash when rebooting gunicorn due to apm thread being stuck #1719

Open
@vtoupet

Description

@vtoupet

I am running Django coupled with gunicorn (gevent worker). When gunicorn worker is being auto-restarted (max-requests value is reached), APM client gets stuck most of the time:

Following is the stack trace:

[2023-01-05 13:32:29 +0000] [10] [INFO] Autorestarting worker after current request.
[2023-01-05 13:33:30 +0000] [10] [WARNING] Worker graceful timeout (pid:10)
Exception ignored in: <function AsyncResult.__del__ at 0x7f4e7c013520>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/celery/result.py", line 391, in __del__
    self.backend.remove_pending_result(self)
  File "/usr/local/lib/python3.10/site-packages/celery/backends/asynchronous.py", line 186, in remove_pending_result
    self.on_result_fulfilled(result)
  File "/usr/local/lib/python3.10/site-packages/celery/backends/asynchronous.py", line 194, in on_result_fulfilled
    self.result_consumer.cancel_for(result.id)
  File "/usr/local/lib/python3.10/site-packages/celery/backends/redis.py", line 175, in cancel_for
    self._pubsub.unsubscribe(key)
  File "/usr/local/lib/python3.10/site-packages/redis/client.py", line 3600, in unsubscribe
    return self.execute_command('UNSUBSCRIBE', *args)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/instrumentation/packages/base.py", line 211, in call_if_sampling
    return self.call(module, method, wrapped, instance, args, kwargs)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/instrumentation/packages/redis.py", line 65, in call
    with capture_span(wrapped_name, span_type="db", span_subtype="redis", span_action="query", leaf=True):
  File "/usr/local/lib/python3.10/site-packages/elasticapm/traces.py", line 1076, in __exit__
    self.handle_exit(exc_type, exc_val, exc_tb)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/traces.py", line 1103, in handle_exit
    span = transaction.end_span(self.skip_frames, duration=self.duration, outcome=outcome)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/traces.py", line 373, in end_span
    span.end(skip_frames=skip_frames, duration=duration)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/traces.py", line 698, in end
    p.child_ended(self)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/traces.py", line 116, in child_ended
    with self.compression_buffer_lock:
  File "src/gevent/_semaphore.py", line 278, in gevent._gevent_c_semaphore.Semaphore.__enter__
  File "src/gevent/_semaphore.py", line 279, in gevent._gevent_c_semaphore.Semaphore.__enter__
  File "src/gevent/_semaphore.py", line 180, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "/usr/local/lib/python3.10/site-packages/gevent/thread.py", line 121, in acquire
    acquired = BoundedSemaphore.acquire(self, blocking, timeout)
  File "src/gevent/_semaphore.py", line 180, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 249, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
greenlet.GreenletExit: 
[2023-01-05 13:33:32 +0000] [10] [INFO] Worker exiting (pid: 10)
[2023-01-05 13:33:32 +0000] [160] [INFO] Booting worker with pid: 160

I suspect a problem of monkey patching (due to gevent worker being used). When I do not use apm, this is never stuck and gunicorn can autorestart without SIGKILL/SIGTERM.

I also got the following error which seems similar:

[2023-01-05 13:16:16 +0000] [12] [INFO] Autorestarting worker after current request.
[2023-01-05 13:17:16 +0000] [12] [WARNING] Worker graceful timeout (pid:12)
[2023-01-05 13:17:18 +0000] [12] [INFO] Worker exiting (pid: 12)
[2023-01-05 13:18:16 +0000] [1] [CRITICAL] WORKER TIMEOUT (pid:12)
Exception ignored in atexit callback: <bound method Client.close of <elasticapm.contrib.django.client.DjangoClient object at 0x7f4e6efece80>>
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/elasticapm/base.py", line 331, in close
    manager.stop_thread()
  File "/usr/local/lib/python3.10/site-packages/elasticapm/metrics/base_metrics.py", line 118, in stop_thread
    self.collect()
  File "/usr/local/lib/python3.10/site-packages/elasticapm/metrics/base_metrics.py", line 101, in collect
    self.client.queue(constants.METRICSET, data)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/base.py", line 288, in queue
    self._transport.queue(event_type, data, flush)
  File "/usr/local/lib/python3.10/site-packages/elasticapm/transport/base.py", line 106, in queue
    self._event_queue.put((event_type, data, flush), block=False, **kwargs)
  File "/usr/local/lib/python3.10/queue.py", line 133, in put
    with self.not_full:
  File "/usr/local/lib/python3.10/threading.py", line 265, in __enter__
    return self._lock.__enter__()
  File "src/gevent/_semaphore.py", line 278, in gevent._gevent_c_semaphore.Semaphore.__enter__
  File "src/gevent/_semaphore.py", line 279, in gevent._gevent_c_semaphore.Semaphore.__enter__
  File "src/gevent/_semaphore.py", line 180, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "/usr/local/lib/python3.10/site-packages/gevent/thread.py", line 121, in acquire
    acquired = BoundedSemaphore.acquire(self, blocking, timeout)
  File "src/gevent/_semaphore.py", line 180, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_semaphore.py", line 249, in gevent._gevent_c_semaphore.Semaphore.acquire
  File "src/gevent/_abstract_linkable.py", line 521, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait
  File "src/gevent/_abstract_linkable.py", line 487, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 490, in gevent._gevent_c_abstract_linkable.AbstractLinkable._wait_core
  File "src/gevent/_abstract_linkable.py", line 442, in gevent._gevent_c_abstract_linkable.AbstractLinkable._AbstractLinkable__wait_to_be_notified
  File "src/gevent/_abstract_linkable.py", line 451, in gevent._gevent_c_abstract_linkable.AbstractLinkable._switch_to_hub
  File "src/gevent/_greenlet_primitives.py", line 61, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_greenlet_primitives.py", line 65, in gevent._gevent_c_greenlet_primitives.SwitchOutGreenletWithLoop.switch
  File "src/gevent/_gevent_c_greenlet_primitives.pxd", line 35, in gevent._gevent_c_greenlet_primitives._greenlet_switch
  File "/usr/local/lib/python3.10/site-packages/gunicorn/workers/base.py", line 203, in handle_abort
    sys.exit(1)
SystemExit: 1

The command I use to launch gunicorn is the following:

gunicorn config.wsgi:application --bind 0.0.0.0:8000 \
  --worker-class=gevent \
  --worker-connections=100 \
  --workers=${GUNICORN_WORKERS} \
  --timeout=60 \
  --graceful-timeout=60 \
  --max-requests=500

Environment

  • OS: Linux Ubuntu 20.04
  • Python version: 3.10.6
  • Framework and version: Django 3.2
  • APM Server version: 8.5.2
  • Agent version: 6.13.2

Additional context

  • Agent config options

    Click to expand
    ELASTIC_APM = dict()
    ELASTIC_APM['DEBUG'] = False
    ELASTIC_APM['ENVIRONMENT'] = APP_ENVIRONMENT
    ELASTIC_APM['SECRET_TOKEN'] = ###
    ELASTIC_APM['SERVER_URL'] = env.str('APM_HOST', default='http://localhost:8200')
    ELASTIC_APM['DJANGO_TRANSACTION_NAME_FROM_ROUTE'] = True
    ELASTIC_APM['SERVICE_NAME'] = env.str('APM_SERVICE_NAME', default='Django')
    
    APM_LOGGING_HANDLER_NAME = 'elasticapm'
    LOGGING['handlers'].update(**{
        APM_LOGGING_HANDLER_NAME: {
            'level': 'WARNING',
            'class': 'elasticapm.contrib.django.handlers.LoggingHandler',
        }
    })
    for logger in LOGGING['loggers'].keys():
        LOGGING['loggers'][logger]['handlers'].append(APM_LOGGING_HANDLER_NAME)
    
  • requirements.txt:

    Click to expand
    Django==3.2.16
    gunicorn==20.1.0
    gevent==21.8.0
    
    django-cors-headers==3.13.0
    psycopg2-binary==2.9.5
    djangorestframework==3.14.0
    djangorestframework-gis==1.0
    celery==5.1.2
    redis==3.5.3
    django-redis==5.2.0
    django-environ==0.9.0
    django-storages==1.13.2
    sentry-sdk==1.12.1
    elastic-apm==6.13.2
    python-logstash-async==2.5.0
    django-cleanup==6.0.0
    google-cloud-pubsub==2.13.11
    pyfcm==1.5.4
    pymongo==4.3.3
    packaging>=20.9
    

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions