providers-fab: Unhandled MySQLdb.OperationalError in cleanup_session_middleware Session.remove() causes 500 on api-server #62335

@8silvergun

Apache Airflow Provider(s)

fab

Versions of Apache Airflow Providers

apache-airflow-providers-fab==3.3.0
apache-airflow-providers-common-sql==1.28.2
apache-airflow-providers-mysql==6.3.4
apache-airflow-providers-cncf-kubernetes==10.11.0
apache-airflow-providers-celery==3.13.0
apache-airflow-providers-standard==1.10.0

Apache Airflow version

3.1.6

Operating System

Debian 12 (bookworm) — official Airflow Docker image

Deployment

Official Apache Airflow Helm Chart

Deployment details

  • Kubernetes: Amazon EKS
  • Metadata DB: Amazon Aurora MySQL (MySQL 8.0 compatible)
  • MySQL wait_timeout: 28800 seconds (8 hours)
  • SQLAlchemy pool config: pool_recycle=60, pool_pre_ping=true, pool_size=3, max_overflow=2
  • api-server replicas: 2 pods
  • Airflow image: Custom image based on apache/airflow:3.1.6 with providers-fab==3.3.0

What happened

The cleanup_session_middleware introduced in PR #61480 (included in providers-fab 3.3.0) calls Session.remove() in a bare finally block without any error handling. When the underlying MySQL connection has been closed server-side (e.g., due to timeout, network interruption, or Aurora failover), Session.remove() internally attempts a ROLLBACK on the dead connection, which raises MySQLdb.OperationalError: (2006, 'Server has gone away').

This unhandled exception propagates up as a 500 Internal Server Error to the client, even though the original request may have completed successfully.

Error log from api-server pod:

2026-02-20T05:50:24.526091553Z [error    ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
  File ".../uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  ...
  File ".../airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
    settings.Session.remove()
  File ".../sqlalchemy/orm/scoping.py", line 246, in remove
    self.registry().close()
  File ".../sqlalchemy/orm/session.py", line 2081, in close
    self._close_impl(invalidate=False)
  File ".../sqlalchemy/orm/session.py", line 2124, in _close_impl
    self.rollback()
  ...
  File ".../MySQLdb/connections.py", line 260, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')

Relevant source code (fab_auth_manager.py, lines 235-243):

async def cleanup_session_middleware(request, call_next):
    try:
        response = await call_next(request)
        return response
    finally:
        from airflow import settings

        if settings.Session:
            settings.Session.remove()  # <-- unhandled exception here

The finally block does not catch exceptions from Session.remove(). Since this is a cleanup operation, any failure here should be logged and suppressed — not propagated to the client.
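The failure mode can be shown without a database. The following is a minimal sketch, not the provider's code: `FakeSession` is a hypothetical stand-in for `settings.Session` on a dead connection, `call_next` is a stub for the downstream app, and the `session` parameter is injected only to keep the example self-contained.

```python
import asyncio


class FakeSession:
    """Hypothetical stand-in for settings.Session bound to a dead connection."""

    def remove(self):
        # Simulates the ROLLBACK failing on a connection closed server-side.
        raise RuntimeError("(2006, 'Server has gone away')")


async def call_next(request):
    # Stub for the downstream application: the request itself succeeds.
    return "200 OK"


async def cleanup_session_middleware(request, call_next, session):
    try:
        response = await call_next(request)
        return response
    finally:
        # An exception raised here replaces the pending return value.
        session.remove()


def run_request():
    try:
        return asyncio.run(
            cleanup_session_middleware("GET /", call_next, FakeSession())
        )
    except RuntimeError as exc:
        # Stands in for the ASGI server turning the exception into a 500.
        return f"500 caused by cleanup: {exc}"


print(run_request())  # 500 caused by cleanup: (2006, 'Server has gone away')
```

Although `call_next` returned a successful response, the caller only sees the exception raised in the `finally` block.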

What you think should happen instead

Session.remove() in the finally block should be wrapped with suppress(Exception) to gracefully handle database connection errors during cleanup. The cleanup middleware's purpose is to prevent stale sessions — if cleanup itself fails because the connection is already dead, that's not an error that should affect the HTTP response.

Suggested fix:

async def cleanup_session_middleware(request, call_next):
    try:
        response = await call_next(request)
        return response
    finally:
        from airflow import settings

        if settings.Session:
            # suppress is already imported from contextlib at module level
            with suppress(Exception):
                settings.Session.remove()

This is consistent with the suppress(Exception) pattern already used in deserialize_user (PR #62153, merged 2026-02-19) for identical session cleanup error handling. The from contextlib import suppress import already exists in the file.
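If the cleanup failure should also be logged rather than silently dropped, one possible variant is an explicit try/except around the cleanup call. This is only a sketch of an alternative to `suppress(Exception)`, not what PR #62153 does: `DeadSession`, `ok_handler`, the injected `session` parameter, and the log message are all hypothetical names used for illustration.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


class DeadSession:
    """Hypothetical stand-in for settings.Session on a closed connection."""

    def remove(self):
        raise RuntimeError("(2006, 'Server has gone away')")


async def ok_handler(request):
    # Stub for the downstream application: the request itself succeeds.
    return "200 OK"


async def cleanup_session_middleware(request, call_next, session):
    try:
        return await call_next(request)
    finally:
        if session:
            try:
                session.remove()
            except Exception:
                # Cleanup failed; record it but keep the original response.
                log.warning("Session.remove() failed during cleanup", exc_info=True)


result = asyncio.run(cleanup_session_middleware("GET /", ok_handler, DeadSession()))
print(result)  # 200 OK
```

Catching `Exception` rather than `BaseException` leaves `asyncio.CancelledError` untouched, so request cancellation still propagates.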

How to reproduce

  1. Deploy Airflow 3.1.6 with providers-fab==3.3.0 using MySQL (Aurora MySQL) as metadata DB
  2. Configure SQLAlchemy with pool_pre_ping=true and pool_recycle=60
  3. Have api-server running with multiple replicas
  4. Wait for a MySQL connection in the SQLAlchemy pool to become stale (connection closed server-side due to timeout, network issue, or Aurora maintenance)
  5. Send a request to the api-server (e.g., login via /auth/fab/v1/login) that triggers cleanup_session_middleware
  6. The stale connection causes Session.remove() → ROLLBACK → MySQLdb.OperationalError: (2006, 'Server has gone away') → 500 error

Note: This is timing-dependent and occurs intermittently. In our production environment, it appeared on 1 of 2 api-server pods. The issue is more likely to manifest with MySQL than PostgreSQL, since MySQL's Server has gone away error has no automatic retry at the driver level.
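Because the production failure is timing-dependent, a deterministic regression test may be easier to work with. A minimal sketch, assuming a local copy of the suggested fix with the session passed in as a parameter (the real middleware reads `settings.Session` instead), and using `unittest.mock` to force `remove()` to fail:

```python
import asyncio
from contextlib import suppress
from unittest import mock


async def cleanup_session_middleware(request, call_next, session):
    # Local copy of the suggested fix; `session` is injected for testability.
    try:
        return await call_next(request)
    finally:
        if session:
            with suppress(Exception):
                session.remove()


def test_cleanup_failure_does_not_break_response():
    # remove() fails the way it would on a dead connection.
    session = mock.MagicMock()
    session.remove.side_effect = RuntimeError("(2006, 'Server has gone away')")
    # The downstream application succeeds.
    call_next = mock.AsyncMock(return_value="200 OK")

    response = asyncio.run(cleanup_session_middleware("GET /", call_next, session))

    assert response == "200 OK"
    session.remove.assert_called_once()


test_cleanup_failure_does_not_break_response()
```

The same test run against the unpatched middleware would raise the `RuntimeError` instead of returning the response.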

Anything else

Context — this is a follow-up to PR #61480:

PR #61480 correctly addressed the root cause of PendingRollbackError (issue #59349) by adding cleanup_session_middleware to ensure Session.remove() runs after every request. However, the finally block assumes Session.remove() always succeeds. When the DB connection is already dead, the cleanup itself fails and turns a successful request into a 500 error.

Impact:

  • Intermittent 500 errors on api-server login/UI pages
  • Self-recovers on retry (next request gets a fresh connection from the pool)
  • In our case: 2 occurrences over several days, both on the same pod

Related issues and PRs:
#59349 — Original PendingRollbackError issue that motivated PR #61480
#61480 — PR that introduced cleanup_session_middleware
#62153 — PR that established the suppress(Exception) pattern for session cleanup in deserialize_user (same class of problem, different code path)
#57470, #57859 — Earlier reports of the same session lifecycle problem

Environment evidence:

  • pool_pre_ping=true is enabled, so SQLAlchemy validates each connection when it is checked out of the pool. Session.remove() bypasses this check because it issues the ROLLBACK on a connection the session already holds, which may have died after checkout
  • MySQL wait_timeout=28800 (8h) and pool_recycle=60 should prevent most stale connections, but edge cases (Aurora failover, network blips) can still cause disconnections

Full error traceback

2026-02-20T05:50:24.526091553Z [error    ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
    return await self.app(scope, receive, send)
  File "/home/airflow/.local/lib/python3.12/site-packages/starlette/middleware/base.py", line 101, in __call__
    response = await self.dispatch_func(request, call_next)
  File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
    settings.Session.remove()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/scoping.py", line 246, in remove
    self.registry().close()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2081, in close
    self._close_impl(invalidate=False)
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2124, in _close_impl
    self.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1982, in rollback
    self._transaction.rollback(_to_root=True)
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1040, in rollback
    self._connection_rollback(self._connections[transaction])
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1092, in _connection_rollback
    connection.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1065, in rollback
    self._transaction.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1768, in rollback
    self.connection._rollback_impl()
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 902, in _rollback_impl
    self._handle_dbapi_exception(e, None, None, None, None)
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2240, in _handle_dbapi_exception
    raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
  File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 899, in _rollback_impl
    self.connection.dbapi_connection.rollback()
  File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 272, in rollback
    self.query("ROLLBACK")
  File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 260, in query
    _mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
