Description
Apache Airflow Provider(s)
fab
Versions of Apache Airflow Providers
apache-airflow-providers-fab==3.3.0
apache-airflow-providers-common-sql==1.28.2
apache-airflow-providers-mysql==6.3.4
apache-airflow-providers-cncf-kubernetes==10.11.0
apache-airflow-providers-celery==3.13.0
apache-airflow-providers-standard==1.10.0
Apache Airflow version
3.1.6
Operating System
Debian 12 (bookworm) — official Airflow Docker image
Deployment
Official Apache Airflow Helm Chart
Deployment details
- Kubernetes: Amazon EKS
- Metadata DB: Amazon Aurora MySQL (MySQL 8.0 compatible)
- MySQL wait_timeout: 28800 seconds (8 hours)
- SQLAlchemy pool config: pool_recycle=60, pool_pre_ping=true, pool_size=3, max_overflow=2
- api-server replicas: 2 pods
- Airflow image: Custom image based on apache/airflow:3.1.6 with providers-fab==3.3.0
What happened
The cleanup_session_middleware introduced in PR #61480 (included in providers-fab 3.3.0) calls Session.remove() in a bare finally block without any error handling. When the underlying MySQL connection has been closed server-side (e.g., due to timeout, network interruption, or Aurora failover), Session.remove() internally attempts a ROLLBACK on the dead connection, which raises MySQLdb.OperationalError: (2006, 'Server has gone away').
This unhandled exception propagates up as a 500 Internal Server Error to the client, even though the original request may have completed successfully.
Error log from api-server pod:
2026-02-20T05:50:24.526091553Z [error ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
File ".../uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
result = await app(self.scope, self.receive, self.send)
...
File ".../airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
settings.Session.remove()
File ".../sqlalchemy/orm/scoping.py", line 246, in remove
self.registry().close()
File ".../sqlalchemy/orm/session.py", line 2081, in close
self._close_impl(invalidate=False)
File ".../sqlalchemy/orm/session.py", line 2124, in _close_impl
self.rollback()
...
File ".../MySQLdb/connections.py", line 260, in query
_mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')
Relevant source code (fab_auth_manager.py, lines 235-243):
    async def cleanup_session_middleware(request, call_next):
        try:
            response = await call_next(request)
            return response
        finally:
            from airflow import settings
            if settings.Session:
                settings.Session.remove()  # <-- unhandled exception here

The finally block does not catch exceptions from Session.remove(). Since this is a cleanup operation, any failure here should be logged and suppressed, not propagated to the client.
What you think should happen instead
Session.remove() in the finally block should be wrapped with suppress(Exception) to gracefully handle database connection errors during cleanup. The cleanup middleware's purpose is to prevent stale sessions — if cleanup itself fails because the connection is already dead, that's not an error that should affect the HTTP response.
Suggested fix:
    async def cleanup_session_middleware(request, call_next):
        try:
            response = await call_next(request)
            return response
        finally:
            from airflow import settings
            if settings.Session:
                with suppress(Exception):
                    settings.Session.remove()

This is consistent with the suppress(Exception) pattern already used in deserialize_user (PR #62153, merged 2026-02-19) for identical session cleanup error handling. The from contextlib import suppress import already exists in the file.
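If the failure should also be logged rather than silently dropped, an explicit try/except is one alternative to suppress(Exception). The following is only a sketch under stated assumptions, not the provider's code: DeadSession, call_next, handle_request, and the extra session parameter are hypothetical stand-ins so the example runs without Airflow or a database.

```python
import asyncio
import logging

log = logging.getLogger(__name__)


class DeadSession:
    """Hypothetical stand-in for settings.Session bound to a dead connection."""

    def remove(self):
        raise RuntimeError("(2006, 'Server has gone away')")


async def call_next(request):
    # Hypothetical downstream handler that completes successfully.
    return "200 OK"


async def cleanup_session_middleware(request, call_next, session):
    # `session` is passed in explicitly here (an assumption for this sketch);
    # the real middleware reads it from airflow.settings.
    try:
        return await call_next(request)
    finally:
        if session:
            try:
                session.remove()
            except Exception:
                # Cleanup is best effort: record the failure and keep the response.
                log.warning("Session cleanup failed; ignoring", exc_info=True)


def handle_request():
    return asyncio.run(cleanup_session_middleware("GET /", call_next, DeadSession()))


if __name__ == "__main__":
    print(handle_request())
```

Compared with suppress(Exception), this keeps a trace of how often cleanup fails, which may help when diagnosing stale connections, at the cost of a few more lines.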
How to reproduce
- Deploy Airflow 3.1.6 with providers-fab==3.3.0 using MySQL (Aurora MySQL) as metadata DB
- Configure SQLAlchemy with pool_pre_ping=true and pool_recycle=60
- Have api-server running with multiple replicas
- Wait for a MySQL connection in the SQLAlchemy pool to become stale (connection closed server-side due to timeout, network issue, or Aurora maintenance)
- Send a request to the api-server (e.g., login via /auth/fab/v1/login) that triggers cleanup_session_middleware
- The stale connection causes Session.remove() → ROLLBACK → MySQLdb.OperationalError: (2006, 'Server has gone away') → 500 error
Note: This is timing-dependent and occurs intermittently. In our production environment, it appeared on 1 of 2 api-server pods. The issue is more likely to manifest with MySQL than PostgreSQL, since MySQL's Server has gone away error has no automatic retry at the driver level.
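Because the real failure is timing-dependent, the control flow can be illustrated without a database. The sketch below is an assumption-laden simulation, not the actual Airflow code path: FakeOperationalError, DeadSession, call_next, run_request, and the session parameter are all hypothetical names. It shows only that an exception raised in a finally block replaces the return value of a handler that already succeeded.

```python
import asyncio


class FakeOperationalError(Exception):
    """Hypothetical stand-in for MySQLdb.OperationalError."""


class DeadSession:
    """Hypothetical stand-in for a scoped session whose connection is gone."""

    def remove(self):
        raise FakeOperationalError(2006, "Server has gone away")


async def call_next(request):
    # The request handler itself completes successfully.
    return "200 OK"


async def cleanup_session_middleware(request, call_next, session):
    # Same structure as the reported middleware; `session` is injected
    # here (an assumption) instead of being read from airflow.settings.
    try:
        response = await call_next(request)
        return response
    finally:
        # An exception raised here discards the pending return value.
        session.remove()


def run_request():
    try:
        return asyncio.run(
            cleanup_session_middleware("GET /", call_next, DeadSession())
        )
    except FakeOperationalError as exc:
        # Roughly what the ASGI server does: report the failure as a 500.
        return f"500 Internal Server Error: {exc.args}"


if __name__ == "__main__":
    print(run_request())
    # prints: 500 Internal Server Error: (2006, 'Server has gone away')
```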
Anything else
Context — this is a follow-up to PR #61480:
PR #61480 correctly addressed the root cause of PendingRollbackError (issue #59349) by adding cleanup_session_middleware to ensure Session.remove() runs after every request. However, the finally block assumes Session.remove() always succeeds. When the DB connection is already dead, the cleanup itself fails and turns a successful request into a 500 error.
Impact:
- Intermittent 500 errors on api-server login/UI pages
- Self-recovers on retry (next request gets a fresh connection from the pool)
- In our case: 2 occurrences over several days, both on the same pod
Related issues and PRs:
#59349 — Original PendingRollbackError issue that motivated PR #61480
#61480 — PR that introduced cleanup_session_middleware
#62153 — PR that established the suppress(Exception) pattern for session cleanup in deserialize_user (same class of problem, different code path)
#57470, #57859 — Earlier reports of the same session lifecycle problem
Environment evidence:
- pool_pre_ping=true is enabled, which means SQLAlchemy validates connections before use, but Session.remove() bypasses this check since it operates on an already-bound session
- MySQL wait_timeout=28800 (8h) and pool_recycle=60 should prevent most stale connections, but edge cases (Aurora failover, network blips) can still cause disconnections
Full error traceback
2026-02-20T05:50:24.526091553Z [error ] Exception in ASGI application [airflow.providers.fab.auth_manager.fab_auth_manager] loc=fab_auth_manager.py:243
Traceback (most recent call last):
File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/protocols/http/h11_impl.py", line 406, in run_asgi
result = await app(self.scope, self.receive, self.send)
File "/home/airflow/.local/lib/python3.12/site-packages/uvicorn/middleware/proxy_headers.py", line 60, in __call__
return await self.app(scope, receive, send)
File "/home/airflow/.local/lib/python3.12/site-packages/starlette/middleware/base.py", line 101, in __call__
response = await self.dispatch_func(request, call_next)
File "/home/airflow/.local/lib/python3.12/site-packages/airflow/providers/fab/auth_manager/fab_auth_manager.py", line 243, in cleanup_session_middleware
settings.Session.remove()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/scoping.py", line 246, in remove
self.registry().close()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2081, in close
self._close_impl(invalidate=False)
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 2124, in _close_impl
self.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1982, in rollback
self._transaction.rollback(_to_root=True)
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1040, in rollback
self._connection_rollback(self._connections[transaction])
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/orm/session.py", line 1092, in _connection_rollback
connection.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1065, in rollback
self._transaction.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 1768, in rollback
self.connection._rollback_impl()
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 902, in _rollback_impl
self._handle_dbapi_exception(e, None, None, None, None)
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 2240, in _handle_dbapi_exception
raise sqlalchemy_exception.with_traceback(exc_info[2]) from e
File "/home/airflow/.local/lib/python3.12/site-packages/sqlalchemy/engine/base.py", line 899, in _rollback_impl
self.connection.dbapi_connection.rollback()
File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 272, in rollback
self.query("ROLLBACK")
File "/home/airflow/.local/lib/python3.12/site-packages/MySQLdb/connections.py", line 260, in query
_mysql.connection.query(self, query)
MySQLdb.OperationalError: (2006, 'Server has gone away')
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct