Skip to content

SQL_ALCHEMY_CONN_CMD causes triggerers to fail liveness probes on peak #33485

@agomez-etsy

Description

@agomez-etsy

Apache Airflow version

Other Airflow 2 version (please specify below)

What happened

Airflow version: 2.5.3

Related to this comment from @vchiapaikeo: #33172 (comment)

A couple of mins after midnight UTC - when 100s of DAGs are kicked off - we noticed our triggerer replicas failing liveness probe checks and restarting systematically.

Further profiling led to the discovery that the triggerer’s sync loop hangs for several minutes when there are 1000s of triggers running simultaneously, specifically while bulk fetching triggers, which causes the triggerer to miss heartbeats and eventually get restarted by k8s.

With profiling still enabled, we observed that while the trigger is hanging and we profile the execution, we get this stack trace:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           [506/45463]
        1    0.000    0.000   29.928   29.928 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/orm/query.py:2757(all)               
        1    0.000    0.000   29.923   29.923 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py:1468(all)      
        1    0.000    0.000   29.923   29.923 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py:395(_allrows)  
        1    0.000    0.000   29.923   29.923 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py:1388(_fetchall_impl)       
        1    0.000    0.000   29.923   29.923 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py:1808(_fetchall_impl)                                                                                                                                                                                                                                                                                                                                                                                
        2    0.000    0.000   29.922   14.961 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/orm/loading.py:135(chunks)
        1    0.000    0.000   29.921   29.921 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py:390(_raw_all_rows)
        1    0.001    0.001   29.921   29.921 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/engine/result.py:393(<listcomp>)                                                                                                                                                                                                                                                                                                                                                                                                                        
      125    0.000    0.000   29.919    0.239 /home/airflow/.local/lib/python3.10/site-packages/sqlalchemy/sql/type_api.py:1711(process) 
      125    0.002    0.000   29.915    0.239 /home/airflow/.local/lib/python3.10/site-packages/airflow/utils/sqlalchemy.py:146(process_result_value)  
      125    0.001    0.000   29.909    0.239 /home/airflow/.local/lib/python3.10/site-packages/airflow/utils/sqlalchemy.py:122(db_supports_json)    
      125    0.001    0.000   29.908    0.239 /home/airflow/.local/lib/python3.10/site-packages/airflow/configuration.py:562(get)                  
      125    0.000    0.000   29.907    0.239 /home/airflow/.local/lib/python3.10/site-packages/airflow/configuration.py:732(_get_environment_variables)
      125    0.002    0.000   29.907    0.239 /home/airflow/.local/lib/python3.10/site-packages/airflow/configuration.py:478(_get_env_var_option)
      125    0.002    0.000   29.902    0.239 /home/airflow/.local/lib/python3.10/site-packages/airflow/configuration.py:103(run_command)                
      125    0.001    0.000   29.786    0.238 /usr/local/lib/python3.10/subprocess.py:1110(communicate)                                                       
      125    0.006    0.000   29.785    0.238 /usr/local/lib/python3.10/subprocess.py:1952(_communicate)                                             
      250    0.003    0.000   29.762    0.119 /usr/local/lib/python3.10/selectors.py:403(select)                                         
      250   29.758    0.119   29.758    0.119 {method 'poll' of 'select.poll' objects}                                                                  
      125    0.002    0.000    0.100    0.001 /usr/local/lib/python3.10/subprocess.py:758(__init__)                                                                  
      125    0.004    0.000    0.094    0.001 /usr/local/lib/python3.10/subprocess.py:1687(_execute_child)

Which indicates that airflow is running a subprocess for each fetched row and that takes the vast majority of the execution time.

We found that during the unmarshaling of the resulting rows into the Trigger model, the kwargs column (ExtendedJSON) runs process_returned_value, on each row, and reads the SQL_ALCHEMY_CONN configuration to determine whether the engine supports json or not and parse kwargs accordingly. However, in our case we define SQL_ALCHEMY_CONN_CMD as opposed to SQL_ALCHEMY_CONN, which causes the sync loop to spawn a new subprocess for every row (here).

We workaround it by using SQL_ALCHEMY_CONN instead of SQL_ALCHEMY_CONN_CMD, as it involves reading an environment variable instead of spawning a new subprocess.

What you think should happen instead

The triggerer model caches caches either the SQL_ALCHEMY_CONN or the db_supports_json property.

How to reproduce

Simultaneously kick off 100s of DAGs with at least a few deferrable operators each and use SQL_ALCHEMY_CONN_CMD instead of SQL_ALCHEMY_CONN

Operating System

Debian GNU/Linux 11 (bullseye)

Versions of Apache Airflow Providers

apache-airflow-providers-airbyte==3.2.0
apache-airflow-providers-alibaba==2.2.0
apache-airflow-providers-amazon==7.3.0
apache-airflow-providers-apache-beam==4.3.0
apache-airflow-providers-apache-cassandra==3.1.1
apache-airflow-providers-apache-drill==2.3.1
apache-airflow-providers-apache-druid==3.3.1
apache-airflow-providers-apache-hdfs==3.2.0
apache-airflow-providers-apache-hive==5.1.3
apache-airflow-providers-apache-kylin==3.1.0
apache-airflow-providers-apache-livy==3.3.0
apache-airflow-providers-apache-pig==4.0.0
apache-airflow-providers-apache-pinot==4.0.1
apache-airflow-providers-apache-spark==4.0.0
apache-airflow-providers-apache-sqoop==3.1.1
apache-airflow-providers-arangodb==2.1.1
apache-airflow-providers-asana==2.1.0
apache-airflow-providers-atlassian-jira==2.0.1
apache-airflow-providers-celery==3.1.0
apache-airflow-providers-cloudant==3.1.0
apache-airflow-providers-cncf-kubernetes==5.2.2
apache-airflow-providers-common-sql==1.3.4
apache-airflow-providers-databricks==4.0.0
apache-airflow-providers-datadog==3.1.0
apache-airflow-providers-dbt-cloud==3.1.0
apache-airflow-providers-dingding==3.1.0
apache-airflow-providers-discord==3.1.0
apache-airflow-providers-docker==3.5.1
apache-airflow-providers-elasticsearch==4.4.0
apache-airflow-providers-exasol==4.1.3
apache-airflow-providers-facebook==3.1.0
apache-airflow-providers-ftp==3.3.1
apache-airflow-providers-github==2.2.1
apache-airflow-providers-google==8.11.0
apache-airflow-providers-grpc==3.1.0
apache-airflow-providers-hashicorp==3.3.0
apache-airflow-providers-http==4.2.0
apache-airflow-providers-imap==3.1.1
apache-airflow-providers-influxdb==2.1.0
apache-airflow-providers-jdbc==3.3.0
apache-airflow-providers-jenkins==3.2.0
apache-airflow-providers-microsoft-azure==5.2.1
apache-airflow-providers-microsoft-mssql==3.3.2
apache-airflow-providers-microsoft-psrp==2.2.0
apache-airflow-providers-microsoft-winrm==3.1.1
apache-airflow-providers-mongo==3.1.1
apache-airflow-providers-mysql==4.0.2
apache-airflow-providers-neo4j==3.2.1
apache-airflow-providers-odbc==3.2.1
apache-airflow-providers-openfaas==3.1.0
apache-airflow-providers-opsgenie==5.0.0
apache-airflow-providers-oracle==3.6.0
apache-airflow-providers-pagerduty==3.1.0
apache-airflow-providers-papermill==3.1.1
apache-airflow-providers-plexus==3.1.0
apache-airflow-providers-postgres==5.4.0
apache-airflow-providers-presto==4.2.2
apache-airflow-providers-qubole==3.3.1
apache-airflow-providers-redis==3.1.0
apache-airflow-providers-salesforce==5.3.0
apache-airflow-providers-samba==4.1.0
apache-airflow-providers-segment==3.1.0
apache-airflow-providers-sendgrid==3.1.0
apache-airflow-providers-sftp==4.2.4
apache-airflow-providers-singularity==3.1.0
apache-airflow-providers-slack==7.2.0
apache-airflow-providers-snowflake==4.0.4
apache-airflow-providers-sqlite==3.3.1
apache-airflow-providers-ssh==3.5.0
apache-airflow-providers-tableau==4.1.0
apache-airflow-providers-tabular==1.1.0
apache-airflow-providers-telegram==4.0.0
apache-airflow-providers-trino==4.3.2
apache-airflow-providers-vertica==3.3.1
apache-airflow-providers-yandex==3.3.0
apache-airflow-providers-zendesk==4.2.0

Deployment

Other 3rd-party Helm chart

Deployment details

Chart based on the official helm chart. Airflow running on Google Kubernetes Engine (GKE) using KubernetesExecutor.

Anything else

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

Code of Conduct

Metadata

Metadata

Assignees

No one assigned

    Labels

    area:corekind:bugThis is a clearly a bugneeds-triagelabel for new issues that we didn't triage yet

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions