High Memory Usage and Temp File Name Leak in SqlToS3Operator #50159

@aditya-emmanuel

Description

Apache Airflow Provider(s)

amazon

Versions of Apache Airflow Providers

apache-airflow-providers-amazon==8.28.0

Apache Airflow version

2.10.3

Operating System

macOS

Deployment

Astronomer

Deployment details

No response

What happened

While using the SqlToS3Operator, I've noticed that memory usage on the worker often exceeds expectations for large datasets. For instance, for a SQL query that results in a ~5 GB .txt export, the task consumes approximately 10 GB of memory during execution.

I believe this is because the operator first loads the entire query result into a Pandas DataFrame (in memory), writes it to a local temporary file, and then uploads that file to S3.
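To make the doubling concrete, here is a rough, hypothetical paraphrase of that flow (see the code reference under "Anything else" for the real implementation); the function and hook names below are illustrative, not the operator's exact code:

import tempfile

def _execute_sketch(sql_hook, s3_hook, query, s3_bucket, s3_key, pd_kwargs):
    # 1. The full result set is materialized in memory as a DataFrame.
    df = sql_hook.get_pandas_df(query)
    # 2. The DataFrame is serialized into a temporary file on local disk;
    #    with compression="gzip", pandas embeds tmp.name in the gzip header.
    with tempfile.NamedTemporaryFile(suffix=".csv") as tmp:
        df.to_csv(tmp.name, **pd_kwargs)
        # 3. Only then is the file uploaded to S3.
        s3_hook.load_file(tmp.name, key=s3_key, bucket_name=s3_bucket, replace=True)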

Temporary File Name Leakage in .gz Files

Additionally, when exporting to .gz files (e.g., example.txt.gz) and opening or unzipping the archive on Windows with Windows Explorer, WinRAR, or 7-Zip, the listing reveals the temporary filename that was generated internally by Python's NamedTemporaryFile().

Note: the temporary name is embedded in the file on every platform, in the gzip header's FNAME field; Windows archive tools surface it, while macOS and Linux tools (e.g., gunzip without -N) ignore it by default, which is why the issue only appears to occur on Windows.
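To confirm the leak without Windows tooling, here is a minimal sketch that reads the FNAME field directly from the gzip header (it assumes no FEXTRA field precedes FNAME, which holds for files written by Python's gzip module):

def gzip_embedded_filename(path: str) -> str | None:
    """Return the original-filename (FNAME) field stored in a gzip header, if any."""
    with open(path, "rb") as f:
        header = f.read(10)                 # fixed 10-byte gzip header
        if header[:2] != b"\x1f\x8b":
            raise ValueError("not a gzip file")
        if not header[3] & 0x08:            # FNAME flag not set: nothing embedded
            return None
        name = bytearray()
        while (byte := f.read(1)) not in (b"", b"\x00"):
            name += byte
        return name.decode("latin-1")       # RFC 1952: FNAME is ISO 8859-1

# e.g. gzip_embedded_filename("example.txt.gz") returns the leaked temp name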

[Screenshot: archive listing on Windows showing the leaked internal temporary filename]

What you think should happen instead

  • Support direct upload to S3, e.g., via multipart upload or file-like buffers, to avoid writing the full result to local disk

  • Control the internal filename for compressed outputs so temporary names are not leaked (see the sketch below)
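
As a sketch of both suggestions combined, assuming it is acceptable to hold the compressed bytes in a memory buffer (the helper name here is hypothetical; S3Hook.load_file_obj and gzip.GzipFile's filename argument are real APIs):

import gzip
import io

import pandas as pd
from airflow.providers.amazon.aws.hooks.s3 import S3Hook

def upload_df_as_gzip(df: pd.DataFrame, bucket: str, key: str, aws_conn_id: str) -> None:
    buffer = io.BytesIO()
    # filename="" leaves the gzip FNAME header field empty, so no temp name leaks
    with gzip.GzipFile(fileobj=buffer, mode="wb", filename="") as gz:
        gz.write(df.to_csv(sep="|", index=False).encode("utf-8"))
    buffer.seek(0)
    # Upload the buffer directly; nothing is written to local disk
    S3Hook(aws_conn_id=aws_conn_id).load_file_obj(buffer, key=key, bucket_name=bucket, replace=True)

This still buffers the compressed output in memory; eliminating the memory spike entirely would additionally require reading the query result in chunks and using S3 multipart upload.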

How to reproduce

DAG Snippet

import csv
from datetime import timedelta

from airflow.providers.amazon.aws.transfers.sql_to_s3 import SqlToS3Operator

extract_operator = SqlToS3Operator(
    task_id="extract_example_data",
    retries=3,
    retry_delay=timedelta(minutes=1),
    retry_exponential_backoff=True,
    sql_conn_id="redshift_conn",
    query="SELECT * FROM example;",
    s3_bucket="{{ var.value.S3_BUCKET }}",
    s3_key="example.txt.gz",
    aws_conn_id="aws_datalake_conn",
    file_format="csv",
    replace=True,
    pd_kwargs={
        "sep": "|",
        "index": False,
        "quoting": csv.QUOTE_MINIMAL,
        "compression": "gzip",
    },
)

Anything else

Code reference
https://github.com/apache/airflow/blob/main/providers/amazon/src/airflow/providers/amazon/aws/transfers/sql_to_s3.py#L195-L203

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
