Description
Apache Airflow Provider(s)
amazon
Versions of Apache Airflow Providers
apache-airflow-providers-amazon==8.28.0
Apache Airflow version
2.10.3
Operating System
macOS
Deployment
Astronomer
Deployment details
No response
What happened
While using the SqlToS3Operator, I've noticed that memory usage on the worker often exceeds expectations for large datasets. For instance, for a SQL query whose result is a ~5 GB .txt export, the task consumes approximately 10 GB of memory during execution.
I believe this is because the operator first loads the entire query result into a pandas DataFrame in memory, writes it to a local temporary file, and then uploads that file to S3.
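For context, a bounded-memory approach could stream the result set in chunks instead of materializing the whole DataFrame at once. A minimal sketch, assuming a DB-API connection `conn`; the function name and parameters are illustrative, not the operator's current code:

```python
import gzip

import pandas as pd


def export_query_chunked(conn, query: str, local_path: str, chunksize: int = 100_000):
    # Peak memory is bounded by one chunk rather than the full result set.
    with gzip.open(local_path, "wt") as f:
        for i, chunk in enumerate(pd.read_sql(query, conn, chunksize=chunksize)):
            # Write the column header only once, for the first chunk.
            chunk.to_csv(f, sep="|", index=False, header=(i == 0))
```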
Temporary File Name Leakage in .gz Files
Additionally, when exporting to .gz files (e.g., example.txt.gz) and opening or unzipping the archive on Windows with Windows Explorer, WinRAR, or 7-Zip, the content reveals the temporary filename that was generated internally by Python's NamedTemporaryFile().
Note: this only surfaces when opening/unzipping the .gz file on Windows; it does not appear on macOS or Linux, presumably because the tools there ignore the embedded filename.
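For what it's worth, the leaked name appears to come from the gzip header's FNAME field: `gzip.GzipFile` falls back to the underlying file's `.name` when no explicit `filename` is given, and Windows archive tools display that field. A minimal sketch (illustrative names) of the behavior and one way to control it:

```python
import gzip
import io
import tempfile

# Current behaviour: GzipFile picks up the temporary file's own path for the
# header FNAME field, so that name travels inside the .gz archive.
with tempfile.NamedTemporaryFile(suffix=".gz") as tmp:
    with gzip.GzipFile(fileobj=tmp, mode="wb") as gz:
        gz.write(b"col_a|col_b\n")

# Possible fix: set the logical name explicitly so the header carries
# "example.txt" instead of the temporary path.
buf = io.BytesIO()
with gzip.GzipFile(filename="example.txt", fileobj=buf, mode="wb") as gz:
    gz.write(b"col_a|col_b\n")
```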
What you think should happen instead
- Support direct upload to S3, via multipart upload or file-like buffers, to avoid writing the full result to disk (a sketch follows this list)
- Control the internal filename for compressed outputs to avoid leaking temporary names
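A minimal sketch of the buffer-based variant, assuming the result already fits in a DataFrame; `S3Hook.load_file_obj` exists in the provider, while the function name, embedded filename, and connection ID are illustrative:

```python
import gzip
import io

import pandas as pd
from airflow.providers.amazon.aws.hooks.s3 import S3Hook


def upload_df_gzipped(df: pd.DataFrame, bucket: str, key: str) -> None:
    buf = io.BytesIO()
    # Writing through GzipFile lets the embedded filename be chosen
    # deliberately instead of inheriting a temporary path.
    with gzip.GzipFile(filename="example.txt", fileobj=buf, mode="wb") as gz:
        gz.write(df.to_csv(sep="|", index=False).encode("utf-8"))
    buf.seek(0)
    # No local file is written; boto3's managed transfer handles multipart
    # upload for large objects under upload_fileobj, which load_file_obj uses.
    S3Hook(aws_conn_id="aws_datalake_conn").load_file_obj(
        buf, key=key, bucket_name=bucket, replace=True
    )
```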
How to reproduce
DAG Snippet
```python
import csv
from datetime import timedelta

from airflow.providers.amazon.aws.transfers.sql_to_s3 import SqlToS3Operator

extract_operator = SqlToS3Operator(
    task_id="extract_example_data",
    sql_conn_id="redshift_conn",
    query="SELECT * FROM example;",
    s3_bucket="{{ var.value.S3_BUCKET }}",
    s3_key="example.txt.gz",
    aws_conn_id="aws_datalake_conn",
    file_format="csv",
    replace=True,
    retries=3,
    retry_delay=timedelta(minutes=1),
    retry_exponential_backoff=True,
    pd_kwargs={
        "sep": "|",
        "index": False,
        "quoting": csv.QUOTE_MINIMAL,
        "compression": "gzip",
    },
)
```
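To confirm the leak without Windows tooling, one can read the gzip header directly. This hypothetical helper just parses the FNAME field per RFC 1952; the file name passed at the bottom is an assumption:

```python
import struct


def gzip_embedded_name(path: str) -> str | None:
    with open(path, "rb") as f:
        # Fixed 10-byte header: magic, method, flags, mtime, xfl, os.
        magic, _method, flags, _mtime, _xfl, _os = struct.unpack("<HBBIBB", f.read(10))
        if magic != 0x8B1F:
            raise ValueError("not a gzip file")
        if flags & 0x04:  # FEXTRA: skip the extra field first
            (xlen,) = struct.unpack("<H", f.read(2))
            f.read(xlen)
        if not flags & 0x08:  # FNAME bit unset: no embedded name
            return None
        name = bytearray()
        while (b := f.read(1)) not in (b"", b"\x00"):  # NUL-terminated name
            name += b
        return name.decode("latin-1")


print(gzip_embedded_name("example.txt.gz"))  # prints the NamedTemporaryFile name
```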
Anything else
Code reference
https://github.com/apache/airflow/blob/main/providers/amazon/src/airflow/providers/amazon/aws/transfers/sql_to_s3.py#L195-L203
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct
