Description
When using an operator derived from BaseSQLToGCSOperator with output_format=parquet, the default parquet_row_group_size is 1. This is a surprising default and, in my experience, it leads to very unwanted results: enormous Parquet files, workers running out of memory, and long task durations.
I know the parameter is configurable (see the sketch below), but my point is that the default should be changed to something more usable out of the box.
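As a workaround today, every task has to set the row group size explicitly. A minimal sketch, assuming the Google provider's PostgresToGCSOperator as the concrete subclass and its export_format/parquet_row_group_size parameters; the bucket, SQL, and chosen row group size are illustrative:

```python
# Illustrative workaround: set the row-group size explicitly on a
# BaseSQLToGCSOperator subclass (PostgresToGCSOperator used as an example).
from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator

export_orders = PostgresToGCSOperator(
    task_id="export_orders",
    sql="SELECT * FROM orders",          # hypothetical query
    bucket="my-bucket",                   # hypothetical bucket
    filename="orders/{{ ds }}.parquet",
    export_format="parquet",
    parquet_row_group_size=100_000,       # instead of the default of 1 row per group
)
```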
Use case/motivation
I looked up the default settings of some other Parquet-writing systems. Spark seems to default to 128 MB row groups, DuckDB defaults to 122,880 rows per row group according to its docs, and Polars uses a default of 512^2 (262,144) rows.
Considering this, and the unwanted effects I saw with 1 row per row group, I think the default should be changed. However, I'm not sure what a good default for this Airflow operator would be instead.
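To make the effect concrete, here is a small, self-contained sketch (plain pyarrow, not the Airflow operator) that writes the same table with row_group_size=1 versus a larger value and compares the resulting row-group count and file size. The table contents and sizes are illustrative, not taken from the operator's actual output:

```python
# Demonstrates why 1 row per row group is pathological: every row gets its own
# row group, so per-group metadata and page headers dominate the file.
import os
import pyarrow as pa
import pyarrow.parquet as pq

n_rows = 10_000
table = pa.table({
    "id": list(range(n_rows)),
    "value": [float(i) for i in range(n_rows)],
})

for row_group_size in (1, n_rows):
    path = f"/tmp/demo_rg_{row_group_size}.parquet"
    pq.write_table(table, path, row_group_size=row_group_size)
    meta = pq.ParquetFile(path).metadata
    print(
        f"row_group_size={row_group_size}: "
        f"{meta.num_row_groups} row groups, {os.path.getsize(path)} bytes"
    )
```

With row_group_size=1 the file contains one row group per row, which is where the blown-up file sizes and memory usage come from.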
Related issues
No response
Are you willing to submit a PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct