Change SQL to GCS operators default row group size when output is Parquet #36793

@renzepost

Description

When using an operator derived from BaseSQLToGCSOperator with output_format=parquet, the default parquet_row_group_size is 1, i.e. one row per row group. This seems like a very strange default, and in my experience it leads to some very unwanted results: enormous Parquet files, workers running out of memory, and long task durations.

I know this parameter is configurable, but my point is that this default setting should be changed to something more usable out of the box.

Use case/motivation

I looked up the default settings of some other Parquet-writing systems. Spark seems to default to 128 MB row groups. DuckDB has a default of 122,880 rows per row group according to the docs, and Polars uses a default of 512^2 (262,144) rows.
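To put those numbers side by side, here is a rough sketch of how many row groups a single export would contain under each row-count default. The 1,000,000-row export size and the `row_group_count` helper are made up for illustration; they are not part of Airflow or any of the libraries mentioned.

```python
import math


def row_group_count(total_rows: int, rows_per_group: int) -> int:
    """Number of row groups needed to hold total_rows at a fixed group size."""
    if rows_per_group < 1:
        raise ValueError("rows_per_group must be at least 1")
    return math.ceil(total_rows / rows_per_group)


# Hypothetical export size, chosen only for illustration.
TOTAL_ROWS = 1_000_000

# Row-count defaults quoted in this issue (Spark is omitted because its
# default is expressed in bytes, not rows).
candidates = {
    "current default (1 row)": 1,
    "DuckDB default (122,880 rows)": 122_880,
    "Polars default (512^2 rows)": 512 ** 2,
}

counts = {
    label: row_group_count(TOTAL_ROWS, size)
    for label, size in candidates.items()
}

for label, n in counts.items():
    print(f"{label}: {n:,} row groups")
```

Since every row group carries its own metadata and column chunks, going from a handful of row groups to one per row would plausibly explain the file size and memory problems described above, though I have not measured the breakdown.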

Considering this and the unwanted effects I noticed with 1 row per row group, I think the default should be changed. However, I'm not sure what a good default would be for this Airflow operator.

Related issues

No response

Are you willing to submit a PR?

  • Yes I am willing to submit a PR!
