10 changes: 10 additions & 0 deletions airflow/providers/google/CHANGELOG.rst
@@ -27,6 +27,16 @@
Changelog
---------

.. note::
  The default value of ``parquet_row_group_size`` in ``BaseSQLToGCSOperator`` has changed from 1 to
  100000, so that the default provides better compression efficiency and read performance for the
  output Parquet files. In many cases, the previous value of 1 resulted in very large files, long
  task durations and out-of-memory issues. The new default of 100000 may require more memory to
  execute the operator, in which case users can override the ``parquet_row_group_size`` parameter
  in the operator. All operators derived from ``BaseSQLToGCSOperator`` are affected when
  ``export_format`` is ``parquet``: ``MySQLToGCSOperator``, ``PrestoToGCSOperator``,
  ``OracleToGCSOperator``, ``TrinoToGCSOperator``, ``MSSQLToGCSOperator`` and ``PostgresToGCSOperator``.
  Because of the above, we treat this change as a bug fix.
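
  For example, a minimal sketch of lowering the row group size on one of the affected operators
  (the task ID, SQL, and bucket below are illustrative placeholders, not part of this change)::

      from airflow.providers.google.cloud.transfers.postgres_to_gcs import PostgresToGCSOperator

      export_orders = PostgresToGCSOperator(
          task_id="export_orders",          # placeholder task ID
          sql="SELECT * FROM orders",       # placeholder query
          bucket="my-export-bucket",        # placeholder bucket name
          filename="orders/part-{}.parquet",
          export_format="parquet",
          # Override the new default (100000) if the worker runs out of memory.
          parquet_row_group_size=10000,
      )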

10.13.1
.......

4 changes: 2 additions & 2 deletions airflow/providers/google/cloud/transfers/sql_to_gcs.py
@@ -85,7 +85,7 @@ class BaseSQLToGCSOperator(BaseOperator):
:param parquet_row_group_size: The approximate number of rows in each row group
when using parquet format. Using a large row group size can reduce the file size
and improve the performance of reading the data, but it needs more memory to
-        execute the operator. (default: 1)
+        execute the operator. (default: 100000)
"""

template_fields: Sequence[str] = (
@@ -123,7 +123,7 @@ def __init__(
exclude_columns: set | None = None,
partition_columns: list | None = None,
write_on_empty: bool = False,
-        parquet_row_group_size: int = 1,
+        parquet_row_group_size: int = 100000,
**kwargs,
) -> None:
super().__init__(**kwargs)
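A minimal sketch of why the row group size matters, using pyarrow directly (illustrative only;
an assumption about the underlying mechanism, not the operator's actual writer code)::

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(1_000_000))})

    # With row_group_size=1, the writer emits one row group per row, bloating
    # file metadata and slowing reads. With 100000 rows per group, this table
    # is written as roughly ten row groups.
    pq.write_table(table, "/tmp/example.parquet", row_group_size=100_000)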