Skip to content

[C++][Python] Clarify meaning of "row_group_size" and change default to something more reasonable #34280

@westonpace

Description

@westonpace

Describe the enhancement requested

The row_group_size property in pyarrow.parquet.write_table is described as:

Maximum size of each written row group. If None, the row group size will be the minimum of the Table size and 64 * 1024 * 1024.

This limit is in # of rows but that is not obvious from the description. Furthermore, 64Mi is an extremely high limit for row group size. I believe it is perhaps based on the Java implementation, however the Java implementation treats this number as "bytes" and not "rows" (64MiB row groups is very reasonable).

Perhaps the best solution would be to add support for a limit in terms of bytes. In the meantime, I think we should lower the default limit to 1Mi rows.

Component(s)

C++

Metadata

Metadata

Assignees

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions