Unify API for writing column chunks / row groups in parallel #8389

@alamb

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

We now have two different APIs for writing row groups in parallel, depending on encryption, and I would like to simplify the code to use just one.

The current example for writing row groups in parallel uses get_column_writers, which does not support encryption.

@rok and @adamreeve added a new API based on ArrowRowGroupWriterFactory for encoding parquet columns and row groups in parallel, with encryption in

This API is also somewhat strange in that it makes users create an ArrowWriter only to immediately destructure it into a SerializedWriter / the underlying writer.

The reason we need to expose ArrowRowGroupWriterFactory is that ArrowRowGroupWriterFactory::create_column_writers carries the appropriate encryption properties, whereas get_column_writers does not.

Describe the solution you'd like
I would like a single, easy-to-use API for writing in parallel that:

  1. Is the same with and without encryption
  2. Has clear examples

Describe alternatives you've considered
I suggest:

  1. Make the constructors for ArrowRowGroupWriterFactory public
  2. Update the example to use ArrowRowGroupWriterFactory and its ArrowRowGroupWriterFactory::create_column_writers function
  3. Deprecate the existing get_column_writers function, directing people to ArrowRowGroupWriterFactory
  4. Deprecate ArrowWriter::into_serialized_writer, directing people to ArrowRowGroupWriterFactory
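
To make the shape of the suggestion concrete, here is a minimal std-only sketch of a single factory-based path for encoding columns in parallel. It does not use the parquet crate: `RowGroupWriterFactory`, `ColumnWriter`, `EncryptionConfig`, `EncodedChunk` and `encode_row_group` are all hypothetical stand-ins, and the XOR "cipher" is a toy placeholder rather than real encryption. The only point it illustrates is that when the factory carries the optional encryption settings, calling code looks the same with and without encryption.

```rust
use std::thread;

/// Hypothetical stand-in for encryption settings (not a parquet crate type).
#[derive(Clone, Debug, PartialEq)]
struct EncryptionConfig {
    key: u8,
}

/// Hypothetical stand-in for an encoded column chunk.
#[derive(Debug, PartialEq)]
struct EncodedChunk {
    column: usize,
    bytes: Vec<u8>,
    encrypted: bool,
}

/// Hypothetical stand-in for a per-column writer.
struct ColumnWriter {
    column: usize,
    encryption: Option<EncryptionConfig>,
    buffer: Vec<u8>,
}

impl ColumnWriter {
    /// Buffer the raw values for this column.
    fn write(&mut self, values: &[u8]) {
        self.buffer.extend_from_slice(values);
    }

    /// Finish the column, applying the toy cipher if encryption is configured.
    fn close(self) -> EncodedChunk {
        let ColumnWriter { column, encryption, buffer } = self;
        let encrypted = encryption.is_some();
        let bytes = match encryption {
            // Toy XOR transform purely for illustration; NOT real encryption.
            Some(cfg) => buffer.iter().map(|b| b ^ cfg.key).collect(),
            None => buffer,
        };
        EncodedChunk { column, bytes, encrypted }
    }
}

/// Hypothetical factory that owns the (optional) encryption settings,
/// so callers never branch on whether encryption is enabled.
struct RowGroupWriterFactory {
    num_columns: usize,
    encryption: Option<EncryptionConfig>,
}

impl RowGroupWriterFactory {
    fn new(num_columns: usize, encryption: Option<EncryptionConfig>) -> Self {
        Self { num_columns, encryption }
    }

    /// One writer per column, each carrying the factory's encryption settings.
    fn create_column_writers(&self) -> Vec<ColumnWriter> {
        (0..self.num_columns)
            .map(|column| ColumnWriter {
                column,
                encryption: self.encryption.clone(),
                buffer: Vec::new(),
            })
            .collect()
    }
}

/// Encode every column of one row group on its own thread.
/// Chunks are returned in column order.
fn encode_row_group(factory: &RowGroupWriterFactory, columns: Vec<Vec<u8>>) -> Vec<EncodedChunk> {
    assert_eq!(columns.len(), factory.num_columns, "one input per column");
    let handles: Vec<_> = factory
        .create_column_writers()
        .into_iter()
        .zip(columns)
        .map(|(mut writer, values)| {
            thread::spawn(move || {
                writer.write(&values);
                writer.close()
            })
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("column encoding thread panicked"))
        .collect()
}
```

In this sketch, `encode_row_group(&RowGroupWriterFactory::new(2, None), columns)` and the same call with `Some(EncryptionConfig { key: 0xFF })` go through identical code, which is the property requested above.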

Additional context

Metadata

Labels

enhancement (any new improvement worthy of an entry in the changelog), parquet (changes to the parquet crate)
