Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
We now have two different APIs for writing row groups in parallel, depending on encryption, and I would like to simplify the code to use just one.
The current example for writing row groups in parallel uses get_column_writers, which does not support encryption.
@rok and @adamreeve added a new API based on ArrowRowGroupWriterFactory for encoding parquet columns and row groups in parallel, with encryption in
This API is also somewhat strange in that it makes users create an ArrowWriter only to immediately destructure it into the underlying SerializedFileWriter.
The reason we need to expose ArrowRowGroupWriterFactory is that ArrowRowGroupWriterFactory::create_column_writers carries the appropriate encryption properties, whereas get_column_writers does not.
Describe the solution you'd like
I would like a single easy to use API for writing in parallel that:
- Is the same whether or not encryption is used
- Has clear examples
Describe alternatives you've considered
I suggest:
- Make the constructors for ArrowRowGroupWriterFactory public
- Update the example to use ArrowRowGroupWriterFactory and its ArrowRowGroupWriterFactory::create_column_writers function
- Deprecate the existing get_column_writers function, directing people to ArrowRowGroupWriterFactory
- Deprecate ArrowWriter::into_serialized_writer, directing people to ArrowRowGroupWriterFactory
Additional context