Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A common request is to be able to combine parquet files together without re-encoding data (#557) (#4150). However, correctly translating the metadata is non-trivial, and requires care to ensure the relevant file offsets are correctly updated.
Describe the solution you'd like
I would like an API on SerializedRowGroupWriter that lets me append an existing ColumnChunk from another source. For example,
```rust
/// Splice a column from another file without decoding it
///
/// This can be used for efficiently concatenating or projecting parquet data
pub fn splice_column<R: ChunkReader>(&mut self, reader: &R, metadata: &ColumnChunkMetaData) -> Result<()> {
```
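To make the intended usage concrete, here is a rough sketch of how such an API might be used to concatenate several files without re-encoding. Everything except `splice_column` uses existing `parquet` crate APIs (`SerializedFileReader`, `SerializedFileWriter`); `splice_column` itself is the hypothetical method proposed above, so this is illustrative only.

```rust
use std::{fs::File, sync::Arc};

use parquet::file::{
    properties::WriterProperties,
    reader::{FileReader, SerializedFileReader},
    writer::SerializedFileWriter,
};

/// Concatenate parquet files that share a schema, copying the compressed
/// column chunks directly instead of decoding and re-encoding them
fn concat_parquet(inputs: &[&str], output: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Reuse the schema of the first input for the combined file
    let first = SerializedFileReader::new(File::open(inputs[0])?)?;
    let schema = first
        .metadata()
        .file_metadata()
        .schema_descr()
        .root_schema_ptr();

    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(File::create(output)?, schema, props)?;

    for path in inputs {
        // One handle to read the footer metadata, one to copy raw bytes from
        let reader = SerializedFileReader::new(File::open(path)?)?;
        let file = File::open(path)?;

        for rg in reader.metadata().row_groups() {
            let mut rg_writer = writer.next_row_group()?;
            for column in rg.columns() {
                // Hypothetical API from this proposal: copy the column chunk's
                // bytes via the ChunkReader and rewrite the metadata offsets
                rg_writer.splice_column(&file, column)?;
            }
            rg_writer.close()?;
        }
    }

    writer.close()?;
    Ok(())
}
```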
I originally debated making the signature
```rust
pub fn splice_column(&mut self, column: &dyn PageReader) -> Result<()> {
```
But this runs into a couple of problems:
- The PageReader returns uncompressed, decoded pages (although the value data is still encoded)
- It isn't clear how to preserve the page index or any bloom filter information
I also debated allowing pages to be appended individually; however, in addition to the problems above, this runs into:
- A column chunk can only have a single dictionary page
- The compression codec is specified at the column chunk level
The downside of the ChunkReader API is that someone could potentially pass a reader that doesn't match the ColumnChunkMetaData, which would result in an inconsistent parquet file. I'm inclined to think this isn't a problem, as there are plenty of other ways to generate an invalid "parquet" file 😅
Describe alternatives you've considered
Additional context