Splice Parquet Data #4155

@tustvold

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

A common request is to be able to combine parquet files together without re-encoding data (#557) (#4150). However, correctly translating the metadata is non-trivial, and requires care to ensure the relevant file offsets are correctly updated.
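To make the offset problem concrete, here is a minimal sketch of copying a chunk's bytes verbatim and rebasing its offsets. It deliberately avoids the parquet crate: `ChunkOffsets`, `rebased` and `splice_bytes` are hypothetical names invented for illustration, and the real metadata carries further offsets (for example for the page index and bloom filter) that this ignores.

```rust
use std::io::{Error, ErrorKind, Read, Seek, SeekFrom, Write};

/// Hypothetical, simplified stand-in for the offsets stored in column chunk
/// metadata. This is NOT the parquet crate's `ColumnChunkMetaData`.
#[derive(Debug, Clone, PartialEq)]
struct ChunkOffsets {
    dictionary_page_offset: Option<u64>,
    data_page_offset: u64,
    compressed_size: u64,
}

impl ChunkOffsets {
    /// First byte of the chunk: the dictionary page if present,
    /// otherwise the first data page.
    fn start(&self) -> u64 {
        self.dictionary_page_offset.unwrap_or(self.data_page_offset)
    }

    /// Copy of the offsets shifted so that the chunk starts at `new_start`.
    /// Assumes the dictionary page, if any, precedes the data pages.
    fn rebased(&self, new_start: u64) -> ChunkOffsets {
        let old_start = self.start();
        let shift = |offset: u64| offset - old_start + new_start;
        ChunkOffsets {
            dictionary_page_offset: self.dictionary_page_offset.map(shift),
            data_page_offset: shift(self.data_page_offset),
            compressed_size: self.compressed_size,
        }
    }
}

/// Copy the chunk's bytes from `src` to `dst` without decoding them, and
/// return the offsets as they apply in the destination file.
/// `dst_position` is the number of bytes already written to `dst`.
fn splice_bytes<R: Read + Seek, W: Write>(
    src: &mut R,
    dst: &mut W,
    dst_position: u64,
    offsets: &ChunkOffsets,
) -> std::io::Result<ChunkOffsets> {
    src.seek(SeekFrom::Start(offsets.start()))?;
    let mut limited = src.by_ref().take(offsets.compressed_size);
    let copied = std::io::copy(&mut limited, dst)?;
    if copied != offsets.compressed_size {
        return Err(Error::new(
            ErrorKind::UnexpectedEof,
            "source is shorter than the metadata claims",
        ));
    }
    Ok(offsets.rebased(dst_position))
}
```

The bytes are never interpreted, which is what makes splicing cheap; only the absolute positions recorded in the metadata change.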

Describe the solution you'd like

I would like an API on SerializedRowGroupWriter that lets me append an existing ColumnChunk from another source. For example:

/// Splice a column from another file without decoding it
///
/// This can be used for efficiently concatenating or projecting parquet data
pub fn splice_column<R: ChunkReader>(&mut self, reader: &R, metadata: &ColumnChunkMetaData) -> Result<()> {

I originally debated making the signature

pub fn splice_column(&mut self, column: &dyn PageReader) -> Result<()> {

But this runs into a couple of problems:

  • The PageReader returns pages that have already been decompressed and had their headers decoded (although the value data is still encoded), so they could not be copied across byte-for-byte
  • It isn't clear how to preserve the page index or any bloom filter information

I also debated allowing pages to be appended individually. However, in addition to the above problems, this runs into:

  • A column chunk can only have a single dictionary page
  • The compression codec is specified at the column chunk level

The downside of the ChunkReader API is that potentially someone could pass a reader that doesn't match the ColumnChunkMetaData, which would result in an inconsistent parquet file. I'm inclined to think this isn't a problem, as there are plenty of other ways to generate an invalid "parquet" file 😅
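For illustration only, the grossest form of such a mismatch (metadata pointing past the end of the source) could be caught with a cheap bounds check. This is a sketch under assumptions: `ClaimedRange`, `RangeError` and `check_range` are hypothetical names, not part of the parquet crate, and passing the check does not show that the reader actually matches the metadata.

```rust
/// Hypothetical, simplified view of the byte range that a column chunk's
/// metadata claims to occupy in the source.
struct ClaimedRange {
    start: u64,
    compressed_size: u64,
}

#[derive(Debug, PartialEq)]
enum RangeError {
    /// `start + compressed_size` does not fit in a u64.
    Overflow,
    /// The claimed range ends after the last byte of the source.
    OutOfBounds { end: u64, source_len: u64 },
}

/// Check that the claimed byte range lies within a source of `source_len` bytes.
fn check_range(claimed: &ClaimedRange, source_len: u64) -> Result<(), RangeError> {
    let end = claimed
        .start
        .checked_add(claimed.compressed_size)
        .ok_or(RangeError::Overflow)?;
    if end > source_len {
        return Err(RangeError::OutOfBounds { end, source_len });
    }
    Ok(())
}
```

A reader of the right length but with different contents would still pass, so this narrows the failure mode rather than removing it.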

Labels

  • enhancement: Any new improvement worthy of an entry in the changelog
  • parquet: Changes to the parquet crate