Description
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
A common request is to be able to combine parquet files together without re-encoding data (#557) (#4150). However, correctly translating the metadata is non-trivial, and requires care to ensure the relevant file offsets are correctly updated.
Describe the solution you'd like
I would like an API on SerializedRowGroupWriter that lets me append an existing ColumnChunk from another source. For example,
```rust
/// Splice a column from another file without decoding it
///
/// This can be used for efficiently concatenating or projecting parquet data
pub fn splice_column<R: ChunkReader>(&mut self, reader: &R, metadata: &ColumnChunkMetaData) -> Result<()> {
```
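To make the intended usage concrete, here is a rough sketch of how such an API might be used to concatenate several files without re-encoding. Everything except `splice_column` uses existing `parquet` crate APIs (`SerializedFileReader`, `SerializedFileWriter`); `splice_column` itself is the hypothetical method proposed above, so this is illustrative only.

```rust
use std::{fs::File, sync::Arc};

use parquet::file::{
    properties::WriterProperties,
    reader::{FileReader, SerializedFileReader},
    writer::SerializedFileWriter,
};

/// Concatenate parquet files that share a schema, copying the compressed
/// column chunks directly instead of decoding and re-encoding them
fn concat_parquet(inputs: &[&str], output: &str) -> Result<(), Box<dyn std::error::Error>> {
    // Reuse the schema of the first input for the combined file
    let first = SerializedFileReader::new(File::open(inputs[0])?)?;
    let schema = first
        .metadata()
        .file_metadata()
        .schema_descr()
        .root_schema_ptr();

    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(File::create(output)?, schema, props)?;

    for path in inputs {
        // One handle to read the footer metadata, one to copy raw bytes from
        let reader = SerializedFileReader::new(File::open(path)?)?;
        let file = File::open(path)?;

        for rg in reader.metadata().row_groups() {
            let mut rg_writer = writer.next_row_group()?;
            for column in rg.columns() {
                // Hypothetical API from this proposal: copy the column chunk's
                // bytes via the ChunkReader and rewrite the metadata offsets
                rg_writer.splice_column(&file, column)?;
            }
            rg_writer.close()?;
        }
    }

    writer.close()?;
    Ok(())
}
```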
I originally debated making the signature
```rust
pub fn splice_column(&mut self, column: &dyn PageReader) -> Result<()> {
```
But this runs into a couple of problems:
- The PageReader returns uncompressed, decoded pages (although the value data is still encoded)
- It isn't clear how to preserve the page index or any bloom filter information
I also debated allowing pages to be appended individually; however, in addition to the problems above, this runs into:
- A column chunk can only have a single dictionary page
- The compression codec is specified at the column chunk level
The downside of the ChunkReader API is that someone could potentially pass a reader that doesn't match the ColumnChunkMetaData, which would result in an inconsistent parquet file. I'm inclined to think this isn't a problem, as there are plenty of other ways to generate an invalid "parquet" file 😅
Describe alternatives you've considered
Additional context