Concatenate parquet files without deserializing? #1711

@wjones127

Description

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

This is a random idea, but it seems like it would be valuable to be able to concatenate Parquet files without deserializing to Arrow and re-serializing back to Parquet. I'm not 100% sure that it's possible, but in theory you should be able to just copy the row group buffers verbatim and then update the offsets within the row group metadata in the footer.

You can only do this if the schemas match, of course.
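To illustrate that precondition: a check along these lines would only touch the file footers, never the row data. This is a minimal sketch using the `parquet` crate's existing reader API; the helper name and file paths are hypothetical:

```rust
use std::fs::File;

use parquet::errors::{ParquetError, Result};
use parquet::file::reader::{FileReader, SerializedFileReader};

/// Verify that every file has an identical Parquet schema.
/// Only footers are parsed; no row group data is read.
fn check_schemas_match(paths: &[&str]) -> Result<()> {
    let mut expected = None;
    for path in paths {
        let reader = SerializedFileReader::new(File::open(path)?)?;
        // The schema lives in the footer metadata.
        let schema = reader.metadata().file_metadata().schema().clone();
        match &expected {
            None => expected = Some(schema),
            Some(e) if *e == schema => {}
            Some(_) => {
                return Err(ParquetError::General(format!(
                    "schema mismatch in {path}; files cannot be concatenated"
                )))
            }
        }
    }
    Ok(())
}
```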

Describe the solution you'd like

If this is indeed possible, it could be exposed as a function something like this (apologies, my Rust interface design isn't great yet):

```rust
fn merge_files<R: ChunkReader>(
    readers: Vec<SerializedFileReader<R>>,
    writer: impl FileWriter,
) -> Result<()>;
```
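To sketch what an implementation might do internally: copy each compressed column chunk byte-for-byte and rebuild only the footer metadata with updated offsets. This assumes a `parquet` crate version that exposes `SerializedRowGroupWriter::append_column` and a public `ColumnCloseResult`; the function name and exact struct fields here are illustrative, not a confirmed API:

```rust
use std::fs::File;
use std::sync::Arc;

use parquet::column::writer::ColumnCloseResult;
use parquet::errors::Result;
use parquet::file::footer::parse_metadata;
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;

/// Copy every row group of `inputs` into `output` without decoding values.
/// Assumes a non-empty input list whose schemas were already checked.
fn concat_files(inputs: Vec<File>, output: File) -> Result<()> {
    // Parse only the footers; row data stays compressed on disk.
    let inputs = inputs
        .into_iter()
        .map(|f| {
            let metadata = parse_metadata(&f)?;
            Ok((f, metadata))
        })
        .collect::<Result<Vec<_>>>()?;

    let schema = inputs[0].1.file_metadata().schema_descr().root_schema_ptr();
    let props = Arc::new(WriterProperties::builder().build());
    let mut writer = SerializedFileWriter::new(output, schema, props)?;

    for (input, metadata) in inputs {
        for rg in metadata.row_groups() {
            let mut rg_out = writer.next_row_group()?;
            for column in rg.columns() {
                // Describe the existing chunk so the writer can splice its
                // raw bytes and fix up the offsets in the new footer.
                let close = ColumnCloseResult {
                    bytes_written: column.compressed_size() as _,
                    rows_written: rg.num_rows() as _,
                    metadata: column.clone(),
                    bloom_filter: None,
                    column_index: None,
                    offset_index: None,
                };
                rg_out.append_column(&input, close)?;
            }
            rg_out.close()?;
        }
    }
    writer.close()?;
    Ok(())
}
```

Note this deliberately keeps the original row group boundaries; it concatenates files but does not coalesce small row groups into larger ones.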

Describe alternatives you've considered

The obvious alternative is to simply read as Arrow, concatenate, and then serialize back, but reading and writing Parquet is famously compute-intensive, so it would be nice if we could avoid that.
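For comparison, that baseline might look like the following. This is a sketch using the crate's Arrow APIs (`ParquetRecordBatchReaderBuilder` and `ArrowWriter`); the function name is hypothetical, and it assumes a non-empty, schema-checked input list:

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

/// Baseline: decode every input to Arrow RecordBatches and re-encode.
/// Correct, but pays full decompression, decoding, encoding, and
/// recompression costs on every value.
fn merge_via_arrow(paths: &[&str], output: File) -> Result<()> {
    // Take the Arrow schema from the first file.
    let first = ParquetRecordBatchReaderBuilder::try_new(File::open(paths[0])?)?;
    let mut writer = ArrowWriter::try_new(output, first.schema().clone(), None)?;

    for path in paths {
        let reader =
            ParquetRecordBatchReaderBuilder::try_new(File::open(path)?)?.build()?;
        for batch in reader {
            writer.write(&batch?)?;
        }
    }
    writer.close()?;
    Ok(())
}
```

One upside of this approach is that it can also rechunk the data into fewer, larger row groups, which the raw byte-copy approach cannot.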

Additional context

Concatenating parquet files is a common operation in Delta Lake tables, which may initially write out many small files that later need to be merged for better read performance. See delta-io/delta-rs#98.
