[C++][Parquet] Add support for BYTE_STREAM_SPLIT encoding #42372

@asfimport

From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):

Apache Parquet does not have any encodings suited to floating-point (FP) data, and the available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.

It is possible to apply a simple data transformation named "stream splitting". One variant is "byte stream splitting": for a sequence of N elements, it creates K streams of N bytes each, where K is the number of bytes in the data type (4 for floats, 8 for doubles). Stream k holds the k-th byte of every element.
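As a rough illustration of the transformation, here is a minimal scalar sketch in C++. The names `ByteStreamSplitEncode` and `ByteStreamSplitDecode` are hypothetical helpers made up for this example, not Arrow or Parquet APIs, and the sketch assumes the K streams are stored back to back in a single buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not an Arrow/Parquet API): scatter the bytes of each
// value into K = sizeof(T) streams of N bytes, stored back to back.
template <typename T>
std::vector<std::uint8_t> ByteStreamSplitEncode(const std::vector<T>& values) {
  constexpr std::size_t kNumStreams = sizeof(T);
  const std::size_t n = values.size();
  std::vector<std::uint8_t> out(n * kNumStreams);
  for (std::size_t i = 0; i < n; ++i) {
    std::uint8_t raw[sizeof(T)];
    std::memcpy(raw, &values[i], sizeof(T));  // in-memory bytes of value i
    for (std::size_t k = 0; k < kNumStreams; ++k) {
      out[k * n + i] = raw[k];  // byte k of value i goes to stream k, slot i
    }
  }
  return out;
}

// Hypothetical inverse: gather byte k of value i from stream k, slot i.
template <typename T>
std::vector<T> ByteStreamSplitDecode(const std::vector<std::uint8_t>& encoded) {
  constexpr std::size_t kNumStreams = sizeof(T);
  assert(encoded.size() % kNumStreams == 0);
  const std::size_t n = encoded.size() / kNumStreams;
  std::vector<T> values(n);
  for (std::size_t i = 0; i < n; ++i) {
    std::uint8_t raw[sizeof(T)];
    for (std::size_t k = 0; k < kNumStreams; ++k) {
      raw[k] = encoded[k * n + i];
    }
    std::memcpy(&values[i], raw, sizeof(T));
  }
  return values;
}
```

The output has the same size as the input; the transformation only reorders bytes, so any size reduction comes from the compressor applied afterwards. A production implementation would likely operate on raw buffers and use SIMD rather than this per-byte loop.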

On average, the transformed data compresses significantly better than the original data, and in some cases compression and decompression are also faster.

You can read a more detailed report here:
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view

Apache Arrow can benefit from the reduced storage requirements for FP Parquet column data and from the improvements in decompression speed.

Reporter: Martin Radev / @martinradev
Assignee: Martin Radev / @martinradev

Note: This issue was originally created as PARQUET-1716. Please see the migration documentation for further details.
