[C++][Parquet] Add support for BYTE_STREAM_SPLIT encoding

**From the Parquet issue ( https://issues.apache.org/jira/browse/PARQUET-1622 ):**

Apache Parquet does not have any encodings suitable for floating-point (FP) data, and the available general-purpose compressors (zstd, gzip, etc.) do not handle FP data very well.

It is possible to apply a simple data transformation named "stream splitting". One such transformation is "byte stream splitting", which creates K streams of length N, where K is the number of bytes in the data type (4 for floats, 8 for doubles) and N is the number of elements in the sequence.
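As a rough illustration of the transformation described above, here is a minimal C++ sketch. The names `ByteStreamSplitEncode` and `ByteStreamSplitDecode` are hypothetical helpers made up for this example, not the Arrow or Parquet API, and the sketch assumes the K streams are simply stored back to back in one buffer (stream 0 first). Byte `b` of element `i` lands at offset `b * N + i`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical helper (not the Arrow/Parquet API): scatter byte b of every
// value into stream b. The output holds K contiguous streams of N bytes each,
// where K = sizeof(T) and N = values.size().
template <typename T>
std::vector<std::uint8_t> ByteStreamSplitEncode(const std::vector<T>& values) {
  constexpr std::size_t K = sizeof(T);
  const std::size_t N = values.size();
  std::vector<std::uint8_t> out(K * N);
  for (std::size_t i = 0; i < N; ++i) {
    std::uint8_t raw[K];
    std::memcpy(raw, &values[i], K);  // read the value's bytes in memory order
    for (std::size_t b = 0; b < K; ++b) {
      out[b * N + i] = raw[b];  // byte b of element i goes to stream b
    }
  }
  return out;
}

// Hypothetical inverse: gather one byte from each of the K streams to rebuild
// every value. Assumes the buffer was produced by ByteStreamSplitEncode<T>.
template <typename T>
std::vector<T> ByteStreamSplitDecode(const std::vector<std::uint8_t>& encoded) {
  constexpr std::size_t K = sizeof(T);
  assert(encoded.size() % K == 0);
  const std::size_t N = encoded.size() / K;
  std::vector<T> values(N);
  for (std::size_t i = 0; i < N; ++i) {
    std::uint8_t raw[K];
    for (std::size_t b = 0; b < K; ++b) {
      raw[b] = encoded[b * N + i];
    }
    std::memcpy(&values[i], raw, K);
  }
  return values;
}
```

The transformation is lossless and does not change the size of the data: for a `std::vector<float>` of three elements, `ByteStreamSplitEncode` returns 12 bytes, and `ByteStreamSplitDecode<float>` on that buffer returns the original values. Any size reduction would come from running a compressor over the split buffer afterwards.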

The transformed data compresses significantly better on average than the original data, and in some cases compression and decompression are also faster.

You can read a more detailed report here:
https://drive.google.com/file/d/1wfLQyO2G5nofYFkS7pVbUW0-oJkQqBvv/view

**Apache Arrow can benefit from the reduced storage requirements for floating-point Parquet column data and from the improvements in decompression speed.**

**Reporter**: [Martin Radev](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=martinradev) / @martinradev
**Assignee**: [Martin Radev](https://issues.apache.org/jira/secure/ViewProfile.jspa?name=martinradev) / @martinradev
#### Related issues:
- [[Format] Support lossy compression](https://issues.apache.org/jira/browse/ARROW-6282) (is related to)
#### PRs and other links:
- [GitHub Pull Request #6005](https://github.com/apache/arrow/pull/6005)

<sub>**Note**: *This issue was originally created as [PARQUET-1716](https://issues.apache.org/jira/browse/PARQUET-1716). Please see the [migration documentation](https://issues.apache.org/jira/browse/PARQUET-2502) for further details.*</sub>

[C++][Parquet] Add support for BYTE_STREAM_SPLIT encoding #42372
