DeltaBitPackEncoder Pads Miniblock BitWidths With Arbitrary Values #1416

@tustvold

Describe the bug

The code at https://github.com/apache/arrow-rs/blob/master/parquet/src/encodings/encoding.rs#L577 skips over the miniblock bit widths, then goes back and writes a value only for the miniblocks that contain at least one value. The bit widths of the empty miniblocks are left with whatever value happens to be in the encoder's buffer.

To Reproduce

This is one of the underlying bugs behind apache/datafusion#1976.

Expected behavior

Whilst the specification technically allows arbitrary padding, it seems like a good idea to avoid non-deterministic output where possible.
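
To illustrate the difference, here is a minimal sketch. It is not the actual `DeltaBitPackEncoder` code: `BlockHeader`, `write_widths_stale`, `write_widths_zeroed`, and the choice of 4 miniblocks per block are all hypothetical names and values chosen for the example. It models a reused scratch area for the bit widths and shows how skipping the empty miniblocks leaks values from the previous block, whereas zeroing first gives the same bytes every time.

```rust
// Hypothetical stand-in for the encoder's reused header space; the real
// arrow-rs encoder is structured differently.
const MINIBLOCKS_PER_BLOCK: usize = 4; // assumed value for the example

struct BlockHeader {
    bit_widths: [u8; MINIBLOCKS_PER_BLOCK],
}

impl BlockHeader {
    fn new() -> Self {
        Self { bit_widths: [0; MINIBLOCKS_PER_BLOCK] }
    }

    /// Models the reported behaviour: only the slots for non-empty
    /// miniblocks are overwritten, so the rest keep their old contents.
    /// `used` must hold at most MINIBLOCKS_PER_BLOCK entries.
    fn write_widths_stale(&mut self, used: &[u8]) -> [u8; MINIBLOCKS_PER_BLOCK] {
        self.bit_widths[..used.len()].copy_from_slice(used);
        self.bit_widths
    }

    /// Deterministic variant: clear every slot before writing, so empty
    /// miniblocks always report a bit width of 0.
    fn write_widths_zeroed(&mut self, used: &[u8]) -> [u8; MINIBLOCKS_PER_BLOCK] {
        self.bit_widths.fill(0);
        self.bit_widths[..used.len()].copy_from_slice(used);
        self.bit_widths
    }
}

fn main() {
    // A full block followed by a final block with one non-empty miniblock.
    let mut header = BlockHeader::new();
    header.write_widths_stale(&[7, 5, 3, 2]);
    let stale = header.write_widths_stale(&[4]);
    println!("{:?}", stale); // [4, 5, 3, 2]: trailing widths leak from the previous block

    let mut header = BlockHeader::new();
    header.write_widths_zeroed(&[7, 5, 3, 2]);
    let zeroed = header.write_widths_zeroed(&[4]);
    println!("{:?}", zeroed); // [4, 0, 0, 0]
}
```

Both outputs decode to the same values, since a reader only consumes the miniblocks that hold data, but only the zeroed variant yields byte-identical files for identical input.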

Labels: bug, parquet (Changes to the parquet crate)
