Pruning of floating point Parquet columns is incorrect when NaN is present #15812

Open
@etseidl

Describe the bug

As mentioned in #15742 (comment) and discussed in detail in apache/parquet-format#221, DataFusion is over-aggressive in pruning floating point columns when NaN is present. The issue appears with predicates of the form x [gt|lt] literal. Consider a column consisting of [1.0, 0.0, -1.0, NaN, -2.0]: the statistics report a max of 1 and a min of -2, with NaN left out. A query like select * from ... where x > 2 will return no rows, because the chunk is pruned when its max is not greater than 2, even though the NaN row should match.
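To make the failure mode concrete, here is a minimal sketch of the pruning logic described above. It is not DataFusion's implementation: the function names are hypothetical, and it assumes that statistics are computed over non-NaN values only and that NaN compares greater than every other value at the row level (which is what the expected behavior below implies).

```python
import math

def min_max_ignoring_nan(values):
    """Min/max over non-NaN values (assumption: NaN is left out of statistics)."""
    non_nan = [v for v in values if not math.isnan(v)]
    return (min(non_nan), max(non_nan)) if non_nan else (None, None)

def can_prune_gt(stats_max, literal):
    """Naive pruning rule for `x > literal`: skip the chunk when max <= literal."""
    return stats_max is not None and stats_max <= literal

def matches_gt(value, literal):
    """Row-level `x > literal`, assuming NaN sorts above every other value."""
    if math.isnan(value):
        return not math.isnan(literal)
    return value > literal

column = [1.0, 0.0, -1.0, float("nan"), -2.0]
stats_min, stats_max = min_max_ignoring_nan(column)

# The chunk is pruned based on statistics alone...
pruned = can_prune_gt(stats_max, 2.0)
# ...even though a full scan would have found one matching row (the NaN).
matching_rows = [v for v in column if matches_gt(v, 2.0)]
```

Here `pruned` is true while `matching_rows` still contains the NaN row, which is the mismatch this issue reports.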

To Reproduce

> select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16');
+---+
| x |
+---+
+---+
0 row(s) fetched. 

Expected behavior

The above query should return a single row containing NaN.

Additional context

The Parquet community is considering changes to allow for NaN in statistics; the currently favored approach is to add a new ColumnOrder to the specification. This will correct the issue above, but DataFusion will need to check the ColumnOrder to know whether floating point statistics can be trusted.
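A rough sketch of what such a check could look like follows. Everything here is illustrative: the `NAN_AWARE` variant is a placeholder for whatever ColumnOrder the specification ends up adding, the type names are plain strings rather than real Parquet types, and the functions are hypothetical rather than DataFusion APIs. The idea is simply to refuse to prune on floating point min/max unless the column order says the statistics account for NaN.

```python
from enum import Enum

class ColumnOrder(Enum):
    TYPE_DEFINED = "type_defined"  # the ordering files use today
    NAN_AWARE = "nan_aware"        # hypothetical placeholder for the proposed order

# Illustrative type names; not tied to any real Parquet reader API.
FLOAT_TYPES = {"FLOAT16", "FLOAT", "DOUBLE"}

def float_stats_trusted(value_type, column_order):
    """Non-float statistics are always usable; float statistics only when NaN-aware."""
    if value_type not in FLOAT_TYPES:
        return True
    return column_order is ColumnOrder.NAN_AWARE

def can_prune_gt(value_type, column_order, stats_max, literal):
    """Pruning rule for `x > literal` that stays conservative for untrusted stats."""
    if not float_stats_trusted(value_type, column_order):
        return False  # must scan the chunk: a NaN row may be hiding behind the max
    # If stats_max is NaN, this comparison is False and the chunk is kept.
    return stats_max <= literal
```

With this rule, `can_prune_gt("DOUBLE", ColumnOrder.TYPE_DEFINED, 1.0, 2.0)` declines to prune, at the cost of scanning chunks that contain no NaN at all.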

Also note that if/when apache/parquet-format#221 is merged, other predicates such as isnan(x) might be candidates for pruning, but that is an optimization.

Labels: bug
