Description
Describe the bug
DataFusion is over-aggressive in pruning floating point columns. This was mentioned in #15742 (comment) and discussed in detail in apache/parquet-format#221. The issue appears with predicates of the form x [gt|lt] literal. Consider a column consisting of [1.0, 0.0, -1.0, NaN, -2.0]: because NaN is left out of the statistics, the max will be 1 and the min -2. A query like select * from ... where x > 2 will return no rows because no chunk exists where max > 2, so the chunk holding the NaN is pruned before the row is ever compared.
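The mismatch can be sketched in a few lines of Python. This is only an illustration of the reasoning, not DataFusion's pruning code: the helper names (chunk_stats, gt_nan_largest, can_prune_gt) are made up for this example, and it assumes that statistics skip NaN and that NaN compares greater than any other value, which is what the expected behavior in this issue implies.

```python
import math

def chunk_stats(values):
    """Min/max as recorded in the statistics, assuming NaN is skipped."""
    non_nan = [v for v in values if not math.isnan(v)]
    return min(non_nan), max(non_nan)

def gt_nan_largest(x, literal):
    """x > literal, assuming NaN sorts above every other value."""
    if math.isnan(x):
        return not math.isnan(literal)
    return x > literal

def can_prune_gt(stats_max, literal):
    """Naive pruning rule for x > literal: skip the chunk unless max > literal."""
    return not gt_nan_largest(stats_max, literal)

column = [1.0, 0.0, -1.0, float("nan"), -2.0]
col_min, col_max = chunk_stats(column)

# Statistics-based decision: the chunk is skipped entirely.
pruned = can_prune_gt(col_max, 2.0)

# Row-level evaluation of the same predicate: the NaN row matches.
matching = [v for v in column if gt_nan_largest(v, 2.0)]

print(col_min, col_max, pruned, len(matching))
```

Under these assumptions the statistics say the chunk can be skipped, while evaluating the predicate row by row would keep the NaN row.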
To Reproduce
> select * from 'parquet-testing/data/float16_nonzeros_and_nans.parquet' where x > arrow_cast(2.0, 'Float16');
+---+
| x |
+---+
+---+
0 row(s) fetched.
Expected behavior
The above query should return a single row containing NaN.
Additional context
The Parquet community is considering changes to allow for NaN in statistics, with the currently favored approach being to add a new ColumnOrder to the specification. This will correct the issue above, but DataFusion will need to check the ColumnOrder to know whether or not floating point statistics can be trusted.
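One conservative shape for such a check is sketched below. It is hypothetical: may_prune_gt and the stats_account_for_nan flag are illustrative names, not DataFusion or Parquet APIs, and the flag merely stands in for whatever the new ColumnOrder ends up signalling. It also assumes that NaN-aware statistics would let a NaN show up in the recorded max.

```python
import math

def may_prune_gt(stats_max, literal, is_float, stats_account_for_nan):
    """Return True only when skipping a chunk for x > literal looks safe."""
    if is_float and not stats_account_for_nan:
        # A NaN may be hiding behind the recorded max, so keep the chunk.
        return False
    if math.isnan(stats_max):
        # NaN-aware max that is itself NaN: assumed to sort above any literal.
        return False
    return stats_max <= literal

# Float column with legacy statistics: never pruned on max alone.
print(may_prune_gt(1.0, 2.0, is_float=True, stats_account_for_nan=False))
# Float column whose statistics are assumed to account for NaN: prunable.
print(may_prune_gt(1.0, 2.0, is_float=True, stats_account_for_nan=True))
```

The trade-off is that float columns with untrusted statistics lose pruning for these predicates, which costs performance but avoids dropping rows.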
Also note that if/when apache/parquet-format#221 is merged, other predicates such as isnan(x) might be candidates for pruning, but that is an optimization.