ParquetExec::statistics::is_exact likely wrong/misunderstood #5614

@crepererum

Description

A ParquetExec is created from a FileScanConfig and an optional filter predicate[1]. These two are different, independent parameters; at least, the documentation does not imply that the predicate should be considered when constructing the FileScanConfig. The statistics for the ParquetExec are then calculated by FileScanConfig::project:

https://github.com/apache/arrow-datafusion/blob/0f6931caa6f8b48e116a8e77e989c404f31f3f8d/datafusion/core/src/physical_plan/file_format/mod.rs#L213-L219

This forwards is_exact from the input, which might have been set to true. However, if there is a predicate, is_exact should likely be false, because the predicate may remove some data and the forwarded statistics would then no longer be exact. So either the forwarding is wrong (at least when a predicate is given) or the docs are imprecise.
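To make the concern concrete, here is a minimal sketch of what a predicate-aware version could look like. It is not DataFusion's actual code: the `Statistics` struct is a simplified stand-in with only two fields, and `scan_statistics` and its `has_predicate` parameter are hypothetical names chosen for illustration.

```rust
/// Simplified stand-in for the statistics struct (assumption: the real one
/// carries more fields, such as per-column statistics).
#[derive(Debug, Clone, PartialEq)]
pub struct Statistics {
    pub num_rows: Option<usize>,
    pub is_exact: bool,
}

/// Hypothetical helper: the statistics a scan could report, given the
/// statistics of its input files and whether a filter predicate is set.
pub fn scan_statistics(input: &Statistics, has_predicate: bool) -> Statistics {
    Statistics {
        // The row count is carried over unchanged; with a predicate it is
        // only an upper bound on the rows actually produced.
        num_rows: input.num_rows,
        // Exactness survives only if the input was exact AND no predicate
        // can remove rows during the scan.
        is_exact: input.is_exact && !has_predicate,
    }
}
```

With an exact input of 100 rows, this sketch keeps is_exact=true when no predicate is given and reports is_exact=false once a predicate is present, while still passing the row count along as an estimate.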

Note that this is unrelated to #5613 because this issue here is about the is_exact=true case.

Footnotes

  1. And a metadata size hint, but this is irrelevant here.