Background
I've been exploring the statistics collection in DataFusion, particularly for Parquet, in the infer_stats method of datafusion/datasource-parquet/src/file_format.rs. I noticed that DataFusion collects statistics like:
- Row counts
- Null counts
- Min/max values
- Total byte size
There doesn't appear to be any logic for computing NDV (Number of Distinct Values). The distinct_count field is explicitly set to Precision::Absent.
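To make the gap concrete, here is a minimal sketch of what the per-column statistics look like from a consumer's point of view. The types are simplified local stand-ins, not DataFusion's own definitions: Precision mirrors the Exact / Inexact / Absent shape of the real enum, while ColumnStats and has_ndv are hypothetical names used only for illustration.

```rust
// Simplified local stand-in for DataFusion's Precision enum (assumption:
// only the Exact / Inexact / Absent shape is mirrored here).
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

// Hypothetical, trimmed-down per-column statistics record.
#[allow(dead_code)]
#[derive(Debug)]
struct ColumnStats {
    null_count: Precision<usize>,
    min_value: Precision<i64>,
    max_value: Precision<i64>,
    distinct_count: Precision<usize>,
}

// Returns true only when some NDV value (exact or estimated) is present.
fn has_ndv(col: &ColumnStats) -> bool {
    !matches!(col.distinct_count, Precision::Absent)
}

fn main() {
    // Statistics as described above: everything filled in except NDV.
    let col = ColumnStats {
        null_count: Precision::Exact(0),
        min_value: Precision::Exact(1),
        max_value: Precision::Exact(1_000),
        distinct_count: Precision::Absent,
    };
    println!("NDV available: {}", has_ndv(&col));
}
```

Any optimizer rule that wants NDV would hit the Absent branch for every Parquet column in this situation.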
Is there existing NDV computation?
- Is there another mechanism in DataFusion for computing NDV that I've missed?
- Are there plans to implement NDV computation in the future?
Impact on Query Optimization
Without NDV statistics, the query optimizer might struggle to choose optimal join orders, especially for queries with multiple joins. In traditional optimizers, NDV is crucial for estimating join cardinalities and selecting the best join ordering. If NDV computation isn't currently available, how do we ensure accurate join ordering in TPC-H queries? Are there alternative statistics or hints we're using?
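To illustrate why this matters, here is a small sketch of the textbook equi-join estimate, rows(L) * rows(R) / max(ndv(L.key), ndv(R.key)). This is not DataFusion's implementation; JoinSide, Ndv, and estimate_join_rows are hypothetical names, the row counts are made up, and the fallback used when NDV is absent (treating every key as unique, i.e. NDV = row count) is just one possible assumption.

```rust
// Hypothetical stand-in for an NDV statistic that may be missing.
#[derive(Debug, Clone, Copy)]
enum Ndv {
    Known(u64),
    Absent,
}

// Hypothetical description of one join input: row count plus join-key NDV.
#[derive(Debug, Clone, Copy)]
struct JoinSide {
    rows: u64,
    key_ndv: Ndv,
}

// Effective NDV for one side. Assumption: when NDV is absent, fall back to
// the row count, which treats every join key as unique.
fn effective_ndv(side: &JoinSide) -> u64 {
    match side.key_ndv {
        Ndv::Known(n) => n.min(side.rows).max(1),
        Ndv::Absent => side.rows.max(1),
    }
}

// Textbook equi-join cardinality estimate:
// rows(L) * rows(R) / max(ndv(L.key), ndv(R.key)).
fn estimate_join_rows(left: &JoinSide, right: &JoinSide) -> u64 {
    let denom = effective_ndv(left).max(effective_ndv(right)) as u128;
    let product = left.rows as u128 * right.rows as u128;
    (product / denom) as u64
}

fn main() {
    // Illustrative numbers only: 1,000 distinct keys on both sides.
    let left = JoinSide { rows: 1_000_000, key_ndv: Ndv::Known(1_000) };
    let right = JoinSide { rows: 10_000, key_ndv: Ndv::Known(1_000) };
    println!("with NDV:    {}", estimate_join_rows(&left, &right));

    // Same inputs, but NDV is absent on both sides.
    let left_no = JoinSide { key_ndv: Ndv::Absent, ..left };
    let right_no = JoinSide { key_ndv: Ndv::Absent, ..right };
    println!("without NDV: {}", estimate_join_rows(&left_no, &right_no));
}
```

With these inputs the estimate drops from 10,000,000 rows to 10,000 once NDV is missing, a 1000x difference, which is the kind of error that could plausibly flip a join-order decision.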