Question about Statistics Collection (specifically NDV) #15265

### Background
I've been exploring statistics collection in DataFusion, particularly for Parquet, in the `infer_stats` method in `datafusion/datasource-parquet/src/file_format.rs`. I noticed that DataFusion collects statistics such as:

- Row counts
- Null counts
- Min/max values
- Total byte size

There doesn't appear to be any logic for computing **NDV (Number of Distinct Values)**. The `distinct_count` field is explicitly set to `Precision::Absent`.
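To make the gap concrete, here is a minimal sketch of what an exact NDV for a single in-memory column could look like. It is not DataFusion code: the `Precision` enum below is a local stand-in that only mirrors the shape of DataFusion's type, and `exact_ndv` is a hypothetical helper name. It also assumes the whole column fits in memory, which a real implementation over Parquet files could not rely on.

```rust
use std::collections::HashSet;

/// Local stand-in mirroring the shape of DataFusion's `Precision` enum.
/// (Assumption: defined here only so the sketch is self-contained.)
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum Precision<T> {
    Exact(T),
    Inexact(T),
    Absent,
}

/// Hypothetical helper: exact number of distinct non-null values in a column.
fn exact_ndv(values: &[Option<i64>]) -> Precision<usize> {
    // `flatten` skips the `None` (null) entries; the set keeps one copy per value.
    let distinct: HashSet<i64> = values.iter().flatten().copied().collect();
    Precision::Exact(distinct.len())
}

fn main() {
    let column = [Some(1), Some(2), None, Some(2), Some(3)];
    // Three distinct non-null values: 1, 2, 3.
    println!("{:?}", exact_ndv(&column)); // prints "Exact(3)"
}
```

An exact set like this grows with the number of distinct values, so a sketch-based estimate reported as `Inexact` would probably be the more practical choice for large files.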

### Is there existing NDV computation?
1. Is there another mechanism in DataFusion for computing NDV that I've missed?
2. Are there plans to implement NDV computation in the future?

### Impact on Query Optimization
Without NDV statistics, the query optimizer may struggle to choose a good join order, especially for queries with multiple joins. In traditional optimizers, for example, NDV is crucial for estimating join cardinalities and selecting the best join ordering. If NDV computation isn't currently available, how do we ensure accurate join ordering in TPC-H queries? Are there alternative statistics or hints we're using?
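For reference, the textbook equi-join estimate I have in mind is `|L ⋈ R| ≈ |L| * |R| / max(NDV(L.k), NDV(R.k))`. The sketch below is only an illustration of that formula, not DataFusion's estimator: `estimate_join_rows` is a hypothetical name, the row counts are made-up numbers, and `Option<u64>` stands in for a possibly absent `distinct_count`.

```rust
/// Textbook equi-join cardinality estimate:
/// |L ⋈ R| ≈ |L| * |R| / max(NDV(L.k), NDV(R.k)).
/// Returns `None` when either NDV is unknown (or the larger one is zero),
/// which is the situation when `distinct_count` is absent.
fn estimate_join_rows(
    left_rows: u64,
    right_rows: u64,
    left_ndv: Option<u64>,
    right_ndv: Option<u64>,
) -> Option<u64> {
    let max_ndv = left_ndv?.max(right_ndv?);
    if max_ndv == 0 {
        return None;
    }
    // Widen to u128 so the product of two row counts cannot overflow.
    let product = left_rows as u128 * right_rows as u128;
    Some((product / max_ndv as u128) as u64)
}

fn main() {
    // Illustrative sizes only: a 1,000,000-row fact table joined to a
    // 10,000-row dimension table on a key with up to 10,000 distinct values.
    let with_ndv = estimate_join_rows(1_000_000, 10_000, Some(8_000), Some(10_000));
    println!("{:?}", with_ndv); // prints "Some(1000000)"

    // Without NDV on one side, the formula gives the optimizer nothing to use.
    let without_ndv = estimate_join_rows(1_000_000, 10_000, None, Some(10_000));
    println!("{:?}", without_ndv); // prints "None"
}
```

With NDV absent, the second call has no estimate at all, so the join order would have to be decided from row counts or some other fallback.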
