Improve metrics in DataSourceExec with Parquet source #18195

@2010YOUY01

Is your feature request related to a problem or challenge?

While working on setting the default metrics for the parquet scanner in #18116, I came up with several ideas to further improve the metrics reported by EXPLAIN ANALYZE for the parquet scanner.

Support a new metric value files_ranges_matched_statistics

There is an existing metric, files_ranges_pruned_statistics:

pub files_ranges_pruned_statistics: Count,

It would be good to also display how many file ranges are matched, which makes the output more complete and consistent with the existing row-group and page-level metrics (each of which reports both a pruned and a matched count).
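A minimal sketch of how the pruned and matched counts could be recorded side by side. It uses only the standard library: `Counter`, `FileRangeMetrics`, and `record` are hypothetical names for illustration, not DataFusion's actual `Count` type or metrics struct. Only the two metric field names come from this issue.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical stand-in for a counter metric (not DataFusion's `Count`).
#[derive(Debug, Default)]
pub struct Counter(AtomicUsize);

impl Counter {
    /// Add `n` to the counter.
    pub fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }

    /// Read the current value.
    pub fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Hypothetical metrics struct holding both sides of file-range pruning.
#[derive(Debug, Default)]
pub struct FileRangeMetrics {
    /// File ranges skipped thanks to statistics (existing metric name).
    pub files_ranges_pruned_statistics: Counter,
    /// File ranges that survived pruning (metric name proposed here).
    pub files_ranges_matched_statistics: Counter,
}

impl FileRangeMetrics {
    /// Record the outcome of statistics-based pruning for one file range.
    pub fn record(&self, pruned: bool) {
        if pruned {
            self.files_ranges_pruned_statistics.add(1);
        } else {
            self.files_ranges_matched_statistics.add(1);
        }
    }
}
```

Recording both outcomes at the same call site keeps the invariant pruned + matched = total file ranges considered, which is what makes the pair easy to read together.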

Add a new metric value scan_efficiency_ratio

I think it would be helpful to track:

scan_efficiency_ratio -- bytes_scanned / total_file_size, as a quick indicator of overall pruning effectiveness (a lower ratio means a smaller share of the file bytes had to be read)
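As a rough sketch, the ratio could be computed as below. The function name and signature are hypothetical, and returning `None` for a zero total size is an assumption of this sketch; the issue does not specify how that case should be handled.

```rust
/// Hypothetical helper: bytes_scanned / total_file_size.
/// Returns `None` when the total size is zero (an assumption of this sketch),
/// so callers never see a NaN or infinite ratio.
pub fn scan_efficiency_ratio(bytes_scanned: u64, total_file_size: u64) -> Option<f64> {
    if total_file_size == 0 {
        None
    } else {
        Some(bytes_scanned as f64 / total_file_size as f64)
    }
}
```

For example, scanning 25 bytes of a 100-byte file gives a ratio of 0.25.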

Fix the elapsed_compute baseline metric not being counted

It seems the elapsed_compute baseline metric is currently not tracked. You can try any whole-file scan on a parquet source in datafusion-cli, and the metric will be unrealistically low:

DataFusion CLI v50.2.0
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
0 row(s) fetched.
Elapsed 0.049 seconds.

> explain analyze
select * from lineitem;

...elapsed_compute = 14ns...
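The issue does not say where the time goes missing. The sketch below only illustrates the general pattern of accumulating compute time with a drop-based timer guard, so that every CPU-bound section is counted. All names here (`ElapsedNanos`, `TimerGuard`, `timed_sum`) are hypothetical and use only the standard library; this is not DataFusion's metrics API.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Hypothetical accumulator for elapsed compute time, in nanoseconds.
#[derive(Debug, Default)]
pub struct ElapsedNanos(AtomicU64);

impl ElapsedNanos {
    /// Total nanoseconds recorded so far.
    pub fn value(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }

    /// Start timing; the elapsed time is added when the guard is dropped.
    pub fn timer(&self) -> TimerGuard<'_> {
        TimerGuard { metric: self, start: Instant::now() }
    }
}

/// Guard that records the time between its creation and its drop.
pub struct TimerGuard<'a> {
    metric: &'a ElapsedNanos,
    start: Instant,
}

impl Drop for TimerGuard<'_> {
    fn drop(&mut self) {
        let nanos = self.start.elapsed().as_nanos() as u64;
        self.metric.0.fetch_add(nanos, Ordering::Relaxed);
    }
}

/// Example of wrapping a CPU-bound section with the timer guard.
pub fn timed_sum(metric: &ElapsedNanos, values: &[u64]) -> u64 {
    let _timer = metric.timer(); // recorded when `_timer` goes out of scope
    values.iter().sum()
}
```

With this pattern, a metric stays near zero only if the expensive work runs outside any guarded scope, which is one plausible (but unconfirmed) explanation for a reading like 14ns on a full scan.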

Add a new metric type for the general pruning-related metrics

There are many levels of pruning inside the parquet scanner: file range, row group statistics, row group bloom filter, page index, ...
Currently each level is displayed as a pair of metrics, like row_groups_matched_statistics=3, row_groups_pruned_statistics=7

I think displaying it as row_groups_statistics_pruning= 10 total -> 3 matched looks better, and it would make the lengthy existing metrics output more concise.

To do it, we can add a new metric value type, and change its display implementation.
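A minimal sketch of such a value type, assuming it stores the pruned and matched counts and derives the total at display time. `PruningCounts` is a hypothetical name, not an existing DataFusion type, and the exact spacing of the output is a choice of this sketch.

```rust
use std::fmt;

/// Hypothetical metric value combining both sides of one pruning level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct PruningCounts {
    /// Units (file ranges, row groups, pages, ...) skipped by pruning.
    pub pruned: usize,
    /// Units that survived pruning and must be read.
    pub matched: usize,
}

impl fmt::Display for PruningCounts {
    /// Renders as "<total> total -> <matched> matched".
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        let total = self.pruned + self.matched;
        write!(f, "{} total -> {} matched", total, self.matched)
    }
}
```

With the numbers from the example above (7 pruned, 3 matched), `format!("row_groups_statistics_pruning={}", counts)` yields `row_groups_statistics_pruning=10 total -> 3 matched`, replacing two separate metric entries with one.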

Describe the solution you'd like

No response

Describe alternatives you've considered

No response

Additional context

No response

Labels: enhancement, help wanted
