Description
Is your feature request related to a problem or challenge?
While working on setting the default metrics in the parquet scanner in #18116, I came up with several ideas to further improve the metrics accounting in EXPLAIN ANALYZE for the parquet scanner.
- Support a new metric value files_ranges_matched_statistics (feat: Introduce PruningMetrics and use it in parquet file pruning metric #18297)
- Add a new metric value scan_efficiency_ratio (feat(parquet): Implement scan_efficiency_ratio metric for parquet reading #18577)
- Fix elapsed_compute baseline metrics not counting issue
- Add a new metric type for the general pruning-related metrics (feat: Introduce PruningMetrics and use it in parquet file pruning metric #18297)
Support a new metric value files_ranges_matched_statistics
There is an existing metric files_ranges_pruned_statistics:
pub files_ranges_pruned_statistics: Count,
It would be good to also display how many file ranges are matched, to make the output more comprehensive and consistent with the existing row-group/page level metrics.
Add a new metric value scan_efficiency_ratio
I think it would be helpful to track:
scan_efficiency_ratio = bytes_scanned / total_file_size, as a quick indicator of the overall pruning effectiveness
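To make the proposed ratio concrete, here is a minimal sketch of how it could be computed and rendered. The function names and the choice to return `None` for an empty file are my assumptions for illustration, not existing DataFusion APIs.

```rust
/// Hypothetical helper (not a DataFusion API): fraction of the file's bytes
/// that were actually read by the scan.
/// Returns `None` when the total size is zero, to avoid dividing by zero.
pub fn scan_efficiency_ratio(bytes_scanned: u64, total_file_size: u64) -> Option<f64> {
    if total_file_size == 0 {
        return None;
    }
    Some(bytes_scanned as f64 / total_file_size as f64)
}

/// Hypothetical display helper: render the ratio as a percentage,
/// or "N/A" when it is undefined.
pub fn format_ratio(ratio: Option<f64>) -> String {
    match ratio {
        Some(r) => format!("{:.2}%", r * 100.0),
        None => "N/A".to_string(),
    }
}
```

A lower ratio would mean more of the file was skipped, i.e. pruning was more effective.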
Fix elapsed_compute baseline metrics not counting issue
It seems the elapsed_compute baseline metric is currently not tracked. You can try any whole-file scan on a parquet source in datafusion-cli, and the reported value will be unrealistically low:
DataFusion CLI v50.2.0
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
0 row(s) fetched.
Elapsed 0.049 seconds.
> explain analyze
select * from lineitem;
...elapsed_compute = 14ns...
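I have not confirmed the root cause, but one way a timer metric can stay near zero is when the CPU-bound work is never wrapped in a timed section. The sketch below uses only the standard library to show the general pattern of accumulating elapsed time around a unit of work. `ElapsedNanos` and its methods are hypothetical names for illustration, not DataFusion types.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::Instant;

/// Hypothetical accumulator for compute time, in nanoseconds.
#[derive(Debug, Default)]
pub struct ElapsedNanos(AtomicU64);

impl ElapsedNanos {
    /// Total time recorded so far, in nanoseconds.
    pub fn value(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }

    /// Run `work`, add its wall-clock duration to the accumulator,
    /// and return whatever `work` produced.
    pub fn time<T>(&self, work: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = work();
        self.0
            .fetch_add(start.elapsed().as_nanos() as u64, Ordering::Relaxed);
        out
    }
}
```

With this pattern, any work that runs outside `time(...)` is simply not counted, which would produce a value like the one in the output above.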
Add a new metric type for the general pruning-related metrics
There are many levels of pruning inside the parquet scanner: file range, row group statistics, row group bloom filter, page index, ...
They are currently displayed like row_groups_matched_statistics=3, row_groups_pruned_statistics=7
I think displaying them as row_groups_statistics_pruning=10 total -> 3 matched looks better, and it can make the lengthy existing metrics output more concise.
To do this, we can add a new metric value type and change its display implementation.
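As a rough illustration of the idea, a combined value type could hold both counters and implement `Display` in the proposed format. `PruningCounts` and its methods are hypothetical names for this sketch. It is not the actual metric type in DataFusion, which would also need to plug into the existing metrics framework.

```rust
use std::fmt;

/// Hypothetical combined pruning metric: how many units (file ranges,
/// row groups, pages, ...) were pruned and how many were kept.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct PruningCounts {
    pub pruned: usize,
    pub matched: usize,
}

impl PruningCounts {
    /// Record `n` units that were skipped by pruning.
    pub fn add_pruned(&mut self, n: usize) {
        self.pruned += n;
    }

    /// Record `n` units that survived pruning.
    pub fn add_matched(&mut self, n: usize) {
        self.matched += n;
    }

    /// Total units considered: pruned plus matched.
    pub fn total(&self) -> usize {
        self.pruned + self.matched
    }
}

impl fmt::Display for PruningCounts {
    /// Renders as "<total> total -> <matched> matched".
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{} total -> {} matched", self.total(), self.matched)
    }
}
```

Recording 3 matched and 7 pruned row groups and printing the value after `row_groups_statistics_pruning=` gives the single entry proposed above in place of two separate counters.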