Description
Is your feature request related to a problem or challenge?
- This is a follow-on to the feature added by @adriangb in Add late pruning of Parquet files based on file level statistics #16014
@adriangb added the great feature that can prune entire files while opening many Parquet files.
The current statistics for DataSourceExec have information on how many row groups were pruned. It would also be great to add statistics on how many FILES were pruned by this new code.
For example, with ClickBench Q24, here is an excerpt from the EXPLAIN ANALYZE output:
EXPLAIN ANALYZE SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "SearchPhrase" LIMIT 10;
| | DataSourceExec:...
pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=325, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0
Describe the solution you'd like
I would like a new statistic that records:
- files_pruned: total files that were pruned by filters during open
It is important to make sure the docs explain that the metric only describes files pruned after the plan starts executing (not files that are pruned during planning).
Describe alternatives you've considered
- Add a field to ParquetFileMetrics: https://github.com/apache/datafusion/blob/6d5e00ad3f8e53f7252cb1d3c72a6c7f28c1aed6/datafusion/datasource-parquet/src/metrics.rs#L29-L28
- Thread that through to the opener in datafusion/datasource-parquet/src/opener.rs so when files are pruned we can see that in the metrics
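To make the two steps above concrete, here is a rough, standalone sketch of the idea. It does not use DataFusion's real types: `Counter`, `FileMetricsSketch`, `FileStats`, `open_file` and `count_opened` are all hypothetical stand-ins, and the pruning rule (a single `col > threshold` predicate checked against a per-file max) is an assumption made only for illustration. The real change would presumably add the counter to ParquetFileMetrics and bump it in the opener.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a shared metrics counter
/// (NOT DataFusion's actual metric type).
#[derive(Debug, Default, Clone)]
struct Counter(Arc<AtomicUsize>);

impl Counter {
    fn add(&self, n: usize) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }

    fn value(&self) -> usize {
        self.0.load(Ordering::Relaxed)
    }
}

/// Simplified stand-in for `ParquetFileMetrics` with the proposed field.
#[derive(Debug, Default, Clone)]
struct FileMetricsSketch {
    /// Number of files pruned by filters while opening, i.e. after the
    /// plan has started executing. Files pruned during planning are
    /// NOT counted here.
    files_pruned: Counter,
}

/// Hypothetical file-level statistics: the max of a single i64 column.
struct FileStats {
    max: i64,
}

/// Hypothetical opener step for the predicate `col > threshold`.
/// Returns true if the file must be read, false if it was pruned.
fn open_file(stats: &FileStats, threshold: i64, metrics: &FileMetricsSketch) -> bool {
    if stats.max <= threshold {
        // No row in this file can match, so skip the whole file
        // and record that in the metrics.
        metrics.files_pruned.add(1);
        return false;
    }
    true
}

/// Runs the opener step over every file and returns how many were kept.
fn count_opened(files: &[FileStats], threshold: i64, metrics: &FileMetricsSketch) -> usize {
    files
        .iter()
        .filter(|f| open_file(f, threshold, metrics))
        .count()
}
```

With files whose max values are 5, 50 and 10 and a threshold of 10, `count_opened` keeps one file and `metrics.files_pruned.value()` ends up at 2. Because the counter is only incremented inside the opener step, it naturally excludes anything pruned at planning time, which is the distinction the docs need to spell out.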
Additional context
No response