Is your feature request related to a problem or challenge?
Previously, by default, EXPLAIN ANALYZE showed all available metrics for each operator, which can get quite verbose for some operators.
In datafusion-cli:
> CREATE EXTERNAL TABLE IF NOT EXISTS lineitem
STORED AS parquet
LOCATION '/Users/yongting/Code/datafusion/benchmarks/data/tpch_sf1/lineitem';
0 row(s) fetched.
Elapsed 0.000 seconds.
> explain analyze select *
from lineitem
where l_orderkey = 3000000;
The Parquet reader reports a large number of low-level details:
metrics=[output_rows=19813, elapsed_compute=14ns, batches_split=0, bytes_scanned=2147308, file_open_errors=0, file_scan_errors=0, files_ranges_pruned_statistics=18, num_predicate_creation_errors=0, page_index_rows_matched=19813, page_index_rows_pruned=729088, predicate_cache_inner_records=0, predicate_cache_records=0, predicate_evaluation_errors=0, pushdown_rows_matched=0, pushdown_rows_pruned=0, row_groups_matched_bloom_filter=0, row_groups_matched_statistics=1, row_groups_pruned_bloom_filter=0, row_groups_pruned_statistics=0, bloom_filter_eval_time=21.997µs, metadata_load_time=273.83µs, page_index_eval_time=29.915µs, row_pushdown_eval_time=42ns, statistics_eval_time=76.248µs, time_elapsed_opening=4.02146ms, time_elapsed_processing=24.787461ms, time_elapsed_scanning_total=24.17671ms, time_elapsed_scanning_until_data=23.103665ms]
I believe only a subset of these is commonly used (for example output_rows, metadata_load_time, and how many files/row groups/pages are pruned), and it would be better to display only the most common ones by default.
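As a rough illustration of the idea (a self-contained sketch, not DataFusion's actual API; the function and key names here are hypothetical), displaying only the common metrics amounts to filtering the full metric set down to a summary whitelist before rendering:

```rust
// Hypothetical sketch: reduce a full metric set to a "summary" whitelist
// before it is rendered in EXPLAIN ANALYZE output. Not DataFusion's real API.
use std::collections::BTreeMap;

fn summarize(
    metrics: &BTreeMap<&str, String>,
    summary_keys: &[&str],
) -> BTreeMap<String, String> {
    metrics
        .iter()
        // Keep only metrics whose name is in the summary whitelist.
        .filter(|(name, _)| summary_keys.contains(name))
        .map(|(name, value)| (name.to_string(), value.clone()))
        .collect()
}

fn main() {
    let mut metrics = BTreeMap::new();
    metrics.insert("output_rows", "19813".to_string());
    metrics.insert("metadata_load_time", "273.83µs".to_string());
    metrics.insert("bytes_scanned", "2147308".to_string());
    metrics.insert("predicate_cache_records", "0".to_string());

    // Only the whitelisted metrics survive.
    let summary = summarize(&metrics, &["output_rows", "metadata_load_time"]);
    assert_eq!(summary.len(), 2);
    println!("{summary:?}");
}
```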
After #18098, the EXPLAIN ANALYZE detail level can be controlled through an option:
> set datafusion.explain.analyze_level = summary;
0 row(s) fetched.
Elapsed 0.000 seconds.
> explain analyze select * from generate_series(10000) as t1(v1) order by v1 desc;
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | SortExec: expr=[v1@0 DESC], preserve_partitioning=[false], metrics=[output_rows=10001, elapsed_compute=100µs] |
| | ProjectionExec: expr=[value@0 as v1], metrics=[output_rows=10001, elapsed_compute=1.166µs] |
| | LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10000, batch_size=8192], metrics=[output_rows=10001, elapsed_compute=43µs] |
| | |
+-------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.001 seconds.
> set datafusion.explain.analyze_level = dev;
0 row(s) fetched.
Elapsed 0.000 seconds.
> explain analyze select * from generate_series(10000) as t1(v1) order by v1 desc;
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Plan with Metrics | SortExec: expr=[v1@0 DESC], preserve_partitioning=[false], metrics=[output_rows=10001, elapsed_compute=222.043µs, spill_count=0, spilled_bytes=0.0 B, spilled_rows=0, batches_split=2] |
| | ProjectionExec: expr=[value@0 as v1], metrics=[output_rows=10001, elapsed_compute=2.584µs] |
| | LazyMemoryExec: partitions=1, batch_generators=[generate_series: start=0, end=10000, batch_size=8192], metrics=[output_rows=10001, elapsed_compute=162.625µs] |
| | |
+-------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.003 seconds.
Currently, only output_rows and elapsed_compute are included at the summary analyze level, and the dev level is the default.
The goal of this issue is, for each operator:
- Investigate which metrics are most commonly used for high-level insights; Postgres and DuckDB's output can be used as a reference
- Mark those common metrics as summary type

And finally, set the summary analyze level as the default.
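The classification step could, as a rough sketch (the type and function names below are hypothetical, not DataFusion's actual types), amount to tagging each metric name with the minimum analyze level at which it is displayed:

```rust
// Hypothetical sketch: tag each metric with the minimum analyze level
// at which it should be shown. Not DataFusion's real API.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum AnalyzeLevel {
    Summary, // high-level insights only
    Dev,     // everything, for operator developers
}

fn metric_level(name: &str) -> AnalyzeLevel {
    match name {
        // Common metrics that every operator reports.
        "output_rows" | "elapsed_compute" => AnalyzeLevel::Summary,
        // Low-level details stay at the dev level by default.
        _ => AnalyzeLevel::Dev,
    }
}

// A metric is visible if its level is at or below the requested level
// (Summary < Dev by declaration order of the enum).
fn visible_at(name: &str, level: AnalyzeLevel) -> bool {
    metric_level(name) <= level
}

fn main() {
    assert!(visible_at("output_rows", AnalyzeLevel::Summary));
    assert!(!visible_at("bytes_scanned", AnalyzeLevel::Summary));
    assert!(visible_at("bytes_scanned", AnalyzeLevel::Dev));
}
```

The per-operator work in the checklist below would then amount to deciding, for each operator's metrics, which names belong at the summary level.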
Describe the solution you'd like
Identify which metrics should be included at the summary level for each of the following operators:
- DataSourceExec
- DataSinkExec
- AnalyzeExec
- AsyncFuncExec
- AggregateExec
- CoalesceBatchesExec
- CoalescePartitionsExec
- CooperativeExec
- CrossJoinExec
- EmptyExec
- ExplainExec
- FilterExec
- GlobalLimitExec
- LocalLimitExec
- HashJoinExec
- NestedLoopJoinExec
- SortMergeJoinExec
- SymmetricHashJoinExec
- PlaceholderRowExec
- ProjectionExec
- RecursiveQueryExec
- RepartitionExec
- SortExec
- PartialSortExec
- SortPreservingMergeExec
- StreamingTableExec
- UnionExec
- InterleaveExec
- UnnestExec
- WindowAggExec
- BoundedWindowAggExec
- WorkTableExec
- LazyMemoryExec
Describe alternatives you've considered
No response
Additional context
No response