Update tpch, clickbench, sort_tpch to mark failed queries #16182
Conversation
Pull Request Overview
The PR updates several benchmark modules to mark failed queries and continue executing subsequent queries when a failure occurs. Key changes include:
- Adding a new "success" field to track query failure status.
- Updating benchmark execution in tpch, sort_tpch, imdb, and clickbench to mark failures instead of aborting.
- Modifying compare.py to exclude failed queries from average time calculations and reporting the list of failed queries.
Reviewed Changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| benchmarks/src/util/run.rs | Added "success" field and mark_failed method to track query results. |
| benchmarks/src/util/mod.rs | Re-exported QueryResult to support new benchmark result structure. |
| benchmarks/src/tpch/run.rs | Updated query execution loop to handle failures and record failed query IDs. |
| benchmarks/src/sort_tpch.rs | Adjusted benchmark loop to mark failed queries and print failure details. |
| benchmarks/src/imdb/run.rs | Applied similar failure handling as in tpch and sort_tpch modules. |
| benchmarks/src/clickbench.rs | Refactored benchmark_query logic to include failure handling and iterations abstraction. |
| benchmarks/compare.py | Updated average time calculations and display to exclude failed queries. |
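The compare.py change can be illustrated with a small sketch (written in Rust here to match the rest of the codebase; the struct and function names are illustrative assumptions, not the PR's exact code): averages are computed only over queries whose `success` flag is set, so a failed query no longer skews or aborts the comparison.

```rust
// Hypothetical, simplified mirror of the benchmark's per-query result;
// the PR's real type lives in benchmarks/src/util/run.rs.
#[derive(Debug)]
struct QueryRun {
    query: String,
    elapsed_ms: f64,
    success: bool,
}

// Average elapsed time over successful queries only, mirroring the
// compare.py change: failed queries are excluded from the average.
fn avg_successful_ms(runs: &[QueryRun]) -> Option<f64> {
    let ok: Vec<f64> = runs
        .iter()
        .filter(|r| r.success)
        .map(|r| r.elapsed_ms)
        .collect();
    if ok.is_empty() {
        None
    } else {
        Some(ok.iter().sum::<f64>() / ok.len() as f64)
    }
}

fn main() {
    let runs = vec![
        QueryRun { query: "Q1".into(), elapsed_ms: 100.0, success: true },
        QueryRun { query: "Q2".into(), elapsed_ms: 0.0, success: false },
        QueryRun { query: "Q3".into(), elapsed_ms: 300.0, success: true },
    ];
    // Q2 failed, so the average covers only Q1 and Q3: (100 + 300) / 2.
    assert_eq!(avg_successful_ms(&runs), Some(200.0));
}
```

Returning `Option<f64>` keeps the "every query failed" case explicit instead of dividing by zero.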
Thank you, this is quite helpful for working towards robust memory-limited execution. We need this feature because many queries currently fail under a memory limit; if they can finish stably in the future, we can consider making this a CI test.
I tested tpch/clickbench/sort_tpch. The first two run as expected, but the sort_tpch benchmark seems to need a fix in the runner 🤔
Benchmark commands:
```shell
cargo run --profile release-nonlto --bin tpch -- benchmark datafusion --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 --prefer_hash_join true --format parquet -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/tpch_sf1.json --memory-limit 1G

cargo run --profile release-nonlto --bin dfbench -- clickbench --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/hits.parquet --queries-path /Users/yongting/Code/datafusion/benchmarks/queries/clickbench/queries.sql -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/clickbench_1.json --memory-limit 1G

cargo run --profile release-nonlto --bin dfbench -- sort-tpch --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/sort_tpch.json --memory-limit 2G
```
sort_tpch's error message:
```text
ongting@Mac ~/C/d/benchmarks (pr-16182 *)> cargo run --profile release-nonlto --bin dfbench -- sort-tpch --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/sort_tpch.json --memory-limit 2G
    Finished `release-nonlto` profile [optimized] target(s) in 0.12s
     Running `/Users/yongting/Code/datafusion/target/release-nonlto/dfbench sort-tpch --iterations 5 --path /Users/yongting/Code/datafusion/benchmarks/data/tpch_sf10 -o /Users/yongting/Code/datafusion/benchmarks/results/pr-16182/sort_tpch.json --memory-limit 2G`
Q1 iteration 0 took 1645.0 ms and returned 59986052 rows
Q1 iteration 1 took 1674.9 ms and returned 59986052 rows
Q1 iteration 2 took 1635.0 ms and returned 59986052 rows
Q1 iteration 3 took 1672.7 ms and returned 59986052 rows
Q1 iteration 4 took 1604.2 ms and returned 59986052 rows
Q1 avg time: 1646.32 ms
Q2 iteration 0 took 1282.9 ms and returned 59986052 rows
Q2 iteration 1 took 1286.9 ms and returned 59986052 rows
Q2 iteration 2 took 1313.8 ms and returned 59986052 rows
Q2 iteration 3 took 1300.1 ms and returned 59986052 rows
Q2 iteration 4 took 1290.9 ms and returned 59986052 rows
Q2 avg time: 1294.90 ms
Q3 iteration 0 took 6326.5 ms and returned 59986052 rows
Q3 iteration 1 took 6289.8 ms and returned 59986052 rows
Q3 iteration 2 took 6223.8 ms and returned 59986052 rows
Q3 iteration 3 took 6329.6 ms and returned 59986052 rows
Q3 iteration 4 took 6333.3 ms and returned 59986052 rows
Q3 avg time: 6300.60 ms
Q4 iteration 0 took 1574.2 ms and returned 59986052 rows
Q4 iteration 1 took 1578.4 ms and returned 59986052 rows
Q4 iteration 2 took 1586.4 ms and returned 59986052 rows
Q4 iteration 3 took 1608.0 ms and returned 59986052 rows
Q4 iteration 4 took 1578.5 ms and returned 59986052 rows
Q4 avg time: 1585.13 ms
thread 'main' panicked at /Users/yongting/Code/datafusion/benchmarks/src/sort_tpch.rs:312:32:
called `Result::unwrap()` on an `Err` value: ResourcesExhausted("Additional allocation failed with top memory consumers (across reservations) as:\n  ExternalSorterMerge[2]#586(can spill: false) consumed 182.5 MB,\n  ExternalSorterMerge[10]#602(can spill: false) consumed 179.3 MB,\n  ExternalSorterMerge[5]#592(can spill: false) consumed 176.1 MB,\n  ExternalSorterMerge[4]#590(can spill: false) consumed 175.3 MB,\n  ExternalSorterMerge[0]#582(can spill: false) consumed 173.0 MB.\nError: Failed to allocate additional 248.1 KB for ExternalSorterMerge[8] with 0.0 B already allocated for this reservation - 37.8 KB remain available for the total pool")
```
Other than that I left some minor suggestions, looking forward to your feedback!
```diff
 let mut stream = execute_stream(physical_plan.clone(), state.task_ctx())?;
 while let Some(batch) = stream.next().await {
-    row_count += batch.unwrap().num_rows();
+    row_count += batch?.num_rows();
```
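The diff above is the fix for the panic in the log: each item the stream yields is a `Result`, and `unwrap()` on an `Err` aborts the whole benchmark binary. A minimal sketch (types are illustrative stand-ins, not DataFusion's) of the difference:

```rust
// Illustrative stand-ins for the stream items; the real code yields
// Result<RecordBatch, DataFusionError> from execute_stream.
#[derive(Debug, PartialEq)]
struct Error(String);

struct Batch {
    rows: usize,
}

// With `?`, the first Err is returned to the caller instead of
// panicking, so the runner can mark the query failed and move on.
fn count_rows(stream: Vec<Result<Batch, Error>>) -> Result<usize, Error> {
    let mut row_count = 0;
    for batch in stream {
        row_count += batch?.rows; // propagate the error instead of panicking
    }
    Ok(row_count)
}

fn main() {
    assert_eq!(count_rows(vec![Ok(Batch { rows: 2 }), Ok(Batch { rows: 3 })]), Ok(5));
    assert!(count_rows(vec![Err(Error("resources exhausted".into()))]).is_err());
}
```

The caller can then record the failure (e.g. via the PR's `mark_failed`) rather than having the process die mid-benchmark.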
I replaced unwrap() with ? so that sort_tpch does not terminate early. Thanks for catching that!

Thank you for the review. I've applied your suggestions, and I'll run it a few more times and check the output display once more before requesting a final review. If there are any other benchmarks I didn't touch that would benefit from similar changes, please let me know.
```rust
pub fn maybe_print_failures(&self) {
    let failed_queries: Vec<&str> = self
        .queries
        .iter()
        .filter_map(|q| (!q.success).then_some(q.query.as_str()))
        .collect();

    if !failed_queries.is_empty() {
        println!("Failed Queries: {}", failed_queries.join(", "));
    }
}
```
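The `filter_map` line above combines two steps: `bool::then_some` turns the "did it fail?" condition into an `Option`, and `filter_map` keeps only the `Some` values. A self-contained sketch of the same pattern (the free function and tuple shape are illustrative, not the PR's code):

```rust
// Keep only the labels of failed entries, using the same
// filter_map + bool::then_some idiom as maybe_print_failures.
fn failed_labels(results: &[(String, bool)]) -> Vec<&str> {
    results
        .iter()
        .filter_map(|(name, success)| (!*success).then_some(name.as_str()))
        .collect()
}

fn main() {
    let results = vec![
        ("Q1".to_string(), true),
        ("Q2".to_string(), false),
        ("Q3".to_string(), false),
    ];
    assert_eq!(failed_labels(&results).join(", "), "Q2, Q3");
}
```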
Since we call q.query.as_str() here, the expected output varies between benchmarks.
For example:
```text
// In sort_tpch
Failed Queries: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
// In clickbench
Failed Queries: Query 4, Query 5, Query 7
```
2010YOUY01 left a comment:
Great! Thanks again.
benchmarks/src/util/run.rs (outdated)
```rust
if let Some(idx) = self.current_case {
    self.queries[idx].success = false;
} else {
    panic!("Cannot mark failure: no current case");
}
```
Suggested change:
```diff
-    panic!("Cannot mark failure: no current case");
+    unreachable!("Cannot mark failure: no current case");
```
panic! is usually used for errors that are possible but unrecoverable; unreachable! is better for this case (it is just an assertion about something logically impossible).
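The invariant behind the suggestion can be sketched as follows (a simplified stand-in; the field names are assumptions, not the PR's exact struct): `current_case` is always `Some` while a query is running, so the `None` arm is reachable only through an internal logic bug, which is exactly what `unreachable!` documents.

```rust
// Simplified stand-in for the runner state in benchmarks/src/util/run.rs.
struct BenchmarkRun {
    success_flags: Vec<bool>,
    current_case: Option<usize>,
}

impl BenchmarkRun {
    // current_case is set before any query executes, so the else arm
    // encodes an invariant violation rather than a recoverable error.
    fn mark_failed(&mut self) {
        if let Some(idx) = self.current_case {
            self.success_flags[idx] = false;
        } else {
            unreachable!("Cannot mark failure: no current case");
        }
    }
}

fn main() {
    let mut run = BenchmarkRun { success_flags: vec![true, true], current_case: Some(1) };
    run.mark_failed();
    assert_eq!(run.success_flags, vec![true, false]);
}
```

Both macros abort the thread at runtime; the difference is intent, and `unreachable!` reads as "this branch cannot happen by construction" to future maintainers.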
Updated :) Thank you
* Move struct QueryResult to util/run.rs
* Modify benches to continue query execution even on failure
* Mark benchmark query success on output json
Which issue does this PR close?
Rationale for this change
To track whether certain queries fail under limited memory, we need to continue executing benchmark queries even when a previous one fails. It would also be better to mark and visualize the failures with compare.py.
What changes are included in this PR?
- tpch, clickbench, and sort_tpch execute the next query when the current query fails on any iteration. (The remaining iterations of a failed query are not executed.)
- A success field is added to the output JSON, and compare.py is updated to compare the elapsed times of successful results only, not failed ones.
Are these changes tested?
Since it's a benchmark suite, I tested it manually.
Are there any user-facing changes?