bug: TPCH 18 query hangs #21625

@Omega359

Describe the bug

On my machine:

OS: Fedora Linux 43 (KDE Plasma Desktop Edition) x86_64
Kernel: Linux 6.19.11-200.fc43.x86_64
CPU: AMD Ryzen AI 9 HX 370 (24) @ 5.16 GHz
GPU: AMD Radeon 890M Graphics [Integrated]
Memory: 11.96 GiB / 86.02 GiB (14%)
Swap: 0 B / 8.00 GiB (0%)

with an up-to-date Rust toolchain:

$ rustup show
Default host: x86_64-unknown-linux-gnu
rustup home:  /home/bruce/.rustup

installed toolchains
--------------------
stable-x86_64-unknown-linux-gnu (default)
nightly-x86_64-unknown-linux-gnu
1.91.0-x86_64-unknown-linux-gnu
1.92.0-x86_64-unknown-linux-gnu
1.93.0-x86_64-unknown-linux-gnu
1.94.0-x86_64-unknown-linux-gnu (active)

active toolchain
----------------
name: 1.94.0-x86_64-unknown-linux-gnu
active because: overridden by '/home/bruce/dev/datafusion2/rust-toolchain.toml'
installed targets:
  x86_64-unknown-linux-gnu

Running the following against main hangs:
cd benchmarks; ./bench.sh data tpch; ./bench.sh run tpch 18

$ ./bench.sh run tpch 18
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: tpch
QUERY: 18
DATAFUSION_DIR: /home/bruce/dev/datafusion2/benchmarks/..
BRANCH_NAME: HEAD
DATA_DIR: /home/bruce/dev/datafusion2/benchmarks/data
RESULTS_DIR: /home/bruce/dev/datafusion2/benchmarks/results/HEAD
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
SIMULATE_LATENCY: false
***************************
RESULTS_FILE: /home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json
Running tpch benchmark...
+ cargo run --release --bin dfbench -- tpch --iterations 5 --path /home/bruce/dev/datafusion2/benchmarks/data/tpch_sf1 --prefer_hash_join true --format parquet -o /home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json --query 18
    Finished `release` profile [optimized] target(s) in 0.11s
     Running `/home/bruce/dev/datafusion2/target/release/dfbench tpch --iterations 5 --path /home/bruce/dev/datafusion2/benchmarks/data/tpch_sf1 --prefer_hash_join true --format parquet -o /home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json --query 18`
Running benchmarks with the following options: RunOpt { query: Some(18), common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false, simulate_latency: false }, path: "/home/bruce/dev/datafusion2/benchmarks/data/tpch_sf1", file_format: "parquet", mem_table: false, output_path: Some("/home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json"), disable_statistics: false, prefer_hash_join: true, enable_piecewise_merge_join: false, sorted: false, hash_join_buffering_capacity: 0 }

git bisect points to this commit as the cause: the benchmark completes at the commit immediately before it and hangs at this one.
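For anyone who wants to repeat the bisect, a hang has to be turned into an exit code before `git bisect run` can classify a commit. The following is a minimal sketch, assuming GNU `timeout` is available; the function name `bisect_verdict`, the file name `bisect_verdict.sh`, and the 900-second limit are all hypothetical choices for illustration and are not part of the DataFusion repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the DataFusion repo): run a command under a
# time limit and translate the outcome into a `git bisect run` exit code.
#   0   -> good  (command finished successfully within the limit)
#   1   -> bad   (command was stopped by `timeout`, i.e. it hung)
#   125 -> skip  (any other failure, e.g. the commit does not build)
bisect_verdict() {
    local limit="$1"
    shift
    timeout "$limit" "$@" >/dev/null 2>&1
    case "$?" in
        0)   return 0 ;;
        124) return 1 ;;    # 124 is the status GNU timeout reports when the limit is hit
        *)   return 125 ;;
    esac
}

# Assumed usage, with the reproduction command from this report. The limit must
# also cover the release build that bench.sh triggers at each commit:
#   git bisect start <bad-commit> <good-commit>
#   git bisect run bash -c '. ./bisect_verdict.sh && cd benchmarks && bisect_verdict 900 ./bench.sh run tpch 18'
```

Mapping every non-timeout failure to 125 is a deliberate choice in this sketch: it keeps commits that merely fail to build from being blamed for the hang.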

If prefer_hash_join is disabled, the query runs as expected:

PREFER_HASH_JOIN=false ./bench.sh run tpch 18
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: tpch
QUERY: 18
DATAFUSION_DIR: /home/bruce/dev/datafusion2/benchmarks/..
BRANCH_NAME: HEAD
DATA_DIR: /home/bruce/dev/datafusion2/benchmarks/data
RESULTS_DIR: /home/bruce/dev/datafusion2/benchmarks/results/HEAD
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: false
SIMULATE_LATENCY: false
***************************
RESULTS_FILE: /home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json
Running tpch benchmark...
+ cargo run --release --bin dfbench -- tpch --iterations 5 --path /home/bruce/dev/datafusion2/benchmarks/data/tpch_sf1 --prefer_hash_join false --format parquet -o /home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json --query 18
    Finished `release` profile [optimized] target(s) in 0.15s
     Running `/home/bruce/dev/datafusion2/target/release/dfbench tpch --iterations 5 --path /home/bruce/dev/datafusion2/benchmarks/data/tpch_sf1 --prefer_hash_join false --format parquet -o /home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json --query 18`
Running benchmarks with the following options: RunOpt { query: Some(18), common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false, simulate_latency: false }, path: "/home/bruce/dev/datafusion2/benchmarks/data/tpch_sf1", file_format: "parquet", mem_table: false, output_path: Some("/home/bruce/dev/datafusion2/benchmarks/results/HEAD/tpch_sf1.json"), disable_statistics: false, prefer_hash_join: false, enable_piecewise_merge_join: false, sorted: false, hash_join_buffering_capacity: 0 }
Query 18 iteration 0 took 206.5 ms and returned 57 rows
Query 18 iteration 1 took 190.0 ms and returned 57 rows
Query 18 iteration 2 took 188.5 ms and returned 57 rows
Query 18 iteration 3 took 185.0 ms and returned 57 rows
Query 18 iteration 4 took 192.1 ms and returned 57 rows
Query 18 avg time: 192.42 ms
+ set +x
Done

To Reproduce

This seems to be machine- or OS-specific; I have been unable to reproduce it on other machines.
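Since the hang appears to depend on the machine, it may help to collect the same environment facts on a machine where the query succeeds and compare them with the ones listed above. This is a rough sketch only: the function name `collect_env` is hypothetical, and the choice of fields is an assumption that mirrors what this report lists (OS, kernel, CPU count, Rust toolchain). The core count is included because the run shows `partitions: None`, which presumably leaves the degree of parallelism to a machine-dependent default.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print environment facts that might differ between a
# machine where TPCH query 18 hangs and one where it completes.
collect_env() {
    echo "kernel: $(uname -sr)"
    echo "arch: $(uname -m)"
    echo "cpus: $(nproc)"
    if [ -r /etc/os-release ]; then
        # PRETTY_NAME is the human-readable distribution name.
        echo "os: $(. /etc/os-release && echo "$PRETTY_NAME")"
    fi
    if command -v rustc >/dev/null 2>&1; then
        echo "rustc: $(rustc --version)"
    else
        echo "rustc: not found"
    fi
}

collect_env
```

Saving the output on each machine and running `diff` on the two files would show at a glance which of these fields differ.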

Expected behavior

No response

Additional context

No response
