feat: Support tpch and tpch10 benchmark for csv format #16373

Merged: 1 commit, Jun 12, 2025

@zhuqi-lucas (Contributor) commented Jun 11, 2025

Which issue does this PR close?

Rationale for this change

  1. Generate TPC-H data in CSV format.
  2. Support running the tpch and tpch10 benchmarks against CSV data.

What changes are included in this PR?

Are these changes tested?

Yes. Data generation, followed by a benchmark run against the CSV data:

./benchmarks/bench.sh data tpch
***************************
DataFusion Benchmark Runner and Data Generator
COMMAND: data
BENCHMARK: tpch
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
Creating tpch dataset at Scale Factor 1 in /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1...
 tbl files exist (/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/supplier.tbl exists).
 Expected answers exist (/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/answers/q1.out exists).
 parquet files exist (/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/supplier exists).
 creating csv files using benchmark binary ...
    Finished `release` profile [optimized] target(s) in 0.13s
     Running `/Users/zhuqi/arrow-datafusion/target/release/tpch convert --input /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1 --output /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv --format csv`
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/part.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/part'
Conversion completed in 28 ms
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/supplier.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/supplier'
Conversion completed in 5 ms
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/partsupp.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/partsupp'
Conversion completed in 43 ms
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/customer.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/customer'
Conversion completed in 13 ms
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/orders.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/orders'
Conversion completed in 129 ms
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/lineitem.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/lineitem'
Conversion completed in 842 ms
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/nation.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/nation'
Conversion completed in 0 ms
Converting '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/region.tbl' to csv files in directory '/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1/csv/region'
Conversion completed in 0 ms
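
To sanity-check the conversion, something like the sketch below could confirm that each of the eight tables ended up with a non-empty directory under `csv/`. The `csv/<table>` layout is taken from the log above; the function name `check_tpch_csv` and the check itself are assumptions for illustration and are not part of `bench.sh`.

```shell
#!/usr/bin/env bash
# Sketch only: verify that every TPC-H table has a non-empty CSV directory.
# The table names and the csv/<table> layout come from the conversion log;
# check_tpch_csv is a hypothetical helper, not something bench.sh provides.

TPCH_TABLES="part supplier partsupp customer orders lineitem nation region"

# check_tpch_csv DIR
# Prints one "missing: <table>" line per absent or empty table directory
# and returns non-zero if any were found.
check_tpch_csv() {
    local csv_dir="$1"
    local status=0
    local table
    for table in $TPCH_TABLES; do
        # ls -A prints nothing for an empty or nonexistent directory
        if [ -z "$(ls -A "$csv_dir/$table" 2>/dev/null)" ]; then
            echo "missing: $table"
            status=1
        fi
    done
    return "$status"
}
```

For the run above it would be called as `check_tpch_csv benchmarks/data/tpch_sf1/csv && echo "all tables present"`.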




./benchmarks/bench.sh run tpch_csv
***************************
DataFusion Benchmark Script
COMMAND: run
BENCHMARK: tpch_csv
DATAFUSION_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/..
BRANCH_NAME: issue_16370
DATA_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/data
RESULTS_DIR: /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_16370
CARGO_COMMAND: cargo run --release
PREFER_HASH_JOIN: true
***************************
RESULTS_FILE: /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_16370/tpch_sf1.json
Running tpch benchmark...
+ cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1 --prefer_hash_join true --format csv -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_16370/tpch_sf1.json
   Compiling datafusion-benchmarks v48.0.0 (/Users/zhuqi/arrow-datafusion/benchmarks)
    Finished `release` profile [optimized] target(s) in 4m 09s
     Running `/Users/zhuqi/arrow-datafusion/target/release/tpch benchmark datafusion --iterations 5 --path /Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1 --prefer_hash_join true --format csv -o /Users/zhuqi/arrow-datafusion/benchmarks/results/issue_16370/tpch_sf1.json`
Running benchmarks with the following options: RunOpt { query: None, common: CommonOpt { iterations: 5, partitions: None, batch_size: None, mem_pool_type: "fair", memory_limit: None, sort_spill_reservation_bytes: None, debug: false }, path: "/Users/zhuqi/arrow-datafusion/benchmarks/data/tpch_sf1", file_format: "csv", mem_table: false, output_path: Some("/Users/zhuqi/arrow-datafusion/benchmarks/results/issue_16370/tpch_sf1.json"), disable_statistics: false, prefer_hash_join: true, sorted: false }
Query 1 iteration 0 took 180.5 ms and returned 4 rows
Query 1 iteration 1 took 139.6 ms and returned 4 rows
Query 1 iteration 2 took 140.9 ms and returned 4 rows
Query 1 iteration 3 took 149.2 ms and returned 4 rows
Query 1 iteration 4 took 150.5 ms and returned 4 rows
Query 1 avg time: 152.13 ms
Query 2 iteration 0 took 107.7 ms and returned 100 rows
Query 2 iteration 1 took 56.1 ms and returned 100 rows
Query 2 iteration 2 took 58.8 ms and returned 100 rows
Query 2 iteration 3 took 60.3 ms and returned 100 rows
Query 2 iteration 4 took 59.4 ms and returned 100 rows
Query 2 avg time: 68.46 ms
Query 3 iteration 0 took 215.4 ms and returned 10 rows
Query 3 iteration 1 took 160.1 ms and returned 10 rows
Query 3 iteration 2 took 159.1 ms and returned 10 rows
Query 3 iteration 3 took 159.7 ms and returned 10 rows
Query 3 iteration 4 took 145.5 ms and returned 10 rows
Query 3 avg time: 167.95 ms
Query 4 iteration 0 took 136.1 ms and returned 5 rows
Query 4 iteration 1 took 149.2 ms and returned 5 rows
Query 4 iteration 2 took 150.4 ms and returned 5 rows
Query 4 iteration 3 took 141.0 ms and returned 5 rows
Query 4 iteration 4 took 146.6 ms and returned 5 rows
Query 4 avg time: 144.65 ms
Query 5 iteration 0 took 157.7 ms and returned 5 rows
Query 5 iteration 1 took 174.8 ms and returned 5 rows
Query 5 iteration 2 took 161.4 ms and returned 5 rows
Query 5 iteration 3 took 166.6 ms and returned 5 rows
Query 5 iteration 4 took 160.1 ms and returned 5 rows
Query 5 avg time: 164.10 ms
Query 6 iteration 0 took 107.9 ms and returned 1 rows
Query 6 iteration 1 took 108.6 ms and returned 1 rows
Query 6 iteration 2 took 104.4 ms and returned 1 rows
Query 6 iteration 3 took 101.9 ms and returned 1 rows
Query 6 iteration 4 took 105.6 ms and returned 1 rows
Query 6 avg time: 105.67 ms
Query 7 iteration 0 took 185.4 ms and returned 4 rows
Query 7 iteration 1 took 181.6 ms and returned 4 rows
Query 7 iteration 2 took 195.2 ms and returned 4 rows
Query 7 iteration 3 took 184.4 ms and returned 4 rows
Query 7 iteration 4 took 190.6 ms and returned 4 rows
Query 7 avg time: 187.46 ms
Query 8 iteration 0 took 173.1 ms and returned 2 rows
Query 8 iteration 1 took 189.2 ms and returned 2 rows
Query 8 iteration 2 took 195.0 ms and returned 2 rows
Query 8 iteration 3 took 168.1 ms and returned 2 rows
Query 8 iteration 4 took 178.3 ms and returned 2 rows
Query 8 avg time: 180.74 ms
Query 9 iteration 0 took 197.3 ms and returned 175 rows
Query 9 iteration 1 took 198.0 ms and returned 175 rows
Query 9 iteration 2 took 209.2 ms and returned 175 rows
Query 9 iteration 3 took 197.1 ms and returned 175 rows
Query 9 iteration 4 took 198.7 ms and returned 175 rows
Query 9 avg time: 200.05 ms
Query 10 iteration 0 took 151.8 ms and returned 20 rows
Query 10 iteration 1 took 151.8 ms and returned 20 rows
Query 10 iteration 2 took 145.3 ms and returned 20 rows
Query 10 iteration 3 took 152.6 ms and returned 20 rows
Query 10 iteration 4 took 146.4 ms and returned 20 rows
Query 10 avg time: 149.59 ms
Query 11 iteration 0 took 38.7 ms and returned 1048 rows
Query 11 iteration 1 took 38.2 ms and returned 1048 rows
Query 11 iteration 2 took 36.3 ms and returned 1048 rows
Query 11 iteration 3 took 38.1 ms and returned 1048 rows
Query 11 iteration 4 took 37.6 ms and returned 1048 rows
Query 11 avg time: 37.76 ms
Query 12 iteration 0 took 157.6 ms and returned 2 rows
Query 12 iteration 1 took 160.4 ms and returned 2 rows
Query 12 iteration 2 took 151.1 ms and returned 2 rows
Query 12 iteration 3 took 154.6 ms and returned 2 rows
Query 12 iteration 4 took 156.4 ms and returned 2 rows
Query 12 avg time: 156.03 ms
Query 13 iteration 0 took 36.5 ms and returned 42 rows
Query 13 iteration 1 took 42.3 ms and returned 42 rows
Query 13 iteration 2 took 41.6 ms and returned 42 rows
Query 13 iteration 3 took 40.7 ms and returned 42 rows
Query 13 iteration 4 took 39.4 ms and returned 42 rows
Query 13 avg time: 40.10 ms
Query 14 iteration 0 took 123.3 ms and returned 1 rows
Query 14 iteration 1 took 113.5 ms and returned 1 rows
Query 14 iteration 2 took 115.6 ms and returned 1 rows
Query 14 iteration 3 took 114.0 ms and returned 1 rows
Query 14 iteration 4 took 112.8 ms and returned 1 rows
Query 14 avg time: 115.81 ms
Query 15 iteration 0 took 226.6 ms and returned 1 rows
Query 15 iteration 1 took 228.9 ms and returned 1 rows
Query 15 iteration 2 took 224.0 ms and returned 1 rows
Query 15 iteration 3 took 215.3 ms and returned 1 rows
Query 15 iteration 4 took 222.8 ms and returned 1 rows
Query 15 avg time: 223.50 ms
Query 16 iteration 0 took 31.6 ms and returned 18314 rows
Query 16 iteration 1 took 32.7 ms and returned 18314 rows
Query 16 iteration 2 took 30.8 ms and returned 18314 rows
Query 16 iteration 3 took 29.7 ms and returned 18314 rows
Query 16 iteration 4 took 31.1 ms and returned 18314 rows
Query 16 avg time: 31.16 ms
Query 17 iteration 0 took 232.8 ms and returned 1 rows
Query 17 iteration 1 took 237.4 ms and returned 1 rows
Query 17 iteration 2 took 233.0 ms and returned 1 rows
Query 17 iteration 3 took 227.6 ms and returned 1 rows
Query 17 iteration 4 took 224.8 ms and returned 1 rows
Query 17 avg time: 231.12 ms
Query 18 iteration 0 took 293.1 ms and returned 57 rows
Query 18 iteration 1 took 300.8 ms and returned 57 rows
Query 18 iteration 2 took 288.7 ms and returned 57 rows
Query 18 iteration 3 took 302.8 ms and returned 57 rows
Query 18 iteration 4 took 271.0 ms and returned 57 rows
Query 18 avg time: 291.28 ms
Query 19 iteration 0 took 125.4 ms and returned 1 rows
Query 19 iteration 1 took 121.7 ms and returned 1 rows
Query 19 iteration 2 took 118.0 ms and returned 1 rows
Query 19 iteration 3 took 135.7 ms and returned 1 rows
Query 19 iteration 4 took 117.1 ms and returned 1 rows
Query 19 avg time: 123.59 ms
Query 20 iteration 0 took 140.2 ms and returned 186 rows
Query 20 iteration 1 took 152.0 ms and returned 186 rows
Query 20 iteration 2 took 145.3 ms and returned 186 rows
Query 20 iteration 3 took 137.1 ms and returned 186 rows
Query 20 iteration 4 took 138.7 ms and returned 186 rows
Query 20 avg time: 142.67 ms
Query 21 iteration 0 took 416.1 ms and returned 100 rows
Query 21 iteration 1 took 403.9 ms and returned 100 rows
Query 21 iteration 2 took 416.0 ms and returned 100 rows
Query 21 iteration 3 took 405.2 ms and returned 100 rows
Query 21 iteration 4 took 417.6 ms and returned 100 rows
Query 21 avg time: 411.78 ms
Query 22 iteration 0 took 38.1 ms and returned 7 rows
Query 22 iteration 1 took 37.0 ms and returned 7 rows
Query 22 iteration 2 took 36.4 ms and returned 7 rows
Query 22 iteration 3 took 37.8 ms and returned 7 rows
Query 22 iteration 4 took 36.5 ms and returned 7 rows
Query 22 avg time: 37.15 ms
+ set +x
Done
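
For comparing runs, the per-query averages can be pulled out of a saved log with a small awk filter. This is a sketch that assumes the `Query N avg time: X ms` line format shown above; `summarize_tpch_log` is a hypothetical helper name, not an existing script in the repository.

```shell
#!/usr/bin/env bash
# Sketch only: summarize the "avg time" lines of a tpch benchmark log.
# Assumes the line format printed above, e.g. "Query 21 avg time: 411.78 ms".

# summarize_tpch_log
# Reads benchmark output on stdin and prints the number of queries,
# the sum of their reported averages, and the slowest query.
summarize_tpch_log() {
    awk '
        # Only the per-query summary lines match; iteration lines are skipped.
        $1 == "Query" && $3 == "avg" && $4 == "time:" {
            n++
            total += $5
            if ($5 > max) { max = $5; slowest = $2 }
        }
        END {
            if (n == 0) { print "no avg time lines found"; exit 1 }
            printf "queries: %d\n", n
            printf "sum of avg times: %.2f ms\n", total
            printf "slowest: Query %s (%.2f ms)\n", slowest, max
        }
    '
}
```

With the output saved to a file, `summarize_tpch_log < run.log` would report Query 21 (411.78 ms) as the slowest query for the run above.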

Are there any user-facing changes?

@zhuqi-lucas changed the title from "feat: Support tpch and tpch10 csv format" to "feat: Support tpch and tpch10 benchmark for csv format" on Jun 11, 2025
@2010YOUY01 (Contributor) left a comment

Tested locally and it LGTM, thanks!

@zhuqi-lucas (Contributor, Author) commented

Thank you @2010YOUY01 for the review!

@alamb alamb merged commit 31c570e into apache:main Jun 12, 2025
30 checks passed
@alamb (Contributor) commented Jun 12, 2025

Nice -- thank you @zhuqi-lucas and @2010YOUY01

Successfully merging this pull request may close these issues.

Add tpch csv support to bench.sh