colinmarc/iceberg-datafusion-benchmarks


Iceberg-rust benchmarks

This is a set of benchmarks comparing datafusion scans using iceberg-rust to basic parquet scans (using ListingTable) and optionally pyiceberg.

Running the benchmarks

You need the following installed:

uv
rust

Then you can run the benchmarks:

cargo bench

Or, to include the pyiceberg comparison (labelled pyarrow in the results below):

uv run --python 3.12 --with 'pyiceberg[s3fs,pyarrow]' \
  cargo bench --features pyiceberg

Alternatively, you can just use mise, which also installs the tooling for you:

mise run benchmarks
mise run benchmarks-pyiceberg

AWS Credentials

Running the Python comparison requires AWS credentials in the environment. The bucket is public, though, so credentials for any principal will do.

Results

I ran the benchmarks on an m7i.16xlarge (25 Gbps NIC, oodles of CPU/mem). The results below have been lightly reformatted for clarity.
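A note on reading the output: each time line gives a lower bound, an estimate, and an upper bound, and each thrpt line is the row count from the header divided by those times. The slowest time yields the lowest throughput, so the thrpt triple runs in the opposite order to the time triple. The following is a minimal sketch of that arithmetic, not code from the benchmark suite; the function name `throughput` is made up for illustration, and the figures are copied from the tpch_10.customer full scan.

```rust
/// Elements (rows) per second: the quantity printed on each `thrpt` line.
fn throughput(rows: u64, seconds: f64) -> f64 {
    rows as f64 / seconds
}

fn main() {
    // Row count from the "full scan tpch_10.customer" header.
    let rows = 1_500_000;
    // [lower bound, estimate, upper bound] from the iceberg-rust `time` line.
    let time = [1.7771, 1.9306, 2.1040];

    for secs in time {
        // Printed in Kelem/s to match the units used in the output.
        println!("{:>8.2} Kelem/s at {} s", throughput(rows, secs) / 1e3, secs);
    }
}
```

The printed values can differ from the reported ones in the last digit, because the times shown here are already rounded. For the filter and limit benchmarks, the same division applies to the "rows after filter" and limit counts respectively.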

SELECT * FROM tpch_xx.customer

====== full scan tpch_10.customer (1500000 rows, 1 files) =======
scan/iceberg-rust/tpch_10.customer
                        time:   [1.7771 s 1.9306 s 2.1040 s]
                        thrpt:  [712.93 Kelem/s 776.95 Kelem/s 844.09 Kelem/s]
scan/datafusion-parquet/tpch_10.customer
                        time:   [964.09 ms 971.88 ms 982.88 ms]
                        thrpt:  [1.5261 Melem/s 1.5434 Melem/s 1.5559 Melem/s]
scan/pyarrow/tpch_10.customer
                        time:   [1.1637 s 1.2835 s 1.4237 s]
                        thrpt:  [1.0536 Melem/s 1.1687 Melem/s 1.2890 Melem/s]
====== full scan tpch_30.customer (4500000 rows, 3 files) =======
scan/iceberg-rust/tpch_30.customer
                        time:   [3.2304 s 3.3947 s 3.5772 s]
                        thrpt:  [1.2580 Melem/s 1.3256 Melem/s 1.3930 Melem/s]
scan/datafusion-parquet/tpch_30.customer
                        time:   [1.0234 s 1.0300 s 1.0370 s]
                        thrpt:  [4.3395 Melem/s 4.3690 Melem/s 4.3973 Melem/s]
scan/pyarrow/tpch_30.customer
                        time:   [2.0735 s 2.0880 s 2.1040 s]
                        thrpt:  [2.1388 Melem/s 2.1551 Melem/s 2.1702 Melem/s]
====== full scan tpch_50.customer (7500000 rows, 5 files) =======
scan/iceberg-rust/tpch_50.customer
                        time:   [3.6008 s 3.7089 s 3.8237 s]
                        thrpt:  [1.9615 Melem/s 2.0222 Melem/s 2.0828 Melem/s]
scan/datafusion-parquet/tpch_50.customer
                        time:   [1.0703 s 1.1059 s 1.1458 s]
                        thrpt:  [6.5455 Melem/s 6.7815 Melem/s 7.0077 Melem/s]
scan/pyarrow/tpch_50.customer
                        time:   [3.0151 s 3.2415 s 3.3664 s]
                        thrpt:  [2.2279 Melem/s 2.3137 Melem/s 2.4875 Melem/s]
====== full scan tpch_100.customer (15000000 rows, 11 files) =======
scan/iceberg-rust/tpch_100.customer
                        time:   [4.6857 s 4.7790 s 4.8644 s]
                        thrpt:  [3.0837 Melem/s 3.1387 Melem/s 3.2012 Melem/s]
scan/datafusion-parquet/tpch_100.customer
                        time:   [1.0757 s 1.1652 s 1.2894 s]
                        thrpt:  [11.633 Melem/s 12.873 Melem/s 13.944 Melem/s]
scan/pyarrow/tpch_100.customer
                        time:   [3.9924 s 4.2544 s 4.6072 s]
                        thrpt:  [3.2558 Melem/s 3.5258 Melem/s 3.7571 Melem/s]

SELECT l_orderkey FROM tpch_xx.lineitem

====== projection scan tpch_10.lineitem (59986052 rows, 31 files) =======
scan-proj/iceberg-rust/tpch_10.lineitem
                        time:   [520.90 ms 571.35 ms 638.16 ms]
                        thrpt:  [93.999 Melem/s 104.99 Melem/s 115.16 Melem/s]
scan-proj/datafusion-parquet/tpch_10.lineitem
                        time:   [643.92 ms 804.56 ms 992.92 ms]
                        thrpt:  [60.414 Melem/s 74.558 Melem/s 93.157 Melem/s]
scan-proj/pyarrow/tpch_10.lineitem
                        time:   [4.8085 s 4.9450 s 5.1402 s]
                        thrpt:  [11.670 Melem/s 12.131 Melem/s 12.475 Melem/s]
====== projection scan tpch_30.lineitem (179998372 rows, 73 files) =======
scan-proj/iceberg-rust/tpch_30.lineitem
                        time:   [1.3909 s 1.4680 s 1.5577 s]
                        thrpt:  [115.55 Melem/s 122.61 Melem/s 129.41 Melem/s]
scan-proj/datafusion-parquet/tpch_30.lineitem
                        time:   [1.0851 s 1.2728 s 1.4682 s]
                        thrpt:  [122.59 Melem/s 141.42 Melem/s 165.88 Melem/s]
scan-proj/pyarrow/tpch_30.lineitem
                        time:   [14.964 s 15.105 s 15.248 s]
                        thrpt:  [11.805 Melem/s 11.917 Melem/s 12.029 Melem/s]
====== projection scan tpch_50.lineitem (300005811 rows, 126 files) =======
scan-proj/iceberg-rust/tpch_50.lineitem
                        time:   [2.1721 s 2.2214 s 2.2856 s]
                        thrpt:  [131.26 Melem/s 135.05 Melem/s 138.12 Melem/s]
scan-proj/datafusion-parquet/tpch_50.lineitem
                        time:   [1.4452 s 1.6551 s 1.8564 s]
                        thrpt:  [161.61 Melem/s 181.26 Melem/s 207.59 Melem/s]
scan-proj/pyarrow/tpch_50.lineitem
                        time:   [24.140 s 24.689 s 25.438 s]
                        thrpt:  [11.793 Melem/s 12.151 Melem/s 12.428 Melem/s]

SELECT l_orderkey, l_quantity FROM tpch_xx.lineitem WHERE l_quantity > 20

====== filter scan tpch_1.lineitem (3602799 rows after filter, 3 files) =======
scan-filter-proj/iceberg-rust/tpch_1.lineitem
                        time:   [403.70 ms 414.07 ms 426.11 ms]
                        thrpt:  [8.4551 Melem/s 8.7009 Melem/s 8.9245 Melem/s]
scan-filter-proj/datafusion-parquet/tpch_1.lineitem
                        time:   [346.64 ms 435.62 ms 550.76 ms]
                        thrpt:  [6.5416 Melem/s 8.2706 Melem/s 10.394 Melem/s]
scan-filter-proj/pyarrow/tpch_1.lineitem
                        time:   [1.8434 s 1.9611 s 2.0237 s]
                        thrpt:  [1.7803 Melem/s 1.8371 Melem/s 1.9544 Melem/s]
====== filter scan tpch_10.lineitem (35995193 rows after filter, 31 files) =======
scan-filter-proj/iceberg-rust/tpch_10.lineitem
                        time:   [1.5958 s 1.6472 s 1.7367 s]
                        thrpt:  [20.727 Melem/s 21.852 Melem/s 22.556 Melem/s]
scan-filter-proj/datafusion-parquet/tpch_10.lineitem
                        time:   [507.90 ms 574.33 ms 650.90 ms]
                        thrpt:  [55.301 Melem/s 62.674 Melem/s 70.870 Melem/s]
scan-filter-proj/pyarrow/tpch_10.lineitem
                        time:   [11.458 s 11.528 s 11.603 s]
                        thrpt:  [3.1022 Melem/s 3.1223 Melem/s 3.1414 Melem/s]
====== filter scan tpch_30.lineitem (108004317 rows after filter, 73 files) =======
scan-filter-proj/iceberg-rust/tpch_30.lineitem
                        time:   [4.7494 s 4.8003 s 4.8586 s]
                        thrpt:  [22.230 Melem/s 22.500 Melem/s 22.741 Melem/s]
scan-filter-proj/datafusion-parquet/tpch_30.lineitem
                        time:   [847.40 ms 978.37 ms 1.1115 s]
                        thrpt:  [97.172 Melem/s 110.39 Melem/s 127.45 Melem/s]
scan-filter-proj/pyarrow/tpch_30.lineitem
                        time:   [35.479 s 35.661 s 35.877 s]
                        thrpt:  [3.0104 Melem/s 3.0287 Melem/s 3.0441 Melem/s]

SELECT * FROM taxi_fhvhv WHERE pickup_datetime > xxx

====== partition scan default.taxi_fhvhv_partitioned (19132122 rows after filter, 3283 files) =======
scan-filter-part/iceberg-rust/2023-07-01 00:00:00Z
                        time:   [202.33 s 202.60 s 202.87 s]
                        thrpt:  [94.308 Kelem/s 94.433 Kelem/s 94.557 Kelem/s]
scan-filter-part/datafusion-parquet/2023-07-01 00:00:00Z
                        time:   [12.346 s 13.410 s 15.301 s]
                        thrpt:  [1.2504 Melem/s 1.4267 Melem/s 1.5497 Melem/s]
scan-filter-part/pyarrow/2023-07-01T00:00:00+00:00
                        time:   [1.5906 s 1.6076 s 1.6228 s]
                        thrpt:  [11.789 Melem/s 11.901 Melem/s 12.028 Melem/s]

Important

TODO: this benchmark did not complete in a reasonable timeframe :(

SELECT * FROM tpch_100.lineitem LIMIT xxx

====== limit scan tpch_100.lineitem (100_000 rows, 248 files) ======
scan-limit/iceberg-rust/100_000
                        time:   [833.65 ms 845.67 ms 857.68 ms]
                        thrpt:  [116.59 Kelem/s 118.25 Kelem/s 119.95 Kelem/s]
scan-limit/datafusion-parquet/100_000
                        time:   [369.68 ms 389.58 ms 408.36 ms]
                        thrpt:  [244.88 Kelem/s 256.69 Kelem/s 270.50 Kelem/s]
scan-limit/pyarrow/100_000
                        time:   [7.7994 s 7.9748 s 8.1733 s]
                        thrpt:  [12.235 Kelem/s 12.539 Kelem/s 12.822 Kelem/s]
====== limit scan tpch_100.lineitem (1_000_000 rows, 248 files) ======
scan-limit/iceberg-rust/1_000_000
                        time:   [1.1342 s 1.1620 s 1.1907 s]
                        thrpt:  [839.82 Kelem/s 860.62 Kelem/s 881.69 Kelem/s]
scan-limit/datafusion-parquet/1_000_000
                        time:   [550.65 ms 556.58 ms 562.48 ms]
                        thrpt:  [1.7778 Melem/s 1.7967 Melem/s 1.8160 Melem/s]
scan-limit/pyarrow/1_000_000
                        time:   [7.7004 s 7.8288 s 7.9564 s]
                        thrpt:  [125.68 Kelem/s 127.73 Kelem/s 129.86 Kelem/s]
====== limit scan tpch_100.lineitem (10_000_000 rows, 248 files) ======
scan-limit/iceberg-rust/10_000_000
                        time:   [3.0728 s 3.1072 s 3.1394 s]
                        thrpt:  [3.1853 Melem/s 3.2183 Melem/s 3.2543 Melem/s]
scan-limit/datafusion-parquet/10_000_000
                        time:   [684.89 ms 706.70 ms 737.87 ms]
                        thrpt:  [13.553 Melem/s 14.150 Melem/s 14.601 Melem/s]
scan-limit/pyarrow/10_000_000
                        time:   [8.4940 s 8.6290 s 8.7932 s]
                        thrpt:  [1.1372 Melem/s 1.1589 Melem/s 1.1773 Melem/s]
====== limit scan tpch_100.lineitem (100_000_000 rows, 248 files) ======
scan-limit/iceberg-rust/100_000_000
                        time:   [20.476 s 20.550 s 20.673 s]
                        thrpt:  [4.8371 Melem/s 4.8661 Melem/s 4.8838 Melem/s]
scan-limit/datafusion-parquet/100_000_000
                        time:   [1.8163 s 1.8976 s 1.9770 s]
                        thrpt:  [50.582 Melem/s 52.699 Melem/s 55.058 Melem/s]
scan-limit/pyarrow/100_000_000
                        time:   [17.738 s 17.859 s 17.977 s]
                        thrpt:  [5.5627 Melem/s 5.5993 Melem/s 5.6375 Melem/s]
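To compare engines at a glance, it can help to divide one estimate by another. The sketch below does that for a few iceberg-rust and datafusion-parquet estimates copied from the results above. The helper `slowdown` is a made-up name, and the ratio uses only the middle estimates, so it ignores the spread in each measurement.

```rust
/// How many times longer `candidate` took than `baseline` (both in seconds).
fn slowdown(candidate_secs: f64, baseline_secs: f64) -> f64 {
    candidate_secs / baseline_secs
}

fn main() {
    // (label, iceberg-rust estimate, datafusion-parquet estimate), in seconds,
    // taken from the middle value of each `time` line.
    let cases = [
        ("scan tpch_100.customer", 4.7790, 1.1652),
        ("scan-proj tpch_50.lineitem", 2.2214, 1.6551),
        ("scan-limit 100_000_000", 20.550, 1.8976),
    ];

    for (label, iceberg, parquet) in cases {
        println!(
            "{}: iceberg-rust took {:.2}x the datafusion-parquet time",
            label,
            slowdown(iceberg, parquet)
        );
    }
}
```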
