This is a set of benchmarks comparing datafusion scans using iceberg-rust
to basic parquet scans (using ListingTable
) and optionally pyiceberg
.
You need the following installed:
uv
rust
Then you can run the benchmarks:
cargo bench
Or, to run with the pyarrow comparison:
uv run --python 3.12 --with 'pyiceberg[s3fs,pyarrow]' \
cargo bench --features pyiceberg
Alternatively, you can just use mise, which also installs the tooling for you:
mise run benchmarks
mise run benchmarks-pyiceberg
Running the python comparison requires AWS credentials in the environment. However, the bucket is public; any principal is fine.
I ran the benchmarks on an m7i.16xlarge
(25g NIC, oodles of CPU/mem). Here are some results. The results have been lightly reformatted for clarity.
====== full scan tpch_10.customer (1500000 rows, 1 files) =======
scan/iceberg-rust/tpch_10.customer
time: [1.7771 s 1.9306 s 2.1040 s]
thrpt: [712.93 Kelem/s 776.95 Kelem/s 844.09 Kelem/s]
scan/datafusion-parquet/tpch_10.customer
time: [964.09 ms 971.88 ms 982.88 ms]
thrpt: [1.5261 Melem/s 1.5434 Melem/s 1.5559 Melem/s]
scan/pyarrow/tpch_10.customer
time: [1.1637 s 1.2835 s 1.4237 s]
thrpt: [1.0536 Melem/s 1.1687 Melem/s 1.2890 Melem/s]
====== full scan tpch_30.customer (4500000 rows, 3 files) =======
scancan/iceberg-rust/tpch_30.customer
time: [3.2304 s 3.3947 s 3.5772 s]
thrpt: [1.2580 Melem/s 1.3256 Melem/s 1.3930 Melem/s]
scan/datafusion-parquet/tpch_30.customer
time: [1.0234 s 1.0300 s 1.0370 s]
thrpt: [4.3395 Melem/s 4.3690 Melem/s 4.3973 Melem/s]
scan/pyarrow/tpch_30.customer
time: [2.0735 s 2.0880 s 2.1040 s]
thrpt: [2.1388 Melem/s 2.1551 Melem/s 2.1702 Melem/s]
====== full scan tpch_50.customer (7500000 rows, 5 files) =======
scan/iceberg-rust/tpch_50.customer
time: [3.6008 s 3.7089 s 3.8237 s]
thrpt: [1.9615 Melem/s 2.0222 Melem/s 2.0828 Melem/s]
scan/datafusion-parquet/tpch_50.customer
time: [1.0703 s 1.1059 s 1.1458 s]
thrpt: [6.5455 Melem/s 6.7815 Melem/s 7.0077 Melem/s]
scan/pyarrow/tpch_50.customer
time: [3.0151 s 3.2415 s 3.3664 s]
thrpt: [2.2279 Melem/s 2.3137 Melem/s 2.4875 Melem/s]
====== full scan tpch_100.customer (15000000 rows, 11 files) =======
scan/iceberg-rust/tpch_100.customer
time: [4.6857 s 4.7790 s 4.8644 s]
thrpt: [3.0837 Melem/s 3.1387 Melem/s 3.2012 Melem/s]
scan/datafusion-parquet/tpch_100.customer
time: [1.0757 s 1.1652 s 1.2894 s]
thrpt: [11.633 Melem/s 12.873 Melem/s 13.944 Melem/s]
scan/pyarrow/tpch_100.customer
time: [3.9924 s 4.2544 s 4.6072 s]
thrpt: [3.2558 Melem/s 3.5258 Melem/s 3.7571 Melem/s]
====== projection scan tpch_10.lineitem (59986052 rows, 31 files) =======
scan-proj/iceberg-rust/tpch_10.lineitem
time: [520.90 ms 571.35 ms 638.16 ms]
thrpt: [93.999 Melem/s 104.99 Melem/s 115.16 Melem/s]
scan-proj/datafusion-parquet/tpch_10.lineitem
time: [643.92 ms 804.56 ms 992.92 ms]
thrpt: [60.414 Melem/s 74.558 Melem/s 93.157 Melem/s]
scan-proj/pyarrow/tpch_10.lineitem
time: [4.8085 s 4.9450 s 5.1402 s]
thrpt: [11.670 Melem/s 12.131 Melem/s 12.475 Melem/s]
====== projection scan tpch_30.lineitem (179998372 rows, 73 files) =======
scan-proj/iceberg-rust/tpch_30.lineitem
time: [1.3909 s 1.4680 s 1.5577 s]
thrpt: [115.55 Melem/s 122.61 Melem/s 129.41 Melem/s]
scan-proj/datafusion-parquet/tpch_30.lineitem
time: [1.0851 s 1.2728 s 1.4682 s]
thrpt: [122.59 Melem/s 141.42 Melem/s 165.88 Melem/s]
scan-proj/pyarrow/tpch_30.lineitem
time: [14.964 s 15.105 s 15.248 s]
thrpt: [11.805 Melem/s 11.917 Melem/s 12.029 Melem/s]
====== projection scan tpch_50.lineitem (300005811 rows, 126 files) =======
scan-proj/iceberg-rust/tpch_50.lineitem
time: [2.1721 s 2.2214 s 2.2856 s]
thrpt: [131.26 Melem/s 135.05 Melem/s 138.12 Melem/s]
scan-proj/datafusion-parquet/tpch_50.lineitem
time: [1.4452 s 1.6551 s 1.8564 s]
thrpt: [161.61 Melem/s 181.26 Melem/s 207.59 Melem/s]
scan-proj/pyarrow/tpch_50.lineitem
time: [24.140 s 24.689 s 25.438 s]
thrpt: [11.793 Melem/s 12.151 Melem/s 12.428 Melem/s]
====== filter scan tpch_1.lineitem (3602799 rows after filter, 3 files) =======
scan-filter-proj/iceberg-rust/tpch_1.lineitem
time: [403.70 ms 414.07 ms 426.11 ms]
thrpt: [8.4551 Melem/s 8.7009 Melem/s 8.9245 Melem/s]
scan-filter-proj/datafusion-parquet/tpch_1.lineitem
time: [346.64 ms 435.62 ms 550.76 ms]
thrpt: [6.5416 Melem/s 8.2706 Melem/s 10.394 Melem/s]
scan-filter-proj/pyarrow/tpch_1.lineitem
time: [1.8434 s 1.9611 s 2.0237 s]
thrpt: [1.7803 Melem/s 1.8371 Melem/s 1.9544 Melem/s]
====== filter scan tpch_10.lineitem (35995193 rows after filter, 31 files) =======
scan-filter-proj/iceberg-rust/tpch_10.lineitem
time: [1.5958 s 1.6472 s 1.7367 s]
thrpt: [20.727 Melem/s 21.852 Melem/s 22.556 Melem/s]
scan-filter-proj/datafusion-parquet/tpch_10.lineitem
time: [507.90 ms 574.33 ms 650.90 ms]
thrpt: [55.301 Melem/s 62.674 Melem/s 70.870 Melem/s]
scan-filter-proj/pyarrow/tpch_10.lineitem
time: [11.458 s 11.528 s 11.603 s]
thrpt: [3.1022 Melem/s 3.1223 Melem/s 3.1414 Melem/s]
====== filter scan tpch_30.lineitem (108004317 rows after filter, 73 files) =======
scan-filter-proj/iceberg-rust/tpch_30.lineitem
time: [4.7494 s 4.8003 s 4.8586 s]
thrpt: [22.230 Melem/s 22.500 Melem/s 22.741 Melem/s]
scan-filter-proj/datafusion-parquet/tpch_30.lineitem
time: [847.40 ms 978.37 ms 1.1115 s]
thrpt: [97.172 Melem/s 110.39 Melem/s 127.45 Melem/s]
scan-filter-proj/pyarrow/tpch_30.lineitem
time: [35.479 s 35.661 s 35.877 s]
thrpt: [3.0104 Melem/s 3.0287 Melem/s 3.0441 Melem/s]
====== partition scan default.taxi_fhvhv_partitioned (19132122 rows after filter, 3283 files) =======
scan-filter-part/iceberg-rust/2023-07-01 00:00:00Z
time: [202.33 s 202.60 s 202.87 s]
thrpt: [94.308 Kelem/s 94.433 Kelem/s 94.557 Kelem/s]
scan-filter-part/datafusion-parquet/2023-07-01 00:00:00Z
time: [12.346 s 13.410 s 15.301 s]
thrpt: [1.2504 Melem/s 1.4267 Melem/s 1.5497 Melem/s]
scan-filter-part/pyarrow/2023-07-01T00:00:00+00:00
time: [1.5906 s 1.6076 s 1.6228 s]
thrpt: [11.789 Melem/s 11.901 Melem/s 12.028 Melem/s]
Important
TODO: this benchmark did not complete in a reasonable timeframe :(
====== limit scan tpch_100.lineitem (100_000 rows, 248 files) ======
scan-limit/iceberg-rust/100_000
time: [833.65 ms 845.67 ms 857.68 ms]
thrpt: [116.59 Kelem/s 118.25 Kelem/s 119.95 Kelem/s]
scan-limit/datafusion-parquet/100_000
time: [369.68 ms 389.58 ms 408.36 ms]
thrpt: [244.88 Kelem/s 256.69 Kelem/s 270.50 Kelem/s]
scan-limit/pyarrow/100_000
time: [7.7994 s 7.9748 s 8.1733 s]
thrpt: [12.235 Kelem/s 12.539 Kelem/s 12.822 Kelem/s]
====== limit scan tpch_100.lineitem (1_000_000 rows, 248 files) ======
scan-limit/iceberg-rust/1_000_000
time: [1.1342 s 1.1620 s 1.1907 s]
thrpt: [839.82 Kelem/s 860.62 Kelem/s 881.69 Kelem/s]
scan-limit/datafusion-parquet/1_000_000
time: [550.65 ms 556.58 ms 562.48 ms]
thrpt: [1.7778 Melem/s 1.7967 Melem/s 1.8160 Melem/s]
scan-limit/pyarrow/1_000_000
time: [7.7004 s 7.8288 s 7.9564 s]
thrpt: [125.68 Kelem/s 127.73 Kelem/s 129.86 Kelem/s]
====== limit scan tpch_100.lineitem (10_000_000 rows, 248 files) ======
scan-limit/iceberg-rust/10_000_000
time: [3.0728 s 3.1072 s 3.1394 s]
thrpt: [3.1853 Melem/s 3.2183 Melem/s 3.2543 Melem/s]
scan-limit/datafusion-parquet/10_000_000
time: [684.89 ms 706.70 ms 737.87 ms]
thrpt: [13.553 Melem/s 14.150 Melem/s 14.601 Melem/s]
scan-limit/pyarrow/10_000_000
time: [8.4940 s 8.6290 s 8.7932 s]
thrpt: [1.1372 Melem/s 1.1589 Melem/s 1.1773 Melem/s]
====== limit scan tpch_100.lineitem (100_000_000 rows, 248 files) ======
scan-limit/iceberg-rust/100_000_000
time: [20.476 s 20.550 s 20.673 s]
thrpt: [4.8371 Melem/s 4.8661 Melem/s 4.8838 Melem/s]
scan-limit/datafusion-parquet/100_000_000
time: [1.8163 s 1.8976 s 1.9770 s]
thrpt: [50.582 Melem/s 52.699 Melem/s 55.058 Melem/s]
scan-limit/pyarrow/100_000_000
time: [17.738 s 17.859 s 17.977 s]
thrpt: [5.5627 Melem/s 5.5993 Melem/s 5.6375 Melem/s]