Conversation

@andygrove
Member

Summary

  • Add scripts to benchmark TPC-H queries against Iceberg tables using Comet's native iceberg-rust integration
  • create-iceberg-tpch.py: Convert Parquet TPC-H data to Iceberg tables
  • tpcbench-iceberg.py: Run TPC-H queries against Iceberg catalog tables
  • comet-tpch-iceberg.sh: Shell script to run the benchmark with Comet
  • Update README.md with Iceberg benchmarking documentation

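The conversion step can be sketched as SQL generation over the eight TPC-H tables. This is a hypothetical sketch only: the CTAS approach, the table list ordering, and the per-table directory layout are assumptions, and create-iceberg-tpch.py may do this differently (for example via the DataFrame API).

```python
# Hypothetical sketch of the Parquet-to-Iceberg conversion step.
# The CTAS statements and path layout are assumptions, not taken from
# create-iceberg-tpch.py itself.
TPCH_TABLES = ["customer", "lineitem", "nation", "orders",
               "part", "partsupp", "region", "supplier"]

def conversion_sql(parquet_path, catalog, database):
    """Yield one CREATE TABLE ... AS SELECT statement per TPC-H table."""
    for table in TPCH_TABLES:
        yield (
            f"CREATE TABLE {catalog}.{database}.{table} USING iceberg AS "
            f"SELECT * FROM parquet.`{parquet_path}/{table}`"
        )
```

Each generated statement would then be executed with `spark.sql(...)` against a session whose Iceberg catalog is already configured.
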
Test plan

  • Run create-iceberg-tpch.py to create Iceberg tables from Parquet data
  • Run comet-tpch-iceberg.sh and verify CometIcebergNativeScanExec appears in plans
  • Compare benchmark results between Parquet and Iceberg formats

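Checking that the native scan actually kicks in can be automated by scanning the physical plan text for the operator named in the test plan above. The helper below is a hypothetical sketch, not part of the PR:

```python
import re

def comet_operators(explain_output):
    """Collect Comet physical operator names (e.g. CometIcebergNativeScanExec)
    from EXPLAIN / df.explain() output."""
    return re.findall(r"Comet\w*Exec", explain_output)

def uses_native_iceberg_scan(explain_output):
    """True when Comet's native Iceberg scan operator appears in the plan."""
    return "CometIcebergNativeScanExec" in comet_operators(explain_output)
```
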
🤖 Generated with Claude Code

@andygrove changed the title from "[WIP] Add Iceberg TPC-H benchmarking scripts" to "chore: Add Iceberg TPC-H benchmarking scripts [WIP]" on Jan 27, 2026
andygrove and others added 3 commits January 27, 2026 13:32
Merge tpcbench-iceberg.py into tpcbench.py using mutually exclusive args:
- --data for Parquet files
- --catalog/--database for Iceberg tables

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
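
The mutually exclusive wiring described in the commit message can be sketched with argparse. The flag names come from the commit message above; everything else (help strings, the "tpch" default) is an assumption, not taken from the actual tpcbench.py:

```python
import argparse

def build_parser():
    # Sketch of the merged tpcbench.py argument wiring: the benchmark reads
    # either local Parquet files (--data) or Iceberg catalog tables
    # (--catalog/--database), never both.
    parser = argparse.ArgumentParser(description="TPC-H benchmark driver")
    source = parser.add_mutually_exclusive_group(required=True)
    source.add_argument("--data", help="path to Parquet TPC-H data")
    source.add_argument("--catalog", help="Iceberg catalog name")
    # The "tpch" default is an assumption for illustration.
    parser.add_argument("--database", default="tpch",
                        help="Iceberg database containing the TPC-H tables")
    return parser
```

With this wiring, argparse rejects --data combined with --catalog automatically and requires at least one data source.
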
@andygrove changed the title from "chore: Add Iceberg TPC-H benchmarking scripts [WIP]" to "chore: Add Iceberg TPC-H benchmarking scripts" on Jan 27, 2026
@andygrove andygrove marked this pull request as ready for review January 27, 2026 20:45
@andygrove
Member Author

@mbutrovich I can now run TPC-H w/ Iceberg native scan locally

Resolve conflict in tpcbench.py by combining:
- Upstream: --format and --options for multiple file formats
- Branch: --catalog and --database for Iceberg tables

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

```shell
$SPARK_HOME/bin/spark-submit \
  --master $SPARK_MASTER \
  --jars $ICEBERG_JAR \
```
Contributor

It should work either way, but this doesn't match the usage in create-iceberg-tpch.py: there we use --packages, while here we're defining the jar directly. Both should work, I think, but it would be best to be consistent.

```shell
  --conf spark.cores.max=8 \
  --conf spark.executor.memory=16g \
  --conf spark.sql.catalog.local=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.local.type=hadoop \
```
@mbutrovich (Contributor) commented on Jan 27, 2026

This hardcodes the catalog. Above you have ICEBERG_CATALOG=${ICEBERG_CATALOG:-local}. I'd be consistent.

```shell
  --conf spark.sql.catalog.local.warehouse=$ICEBERG_WAREHOUSE \
  create-iceberg-tpch.py \
  --parquet-path $TPCH_DATA \
  --catalog local \
```
Contributor

Same hardcoded catalog.

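The suggested fix, deriving every spark.sql.catalog.* key from one catalog name instead of hardcoding "local", can be sketched in a Python driver. The helper name is hypothetical; the three conf values mirror the spark-submit snippet in the review above, and the default mirrors the shell fallback ICEBERG_CATALOG=${ICEBERG_CATALOG:-local}.

```python
import os

def iceberg_catalog_confs(warehouse, catalog=None):
    # Build the spark.sql.catalog.* settings from a single catalog name so
    # nothing is hardcoded. Falls back to the ICEBERG_CATALOG environment
    # variable, then to "local", matching the shell default.
    catalog = catalog or os.environ.get("ICEBERG_CATALOG", "local")
    prefix = f"spark.sql.catalog.{catalog}"
    return {
        prefix: "org.apache.iceberg.spark.SparkCatalog",
        f"{prefix}.type": "hadoop",
        f"{prefix}.warehouse": warehouse,
    }
```

Every --conf key is then derived from one variable, so renaming the catalog cannot leave a stale hardcoded "local" behind.
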
@codecov-commenter

codecov-commenter commented Jan 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 59.95%. Comparing base (f09f8af) to head (07e1fa3).
⚠️ Report is 904 commits behind head on main.

Additional details and impacted files
```
@@             Coverage Diff              @@
##               main    #3294      +/-   ##
============================================
+ Coverage     56.12%   59.95%   +3.82%     
- Complexity      976     1473     +497     
============================================
  Files           119      175      +56     
  Lines         11743    16167    +4424     
  Branches       2251     2682     +431     
============================================
+ Hits           6591     9693    +3102     
- Misses         4012     5126    +1114     
- Partials       1140     1348     +208     
```

- Use --packages instead of --jars for table creation to match
  create-iceberg-tpch.py usage
- Use $ICEBERG_CATALOG variable instead of hardcoding 'local' in
  spark.sql.catalog config to be consistent with comet-tpch-iceberg.sh
- Clarify that JAR download is only needed for benchmark execution

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
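
The --packages vs --jars distinction from the review can be captured in one place when building the spark-submit command line. A sketch under stated assumptions: the helper name, the Maven coordinate, and the Iceberg version below are illustrative only, not taken from the PR.

```python
def iceberg_dependency_args(jar_path=None,
                            package="org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.8.1"):
    # --packages lets Spark resolve the Iceberg runtime from Maven, while
    # --jars points at a local file; the commit above standardizes on
    # --packages for table creation. The default coordinate here is an
    # assumption and should be pinned to whatever version the benchmark
    # actually targets.
    if jar_path:
        return ["--jars", jar_path]
    return ["--packages", package]
```

Routing both scripts through one helper like this keeps the two invocations consistent by construction.
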
@andygrove
Member Author

Thanks @mbutrovich! I pushed a commit to address the feedback.
