feat(filesystem): stream Iceberg merge/upsert with batched reads and GC interval by ajinzrathod-tmdc · Pull Request #1 · tmdc-io/dlt

ajinzrathod-tmdc · 2026-04-08T13:43:08Z

Summary
Streams Iceberg loads on the filesystem destination so large Parquet inputs are not fully materialized in memory. Merge/upsert and append/replace paths consume Arrow data as RecordBatchReader / batched scans, with optional periodic gc.collect() to cap memory during long runs.
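The batching and periodic GC described above can be sketched as follows. This is an illustration only, not the code in this PR: `chunked`, `stream_batches`, `write_batch`, and `gc_interval` are hypothetical names, and the in-memory list stands in for an Arrow `RecordBatchReader` or batched scan.

```python
import gc
from typing import Callable, Iterable, Iterator, List, Sequence


def chunked(rows: Sequence[dict], batch_size: int) -> Iterator[List[dict]]:
    """Stand-in for a batched reader: yield one fixed-size slice at a time."""
    for start in range(0, len(rows), batch_size):
        yield list(rows[start:start + batch_size])


def stream_batches(
    batches: Iterable[List[dict]],
    write_batch: Callable[[List[dict]], None],
    gc_interval: int = 0,
) -> int:
    """Write batches one by one; optionally force a GC pass every N batches.

    gc_interval <= 0 disables the explicit collection (assumed default).
    Returns the total number of rows written.
    """
    written = 0
    for index, batch in enumerate(batches, start=1):
        write_batch(batch)
        written += len(batch)
        # Only the current batch is referenced here, so a periodic collect
        # can reclaim cyclic garbage left over from earlier batches.
        if gc_interval > 0 and index % gc_interval == 0:
            gc.collect()
    return written
```

Because the writer only ever holds one batch, peak memory in a loop like this depends on the batch size rather than on the size of the input file.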

Tests
tests/load/pipeline/test_open_table_pipeline.py: expanded coverage for the streaming Iceberg load / open-table pipeline behavior (per commit).

This is the snapshot after the following operations:

Replace 500k records [1]
Append 500k records [2]
Replace 500k records [3, 4]

Each append/replace operation took ~7.5 minutes (55 GB of data). Peak RSS recorded: ~2800 MB.
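The PR does not say how peak RSS was recorded. One common way to capture it from inside a Python process on Unix is the standard-library `resource` module; the helper name `peak_rss_mb` is hypothetical, and this sketch assumes Linux or macOS (`resource` is not available on Windows).

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor
```

Calling `peak_rss_mb()` at the end of a load job gives the high-water mark for the whole run, so it does not need to be sampled during the run.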

Upsert numbers (against the current snapshot, which points to 55 GB of data)

1 file (256 MB): ~17 minutes, ~2.2 GB RSS

4 files (256 MB each): ~1.5 hours, ~4 GB RSS

FWIW, this can be made significantly faster with a SQL-driven approach. I tested a three-step pipeline:

1. Scan the table's primary keys into SQLite.
2. Identify only the Iceberg data files impacted by the incoming update Parquet files.
3. Apply updates via PyIceberg on just that planned subset.

The full run completed in ~10 minutes for all touched files with ~900 MB of update Parquet, with much lower runtime and memory usage than the streamed upsert approach for comparable work.
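The planning part of that pipeline (steps 1 and 2) might look roughly like the sketch below. It is an assumption about the shape of the approach, not the tested pipeline itself: the `pk_index` and `incoming` table names and the `plan_touched_files` function are hypothetical, and the PyIceberg step that rewrites the planned files is left out.

```python
import sqlite3
from typing import Iterable, List, Tuple


def plan_touched_files(
    existing_keys: Iterable[Tuple[str, str]],
    incoming_keys: Iterable[str],
) -> List[str]:
    """Return the data files that hold at least one incoming primary key.

    existing_keys: (primary_key, data_file_path) pairs scanned from the table.
    incoming_keys: primary keys found in the update Parquet files.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE pk_index (pk TEXT NOT NULL, file_path TEXT NOT NULL)"
        )
        conn.execute("CREATE TABLE incoming (pk TEXT PRIMARY KEY)")
        conn.executemany("INSERT INTO pk_index VALUES (?, ?)", existing_keys)
        # Index after the bulk insert so the join below is a key lookup.
        conn.execute("CREATE INDEX idx_pk_index_pk ON pk_index (pk)")
        conn.executemany(
            "INSERT OR IGNORE INTO incoming VALUES (?)",
            ((key,) for key in incoming_keys),
        )
        rows = conn.execute(
            "SELECT DISTINCT p.file_path "
            "FROM pk_index AS p JOIN incoming AS i ON i.pk = p.pk "
            "ORDER BY p.file_path"
        ).fetchall()
        return [row[0] for row in rows]
    finally:
        conn.close()
```

Incoming keys that match no existing file are new rows; they do not force any existing data file to be rewritten and can be appended separately.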

ajinzrathod-tmdc · 2026-04-08T13:46:20Z

P.S. Every operation was run under a 4 GB memory limit with a batch size of 10,000.

…tproof classpath

In Spark local[*] mode, spark.executor.extraClassPath set programmatically via SparkConf is silently ignored: the local executor reuses the driver JVM but takes its classpath from the launcher, not from the conf. This caused HadoopFileIO to fail with ClassNotFoundException for SecureAzureBlobFileSystem on the executor side even when driver-side class loading worked.

Fix: symlink (or copy) cached jars into $SPARK_HOME/jars/ before SparkSession creation. That directory is loaded by Spark's launcher onto the JVM system classloader before any user code, so every subsequent classloader (driver, executor, Hadoop's Configuration.classLoader) sees the classes unconditionally.

Falls back to spark.jars + extraClassPath + --jars when $SPARK_HOME/jars is not writable (e.g. read-only filesystem).

Co-authored-by: Cursor <cursoragent@cursor.com>
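A rough sketch of the symlink-or-copy step with its fallback signal, under stated assumptions: `install_jars` is a hypothetical helper, not the function added in this commit, and the caller is assumed to pass any returned jars through spark.jars, extraClassPath, and --jars instead.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List


def install_jars(cached_jars: Iterable[str], spark_home: str) -> List[str]:
    """Link (or copy) jars into <spark_home>/jars before the session starts.

    Returns the jars that could NOT be installed, so the caller can fall
    back to configuration-based classpath settings for those.
    """
    jars = [Path(jar) for jar in cached_jars]
    jars_dir = Path(spark_home) / "jars"
    # Missing or read-only directory: nothing can be installed.
    if not jars_dir.is_dir() or not os.access(jars_dir, os.W_OK):
        return [str(jar) for jar in jars]

    not_installed: List[str] = []
    for jar in jars:
        target = jars_dir / jar.name
        if os.path.lexists(target):
            continue  # already present from an earlier run
        try:
            os.symlink(jar.resolve(), target)
        except OSError:
            # Symlinks may be unsupported or forbidden; try a plain copy.
            try:
                shutil.copy2(jar, target)
            except OSError:
                not_installed.append(str(jar))
    return not_installed
```

This has to run before the SparkSession is created, since the launcher reads the jars directory when the JVM starts.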

feat(filesystem): stream Iceberg merge/upsert with batched reads and GC interval

7bc4f62

ajinzrathod-tmdc requested review from darshan-tmdc and rakesh-tmdc April 8, 2026 13:43

ajinzrathod-tmdc marked this pull request as draft April 9, 2026 06:24

ajinzrathod-tmdc and others added 12 commits April 11, 2026 01:45

feat(filesystem): add Spark-based Iceberg merge engine with atomic rollback

5d3ce99

[feat/iceberg-streaming-atomic-commit] feat(filesystem): make Spark Iceberg merge engine production-ready

d4be6c3

feat(filesystem): true atomicity for Spark Iceberg writes via WAP fast_forward and added support for multiple cloud providers

a66d82a

add _DEFAULT_HADOOP_AZURE_VERSION

118a7d0

fix(spark): register Hadoop FS impl classes for abfss:// scheme

1999735

fix(spark): pass cached jars via --jars in PYSPARK_SUBMIT_ARGS

902ecfe

fix(spark-iceberg): add driver/executor extraClassPath for cached jars

424e3f3

fix(spark-iceberg): use correct Iceberg ADLSFileIO property names for Azure auth

1b61433

spark-iceberg: add azure cred diagnostic prints

8459176

feat(filesystem): configurable Iceberg upload chunk, file size, and parquet batch via env

32a1dce

fix(spark-iceberg): remove diagnostic print/_diag noise, keep operational logging

2cc038d

ajinzrathod-tmdc wants to merge 13 commits into 1.21.0-merged from feat/iceberg-streaming-atomic-commit
