Improve to_arrow_batch_reader performance + use to_arrow_batch_reader in upsert to lower memory pressure #1995

Open · wants to merge 17 commits into main

Conversation

@koenvo (Contributor) commented May 13, 2025

Summary

This PR updates the upsert logic to use batch processing. The main goal is to prevent out-of-memory (OOM) issues when updating large tables by avoiding loading all data at once.

Note: This has only been tested against the unit tests—no real-world datasets have been evaluated yet.

This PR partially depends on functionality introduced in #1817.
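
As a rough illustration of the direction (not the exact code in this PR), consuming the scan through to_arrow_batch_reader keeps only one RecordBatch in memory at a time; the catalog and table names below are placeholders:

from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")            # hypothetical catalog name
table = catalog.load_table("db.some_table")  # hypothetical table identifier

# Stream the existing rows instead of materializing them all with to_arrow().
reader = table.scan().to_arrow_batch_reader()

total_rows = 0
for batch in reader:  # each item is a pyarrow.RecordBatch
    total_rows += batch.num_rows
    # match the batch against the incoming data here, then let it go out of scope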


Notes

  • Duplicate detection across multiple batches is not possible with this approach.
  • All data is read sequentially, which may be slower than the parallel read used by to_arrow. (Addressed by adding a concurrent_tasks parameter.)

Performance Comparison

In setups with many small files, network and metadata overhead become the dominant factor. This impacts batch reading performance, as each file contributes relatively more overhead than payload. In the test setup used here, metadata access was the largest cost.

Using to_arrow_batch_reader (sequential):

  • Scan: 9993.50 ms
  • To list: 19811.09 ms

Using to_arrow (parallel):

  • Scan: 10607.88 ms

@jayceslesar (Contributor) commented:

fwiw I think we should try to get this merged in at some point. Some ideas:

  1. Make it a flag whether to use the batch reader or not; some users might have basically infinite memory.
  2. Is there a more optimal way to batch data? Thinking along the lines of using partitions, although that may already happen under the hood.

@koenvo (Contributor, Author) commented Jun 2, 2025

> fwiw I think we should try to get this merged in at some point. Some ideas:
>
>   1. Make it a flag whether to use the batch reader or not; some users might have basically infinite memory.
>   2. Is there a more optimal way to batch data? Thinking along the lines of using partitions, although that may already happen under the hood.

I've been thinking about what I (as a developer) want. The answer: being able to set a maximum memory usage.

Some ideas:

  1. Determine which partitions can fit together in memory and batch-load those together (see the sketch after this list)
  2. Fetch the Parquet files in parallel and only do the loading sequentially
  3. Combine 1 and 2
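
Purely as an illustration of idea 1, here is a rough sketch that groups the scan's planned file tasks under a byte budget; the budget value is made up, and compressed file size is only a proxy for decoded in-memory size:

from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("db.some_table")  # hypothetical names

MAX_GROUP_BYTES = 512 * 1024 * 1024  # hypothetical 512 MB budget per group

# Group scan tasks so each group's summed (compressed) file size stays under the budget.
groups, current, current_bytes = [], [], 0
for task in table.scan().plan_files():
    size = task.file.file_size_in_bytes
    if current and current_bytes + size > MAX_GROUP_BYTES:
        groups.append(current)
        current, current_bytes = [], 0
    current.append(task)
    current_bytes += size
if current:
    groups.append(current)
# Each group could then be loaded and processed as one in-memory chunk.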

@koenvo (Contributor, Author) commented Jun 2, 2025

Did an update and ran a quick benchmark with different concurrent_tasks settings on to_arrow_batch_reader():

import tqdm
import pyarrow as pa

pool = pa.default_memory_pool()
table = catalog.load_table("some_table")  # catalog: a previously configured pyiceberg catalog

# Benchmark loop
p = table.scan().to_arrow_batch_reader(concurrent_tasks=100)
for batch in tqdm.tqdm(p):
    print(pool.max_memory())

Results (including pool.max_memory()):

  • concurrent_tasks=1: 52it [00:06, 7.73it/s] | Max memory: 7.4 MB
  • concurrent_tasks=10: 391it [00:06, 61.98it/s] | Max memory: 36.3 MB
  • concurrent_tasks=20: 1412it [00:15, 83.54it/s] | Max memory: 147 MB
  • concurrent_tasks=100: 1030it [00:09, 106.84it/s] | Max memory: 1.76 GB

Some more testing (on a 100 Mbit connection):

scan.to_arrow_batch_reader(concurrent_tasks=10)
2025-06-03 11:02:48.986 INFO Starting
2025-06-03 11:05:10.927 INFO Rows: 13584102
2025-06-03 11:05:10.927 INFO Memory usage: 78.4MB

scan.to_arrow()
2025-06-03 11:05:47.211 INFO Starting
2025-06-03 11:08:09.907 INFO Rows: 13584102
2025-06-03 11:08:09.907 INFO Memory usage: 11GB

Note: Performance also depends on the network connection.

return self._record_batches_from_scan_tasks_and_deletes(tasks, deletes_per_file)

if concurrent_tasks is not None:
    with ThreadPoolExecutor(max_workers=concurrent_tasks) as pool:
Review comment:

Rather than create your own threadpool executor here, I think you should use the ExecutorFactory defined elsewhere in the repo. It has a get_or_create method that prevents creating a new threadpool on every call, among other things.
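
For reference, a minimal sketch of what that suggestion looks like from the call site, assuming the factory lives in pyiceberg.utils.concurrent (the placeholder workload is only there to show the call shape):

from pyiceberg.utils.concurrent import ExecutorFactory

# get_or_create() hands back a single shared ThreadPoolExecutor sized from the
# "max-workers" config entry, instead of building a new pool on every call.
executor = ExecutorFactory.get_or_create()

results = list(executor.map(lambda n: n * n, range(10)))  # placeholder workload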

@koenvo (Contributor, Author) replied Jun 3, 2025:

Ah thanks! Had this changed but forgot to push. Only need to make sure I get a pool with the correct max_workers set. Can't just use the regular get_or_create as that might have an incorrect number of workers.

@koenvo koenvo marked this pull request as ready for review June 3, 2025 09:59
@koenvo koenvo changed the title Use batchreader in upsert Improve to_arrow_batch_reader performance + use to_arrow_batch_reader in upsert to lower memory pressure Jun 3, 2025
Comment on lines 1615 to 1631
for batches in executor.map(
    lambda task: list(self._record_batches_from_scan_tasks_and_deletes([task], deletes_per_file)), tasks
):
    for batch in batches:
        current_batch_size = len(batch)
        if self._limit is not None and total_row_count + current_batch_size >= self._limit:
            yield batch.slice(0, self._limit - total_row_count)

            # This break will also cancel all running tasks
            limit_reached = True
            break

        yield batch
        total_row_count += current_batch_size

    if limit_reached:
        break
Review comment (Contributor):

Does this preserve the ordering still? It looks like it did in to_table

@koenvo (Contributor, Author) replied:

Yes. Executor.map maintains ordering by default: it submits all jobs first, then waits for the results in the original order.
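
A small standalone demonstration of that property (plain concurrent.futures, unrelated to the PR code):

import time
from concurrent.futures import ThreadPoolExecutor

def slow_identity(n: int) -> int:
    # Later inputs finish first, but map() still yields results in input order.
    time.sleep(0.5 - n * 0.1)
    return n

with ThreadPoolExecutor(max_workers=5) as executor:
    print(list(executor.map(slow_identity, range(5))))  # -> [0, 1, 2, 3, 4]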

Reply (Contributor):

🤩

@koenvo (Contributor, Author) commented Jun 3, 2025

Did another update to get rid of the concurrent_tasks argument. It now defaults to the max-workers config setting.

I also refactored to_arrow to use to_arrow_batch_reader under the hood to prevent duplicate implementations of the same functionality.
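
Conceptually, the relationship between the two code paths now looks like this (a sketch only; whether the internals use RecordBatchReader.read_all() or pa.Table.from_batches is not shown in this conversation):

import pyarrow as pa
from pyiceberg.catalog import load_catalog

table = load_catalog("default").load_table("db.some_table")  # hypothetical names

# to_arrow() now effectively drains the same streaming reader into one table.
batched: pa.Table = table.scan().to_arrow_batch_reader().read_all()
materialized: pa.Table = table.scan().to_arrow()
assert batched.num_rows == materialized.num_rows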
