table.upsert works only with batching #2058

Closed
@anuunchin

Feature Request / Improvement

Hi team,

I recently found that table.upsert fails with unexpected low-level errors, such as a bus error or an illegal hardware instruction error. I tried to isolate the problem in the attached files.

How to recreate

  • Run first_run.py
  • Run second_run.py with the commented-out upsert enabled:
     #table.upsert(
     #    df=data,
     #    join_cols=['block_number', 'transaction_index', 'log_index'],
     #    when_matched_update_all=True,
     #    when_not_matched_insert_all=True,
     #    case_sensitive=True,
     #)

Note that the following works, upserting the same data in batches of at most 1,000 rows:

for rb in data.to_batches(max_chunksize=1_000):
    batch_tbl = pa.Table.from_batches([rb])

    table.upsert(
        df=batch_tbl,
        join_cols=['block_number', 'transaction_index', 'log_index'],
        when_matched_update_all=True,
        when_not_matched_insert_all=True,
        case_sensitive=True,
    )
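
For anyone who wants to reuse the workaround, here is a rough sketch of it wrapped in a helper. The name upsert_in_chunks and the chunk_size parameter are my own, not part of PyIceberg. It uses pyarrow's Table.slice and num_rows instead of to_batches, which I assume behaves the same for this purpose but have not verified against the crash.

```python
def upsert_in_chunks(table, data, join_cols, chunk_size=1_000):
    """Upsert `data` into `table` in slices of at most `chunk_size` rows.

    Hypothetical helper: `table` is assumed to be a pyiceberg Table and
    `data` a pyarrow Table (anything with `num_rows` and `slice` works).
    Returns the number of upsert calls made.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    calls = 0
    for offset in range(0, data.num_rows, chunk_size):
        # slice() clamps the length at the end of the table,
        # so the last chunk may be shorter than chunk_size.
        chunk = data.slice(offset, chunk_size)
        table.upsert(
            df=chunk,
            join_cols=join_cols,
            when_matched_update_all=True,
            when_not_matched_insert_all=True,
            case_sensitive=True,
        )
        calls += 1
    return calls
```

Usage would look like upsert_in_chunks(table, data, ['block_number', 'transaction_index', 'log_index']). Note that each chunk is its own upsert call, so a failure partway through leaves the earlier chunks applied.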

Versions

Pyiceberg version: 0.9.1
Pyarrow: 20.0.0 (Also tried with 18.0.0, 17.0.0)
Hardware: Apple M2

Additional context

The same issue seems to have been mentioned here.

Thank you in advance! 😊

first.zip
second.zip
scripts.zip