
Data corruption when inserting OracleDataFrames that are views #574

@vladidobro

Description


Hi,

  1. What versions are you using?
con.version: 23.26.1.1.0
platform.platform: macOS-26.2-arm64-arm-64bit
sys.maxsize > 2**32: True
platform.python_version: 3.12.12
oracledb.__version__: 3.4.2
  2. Is it an error or a hang or a crash?

Error

  3. What error(s) or behavior are you seeing?
import pandas as pd
import oracledb

with oracledb.connect(...) as con:
    cur = con.cursor()
    cur.execute('create private temporary table ora$ptt_mytest (a int) on commit drop definition')
    df = pd.DataFrame({'A': [0, 1]}).convert_dtypes(dtype_backend='pyarrow')
    df = df.iloc[1:2]
    print('Inserted:\n', df)
    cur.executemany('insert into ora$ptt_mytest (a) values (:a)', df)
    print('Selected:\n', pd.DataFrame.from_arrow(con.fetch_df_all('select * from ora$ptt_mytest')))

The "Inserted" and "Selected" dataframes should be identical, but they differ:

Inserted:
    A
1  1
Selected:
      A
0  0.0

We have isolated the problem: it occurs only when the dataframe is some kind of view of another one.
Here we create a dataframe with 2 rows and then call df.iloc[1:2], which selects only the second row.
Instead, oracledb inserted the first row.

This does not happen if we (I believe):

  • don't call .convert_dtypes(), because then the Arrow PyCapsule interface creates new buffers
  • round trip the dataframe e.g. through a file, so that we are sure it owns its data

This leads me to believe that when I call df.iloc[1:2], the result points into the same Arrow buffer with an offset, and this offset is ignored by oracledb (though the length is not, since only 1 row from the buffer is inserted).
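
To illustrate the suspected mechanism, here is a minimal stdlib sketch (an analogy, not oracledb's actual code) of a reader that honors an Arrow-style (offset, length) view versus one that ignores the offset:

```python
from array import array

# An Arrow slice is a zero-copy view described by (offset, length) into
# the parent buffer; a consumer must honor both fields.
parent = array('q', [0, 1])       # int64 buffer behind the 2-row frame
view_offset, view_length = 1, 1   # what df.iloc[1:2] would export

def read_honoring_offset(buf, offset, length):
    return list(buf[offset:offset + length])

def read_ignoring_offset(buf, offset, length):
    # hypothetical buggy reader: keeps the length but starts at 0
    return list(buf[:length])

print(read_honoring_offset(parent, view_offset, view_length))  # [1]
print(read_ignoring_offset(parent, view_offset, view_length))  # [0]
```

The second reader reproduces the symptom exactly: one row inserted, but the wrong one.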

We ran into this problem with manual batching, i.e.

big_df: pd.DataFrame
batches = [big_df.iloc[i:i+1000] for i in range(0, len(big_df), 1000)]
for b in batches:
    cur.executemany('INSERT ...', b)

which kept inserting the first batch again and again.
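
Until this is fixed, a workaround sketch (assuming the offset diagnosis above is right; `big_df`, `batch_size`, and the commented-out cursor call are illustrative): materialize each batch into plain Python tuples so no sliced Arrow buffer is ever exported to oracledb:

```python
import pandas as pd

big_df = pd.DataFrame({'a': range(5)})  # stand-in for the real big_df
batch_size = 2

for start in range(0, len(big_df), batch_size):
    batch = big_df.iloc[start:start + batch_size]          # still a view
    rows = list(batch.itertuples(index=False, name=None))  # owns its data
    # cur.executemany('INSERT INTO ... VALUES (:1)', rows)
    print(rows)
```

This trades the zero-copy Arrow path for a row-by-row copy, so it is slower, but each batch carries its own data.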

This may be a problem with pandas not correctly exporting the Arrow PyCapsule, but probably not, because polars.from_arrow(df) gives the correct result.

  4. Does your application call init_oracle_client()?

No, using thin mode.

Labels: bug