Skip to content

Bug/AP-25573 Fix bugs with pandas >= 2.1.0#56

Open
HedgehogCode wants to merge 8 commits intomasterfrom
bug/AP-25573-pandas-2-3-support
Open

Bug/AP-25573 Fix bugs with pandas >= 2.1.0#56
HedgehogCode wants to merge 8 commits intomasterfrom
bug/AP-25573-pandas-2-3-support

Conversation

@HedgehogCode
Copy link
Contributor

No description provided.

@HedgehogCode HedgehogCode requested a review from a team as a code owner February 3, 2026 14:29
@HedgehogCode HedgehogCode requested review from Copilot and knime-ghub-bot and removed request for a team February 3, 2026 14:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses bugs that emerged with pandas version 2.1.0 and later, specifically related to how pandas handles extension arrays and struct dict encoded data. The changes fix data corruption issues during pandas-arrow round-trips and ensure compatibility with newer pandas versions.

Changes:

  • Fixed type inference issues in PyArrow array creation by explicitly specifying type=pa.bool_()
  • Modified KnimePandasExtensionArray.take() to prevent struct dict encoding corruption when pandas re-batches data
  • Added empty array handling in isna() method
  • Converted test suite from unittest to pytest with parameterized testing for both populated and empty tables

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
pytest.ini Re-enabled pytest class discovery to support class-based tests
org.knime.python3.arrow/src/main/python/knime/_arrow/_types.py Added explicit boolean type to mask array creation
org.knime.python3.arrow/src/main/python/knime/_arrow/_table.py Added TODO comment about struct-dict-encoded array re-batching issue
org.knime.python3.arrow/src/main/python/knime/_arrow/_pandas.py Fixed pandas 2.1.0+ compatibility by handling empty arrays and optimizing take() to prevent encoding corruption
org.knime.python3.arrow/src/main/python/knime/_arrow/_dictencoding.py Added explicit boolean type to mask array creation
org.knime.python3.arrow.tests/src/test/python/unittest/test_table.py Converted from unittest to pytest with parameterized testing for empty/non-empty tables
org.knime.python3.arrow.tests/src/test/python/unittest/test_pandas_extension_type.py Added regression test for struct dict encoding corruption bug
org.knime.python3.arrow.tests/src/test/python/unittest/structDictEncodedDataCellsWithBatches.zip Added test data file for struct dict encoding regression test
org.knime.python3.arrow.tests/src/test/python/unittest/emptyGeneratedTestData.zip Added test data file for empty table testing

Copy link
Contributor

Copilot AI commented Feb 5, 2026

@HedgehogCode I've opened a new pull request, #57, to work on those changes. Once the pull request is ready, I'll request review from you.

@HedgehogCode HedgehogCode force-pushed the bug/AP-25573-pandas-2-3-support branch from ce94bd8 to 239a578 Compare February 5, 2026 09:23
…mpty arrays

The usual case works for empty arrays and handles dtypes of extension arrays correctly.

AP-25573 (Workflow tests fail with pandas 2.3)
… of int/bool

AP-25573 (Workflow tests fail with pandas 2.3)
Otherwise, if the chunked array has no chunks and we attempt to concatenate
no arrays, which fails.

AP-25573 (Workflow tests fail with pandas 2.3)
@HedgehogCode HedgehogCode force-pushed the bug/AP-25573-pandas-2-3-support branch from 239a578 to 6f8f9dd Compare February 5, 2026 09:29
Copilot AI review requested due to automatic review settings February 5, 2026 09:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 3 comments.

…are taken

Bug: In pandas 2.1.0+, pd.concat changed its behavior and now calls
KnimePandasExtensionArray.take with indices [0,1,2,...,n] to effectively copy the
array. When take is called on struct dict encoded arrays, it delegates to
storage.take(), which merges chunks of the ChunkedArray without recalculating the
struct-dict encoding. This breaks the dictionary structure across chunk boundaries:
keys in chunk 2 reference indices that only exist in chunk 1's data array, causing
"Cannot read DataCell with empty type information" errors.

Fix: Shortcut the specific case where all indices are taken sequentially by
returning a copy instead of calling storage.take(). This fixes the pd.concat
code path that triggers during arrow → pandas → arrow round-trips.

Limitation: This is only a partial fix for the very specific case of taking all
indices. The general take() and re-batching behavior for struct dict encoded arrays
remains buggy when partial indices are selected or arbitrary re-batching occurs.

AP-25573 (Workflow tests fail with pandas 2.3)
…ionArray.take

AP-25573 (Workflow tests fail with pandas 2.3)
@HedgehogCode HedgehogCode force-pushed the bug/AP-25573-pandas-2-3-support branch from 6f8f9dd to e7ba282 Compare February 5, 2026 11:58
@sonarqubecloud
Copy link

sonarqubecloud bot commented Feb 5, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants