Adapt to pandas 3.0#558

Open
tramora wants to merge 2 commits into main from 536-adapt-to-pandas3

Collaborator

@tramora tramora commented Feb 20, 2026

  • For Python 3.10, pandas 2.3.3 is still used, but with the new pandas StringDtype enabled
  • For Python 3.11+, pandas 3.0.0 and later versions are used

Fixes #536


TODO Before Asking for a Review

  • Rebase your branch to the latest version of main (or main-v10)
  • Make sure all CI workflows are green
  • When adding a public feature/fix: Update the Unreleased section of CHANGELOG.md (no date)
  • Self-Review: Review "Files Changed" tab and fix any problems you find
  • API Docs (only if there are changes in docstrings, rst files or samples):
    • Check the docs build without warning: see the log of the API Docs workflow
    • Check that your changes render well in HTML: download the API Docs artifact and open index.html
    • If there are any problems it is faster to iterate by building locally the API Docs

@tramora tramora force-pushed the 536-adapt-to-pandas3 branch 7 times, most recently from 326a9e5 to 0162ef9 Compare February 23, 2026 11:07
@tramora tramora force-pushed the 536-adapt-to-pandas3 branch from 0162ef9 to 97de414 Compare February 25, 2026 10:56
Collaborator

@popescu-v popescu-v left a comment

See the remaining comments.

@tramora tramora force-pushed the 536-adapt-to-pandas3 branch 3 times, most recently from fe55f7f to f94e426 Compare February 25, 2026 17:41
  if isinstance(column_numpy_type, pd.StringDtype):
-     column_max_size = column.str.len().max()
+     # Warning: pandas.Series.str.len() returns a float64
+     column_max_size = int(column.str.len().max())
Collaborator

In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '3.0.0'
In [3]: x = pd.DataFrame(["a", "b", "c"]).astype(pd.StringDtype())
In [4]: x[0].str.len().max()
Out[4]: np.int64(1)

Hence, it seems that (on pandas 3.0.0) the maximum of .str.len() is a numpy.int64.

Collaborator Author

But unfortunately, Series.str.len() returns a float64, and the previous code broke a specific test. See:
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.len.html#pandas.Series.str.len

Collaborator

@popescu-v popescu-v Feb 26, 2026

This is only the case if at least one element is not of string type, in which case .str.len() results in a NaN:

In [1]: y = pd.Series(["a", 5, "c"])

In [2]: y.str.len()
Out[2]: 
0    1.0
1    NaN
2    1.0
dtype: float64

But this is no longer the case if all elements are properly typed to string:

In [3]: y = pd.Series(["a", 5, "c"]).astype(pd.StringDtype())

In [4]: y.str.len()
Out[4]: 
0    1
1    1
2    1
dtype: Int64

However, the code executed here is in the scope of an if isinstance(column.dtype, pd.StringDtype); hence, the return type should be Int64 AFAIU.

Collaborator Author

I reverted to the previous version of the code. If a test breaks, I'll look into the exact condition (for example, whether a NaN remains in the series).

Collaborator

@popescu-v popescu-v Feb 26, 2026

It breaks, because there are missing (StringDtype) values in the input dataframe. See also pandas-dev/pandas#51948. Hence, I would:

  • add a comment before this line of code, stating the issue and referring to the aforementioned issue
  • cast to a pandas nullable integer data type: column.str.len().astype(pd.Int64Dtype()).max().
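
A minimal sketch of the cast suggested above, for illustration only. The helper name max_string_length is hypothetical (it is not the project's actual function), and returning 0 for an all-missing column is an assumption made here, not something stated in this thread.

```python
import pandas as pd


def max_string_length(column: pd.Series) -> int:
    """Return the longest string length in `column`, ignoring missing values.

    Hypothetical helper: the name and the 0 fallback are assumptions.
    """
    # .str.len() may yield float64 (with NaN) or Int64 (with <NA>) depending
    # on the column dtype; casting to the nullable Int64 dtype normalizes both.
    lengths = column.str.len().astype(pd.Int64Dtype())
    longest = lengths.max()  # missing values are skipped by default
    # Assumption: an empty or all-missing column reports a length of 0.
    return 0 if pd.isna(longest) else int(longest)


# Object column with a missing value: .str.len() gives float64 before the cast.
object_column = pd.Series(["a", None, "abc"], dtype=object)
# StringDtype column with a missing value: .str.len() gives Int64.
string_column = pd.Series(["a", None, "abc"], dtype=pd.StringDtype())

print(max_string_length(object_column))
print(max_string_length(string_column))
```

The cast keeps the result an integer whether missing values show up as NaN or as <NA>, so the caller does not need a separate int(...) conversion on a possibly float value.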

Collaborator Author

I added the comment and the astype conversion call.

Member

@folmos-at-orange folmos-at-orange left a comment

LGTM.

I would change the README for the minimum versions though (or eliminate that section).

@tramora tramora force-pushed the 536-adapt-to-pandas3 branch from f94e426 to 6fc418d Compare February 26, 2026 11:38
@tramora tramora requested a review from popescu-v February 26, 2026 11:42
Collaborator

@popescu-v popescu-v left a comment

One pending change remains (see the comment).

Thierry RAMORASOAVINA added 2 commits February 26, 2026 20:23
- For python 3.10, 2.3.3 is still used but with the new pandas StringDtype enabled
- For python 3.11+, the later 3.0.0+ versions are used
@tramora tramora force-pushed the 536-adapt-to-pandas3 branch from 6fc418d to d044263 Compare February 26, 2026 19:25
@tramora tramora requested a review from popescu-v February 26, 2026 19:26
Development

Successfully merging this pull request may close these issues.

Adapt to Pandas 3.0.0+

3 participants