Conversation
326a9e5 to
0162ef9
Compare
0162ef9 to
97de414
Compare
popescu-v
left a comment
There was a problem hiding this comment.
See the remaining comments.
fe55f7f to
f94e426
Compare
khiops/sklearn/dataset.py
Outdated
| if isinstance(column_numpy_type, pd.StringDtype): | ||
| column_max_size = column.str.len().max() | ||
| # Warning pandas.Series.str.len() returns a float64 | ||
| column_max_size = int(column.str.len().max()) |
There was a problem hiding this comment.
In [1]: import pandas as pd
In [2]: pd.__version__
Out[2]: '3.0.0'
In [3]: x = pd.DataFrame(["a", "b", "c"]).astype(pd.StringDtype())
In [4]: x[0].str.len().max()
Out[4]: np.int64(1)Hence, it seems than (on Pandas 3.0.0), .str.len() returns a numpy.int64.
There was a problem hiding this comment.
But unfortunately Series.str.len() returns a float64 and the previous code broke a specific test
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.len.html#pandas.Series.str.len
There was a problem hiding this comment.
This is only the case if at least one element is not of string type, in which case .str.len() results in a NaN:
In [1]: y = pd.Series(["a", 5, "c"])
In [2]: y.str.len()
Out[2]:
0 1.0
1 NaN
2 1.0
dtype: float64But this is no longer the case if all elements are properly typed to string:
In [3]: y = pd.Series(["a", 5, "c"]).astype(pd.StringDtype())
In [4]: y.str.len()
Out[4]:
0 1
1 1
2 1
dtype: Int64However, the code executed here is in the scope of an if isinstance(column.dtype, pd.StringDtype); hence, the return type should be Int64 AFAIU.
There was a problem hiding this comment.
I reverted to the previous version of the code, if a test break I'll have a look at the exact condition (if a NaN remains in the series)
There was a problem hiding this comment.
It breaks, because there are missing (StringDtype) values in the input dataframe. See also pandas-dev/pandas#51948. Hence, I would:
- add a comment before this line of code, stating the issue and referring to the aforementioned issue
- cast to a Pandas integer data type:
columns.str.len().astype(pd.Int64Dtype()).max().
There was a problem hiding this comment.
I commented and added the astype conversion call
folmos-at-orange
left a comment
There was a problem hiding this comment.
LGTM.
I would change the README for the minimum versions though (or eliminate that section).
f94e426 to
6fc418d
Compare
popescu-v
left a comment
There was a problem hiding this comment.
One single remaining pending change (see the comment).
- For python 3.10, 2.3.3 is still used but with the new pandas StringDtype enabled - For python 3.11+, the later 3.0.0+ versions are used
6fc418d to
d044263
Compare
Fixes #536
TODO Before Asking for a Review
main(ormain-v10)Unreleasedsection ofCHANGELOG.md(no date)index.html