BUG: Cannot convert existing column to categorical #52593
Comments
Our checks aren't strict enough. This is also a problem when trying to set string columns, probably everything when setting into object.
isn't this a case where the user should use …
Well, in our current codebase we have established that we want to be absolutely clear about which axis we are referring to when setting items, i.e. we explicitly use df.loc[X, :] = ... to set rows, and df.loc[:, X] to set columns. Otherwise, we had the issue that df["C"] in some cases inexplicably set the row instead of the column when we were dealing with pandas dataframes where index == columns (i.e. for transition matrices). In any case, it has worked this way since pandas 0.x, so I would assume that several other people may face the same issue; what's worse is that it just silently ignores the user's intention. The only reason we noticed this was due to automated nightly tests.
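For illustration, an explicit-axis sketch of the convention described above (the frame and values here are made up):

```python
import pandas as pd

# Hypothetical square frame where index == columns (e.g. a transition
# matrix), the case where bare df["A"] indexing becomes ambiguous.
df = pd.DataFrame([[0.9, 0.1], [0.2, 0.8]],
                  index=["A", "B"], columns=["A", "B"])

df.loc["A", :] = [0.5, 0.5]  # unambiguously sets row "A"
df.loc[:, "B"] = [0.3, 0.4]  # unambiguously sets column "B"
```

With explicit axes there is no guessing which of the two identically-labelled axes the assignment targets.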
@phofl can you expand on this? im not sure which checks you're referring to
I stumbled into this issue as well with other types:

```python
# %%
import pandas as pd

df = pd.DataFrame({
    'x': ["2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2012-01-05"],
    'y': [2, 4, 6, 8, 10],
})
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x       5 non-null      object
 1   y       5 non-null      int64
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes
```

```python
# %%
df.loc[:, "x"] = pd.to_datetime(df.loc[:, "x"])
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype
---  ------  --------------  -----
 0   x       5 non-null      object
 1   y       5 non-null      int64
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes
```

where the columns … I would personally argue that using …

```
A value is trying to be set on a copy of a slice from a DataFrame
See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
```

```python
df["x"][:] = 10
```

I, therefore, agree with the argument of @Zahlii regarding …
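To make the contrast concrete, a small sketch of my own contrasting chained assignment with a single `.loc` call (using an int column so the write is dtype-compatible):

```python
import pandas as pd

df = pd.DataFrame({"x": ["a", "b", "c"], "y": [1, 2, 3]})

# df["x"][:] = 10 is chained assignment: the write targets the
# intermediate object that df["x"] returns, which may be a copy,
# hence the SettingWithCopyWarning quoted above.

# A single .loc call addresses the frame directly:
df.loc[:, "y"] = 10
```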
Essentially, whenever your original column is object dtype, doing a … That seems like quite a serious usability regression, one we might not have fully thought through when making this change. I also think a lot of people (and teachers, tutorials, etc.) are recommending to use …
And we should maybe consider reverting the behaviour (partly? just for object dtype?) so we can make another attempt at adding a more focused deprecation warning? (The cases that have been reported in the several issues should be relatively easy to detect.)
In 1.5.3
What warning would you issue instead? I'd be against reverting, as the old behavior involved a ton of inconsistencies. I could be OK with adding a warning specific to object-dtype-and-full-slice cases.
Sorry, I was confused with another deprecation we reverted in 1.5.x (slicing with ints), for the deprecation here we in the end only changed it from FutureWarning to DeprecationWarning to make it less visible: #48673.
You mean adding a general UserWarning when you do …
Yes.
I would like to expand on this; I find it very confusing even with the above notice. Consider the following extended scenario. Tasks that are common in preprocessing, such as converting batches of columns based on e.g. a boolean mask, are no longer (easily) possible, since boolean masks no longer operate on columns; instead we have to go through df[df.columns[mask]].

```python
import pandas as pd
from pandas.core.dtypes.common import is_categorical_dtype

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B", "B"], categories=["A", "B"]),
    "B": [1, 2, 3],
    "C": ["D", "E", "E"],
}, index=["A", "B", "C"])
print(">Original")
print(x, "\n", x.dtypes)

print(">Set Categorical")
# doesn't work
x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x, "\n", x.dtypes)

print(">Set Categorical Direct")
# works
x["C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x, "\n", x.dtypes)

print(">Convert all Categories back to Str")
mask_cat = x.dtypes.map(is_categorical_dtype).values
# this won't work
x.loc[:, mask_cat] = x.loc[:, mask_cat].astype(str)
print(x, "\n", x.dtypes)

print(">Convert all Categories back to Str Direct")
# this won't work, either, because masks operate on ROWS
x[mask_cat] = x[mask_cat].astype(str)
print(x, "\n", x.dtypes)

print(">Convert all Categories back to Str via Columns")
# this works, but it is a pain in the ass
mask_cat_cols = x.columns[mask_cat]
x[mask_cat_cols] = x[mask_cat_cols].astype(str)
print(x, "\n", x.dtypes)
```
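As a possible shortcut (my own sketch, not from the thread): `select_dtypes` can pick out the categorical columns directly, avoiding the hand-built dtype mask:

```python
import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B", "B"]),
    "B": [1, 2, 3],
    "C": ["D", "E", "E"],
})

# select_dtypes returns a frame holding only the matching columns,
# so its .columns gives us the label list to assign through
cat_cols = x.select_dtypes(include="category").columns
x[cat_cols] = x[cat_cols].astype(str)
```

Assigning through a list of column labels replaces the columns wholesale, so the dtype change sticks.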
I'm facing the same issue. A simplified example:

…

I get an error: `Can only use .dt accessor with datetimelike values`. It's confusing. Is this behavior documented somewhere? It's a breaking change. Or is it a bug?
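The original example code did not survive extraction; here is a sketch consistent with the error described (the column name and data are my own):

```python
import pandas as pd

df = pd.DataFrame({"ts": ["2023-01-01", "2023-01-02"]})

converted = pd.to_datetime(df["ts"])

# On pandas 2.0.x the full-slice .loc assignment writes into the
# existing object column, so "ts" can stay object dtype...
df.loc[:, "ts"] = converted

# ...in which case df["ts"].dt.year raises:
# AttributeError: Can only use .dt accessor with datetimelike values
```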
The OP example looks like it works on main. Can anyone else confirm?
Still seems to fail for me
C still seems to be object.
I have the same issue:

```python
import pandas as pd

d = pd.DataFrame(dict(wanna_be_cat=['A', 'B', 'B']))

d.loc[:, 'wanna_be_cat'] = d['wanna_be_cat'].astype('category')
# 'wanna_be_cat' column not modified to become categorical
print(d['wanna_be_cat'].dtype)

d['wanna_be_cat'] = d['wanna_be_cat'].astype('category')
# 'wanna_be_cat' column modified to become categorical
print(d['wanna_be_cat'].dtype)
```

The fact that it works without …
I have a similar problem in this example with …

I would assume that both ways to insert …
Just wanted to convert dtype (float -> int) of multiple columns in my df and encountered this issue:
pandas 2.1.3
I have the same behavior as @n-splv on pandas 2.2.1, but if the correct solution is to not use …

Or is there a better way to change the dtype?
Facing the same issue. For the time being, I used:

```python
df = df.astype({col: "category" for col in categorical_cols})
```
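A minimal, self-contained demo of that workaround (the frame and column names here are placeholders):

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "y", "y"], "b": [1, 2, 3]})
categorical_cols = ["a"]

# astype with a dict returns a new frame with the requested dtypes,
# so no .loc assignment (and none of its dtype pitfalls) is involved
df = df.astype({col: "category" for col in categorical_cols})
```

Columns not listed in the dict keep their original dtype.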
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
When setting an existing column to its categorical equivalent, the underlying dtypes stay the same.
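The reproducible example did not survive extraction; the following is a reconstruction consistent with the expected outputs below (the column contents are my own):

```python
import pandas as pd

df = pd.DataFrame({
    "A": pd.Categorical(["a", "b", "b"]),
    "B": [1, 2, 3],
    "C": ["x", "y", "y"],
})
print(df.dtypes)

# On pandas 2.0.0 this assignment leaves C as object
# rather than converting it to category
df.loc[:, "C"] = df["C"].astype("category")
print(df.dtypes)
```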
Expected Behavior
Output now:

```
A    category
B       int64
C      object
dtype: object
A    category
B       int64
C      object
dtype: object
```

Expected output:

```
A    category
B       int64
C      object
dtype: object
A    category
B       int64
C    category  <----
dtype: object
```
Installed Versions
```
INSTALLED VERSIONS
------------------
commit           : 478d340
python           : 3.9.14.final.0
python-bits      : 64
OS               : Darwin
OS-release       : 22.4.0
Version          : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020
machine          : arm64
processor        : arm
byteorder        : little
LC_ALL           : None
LANG             : None
LOCALE           : None.UTF-8
pandas           : 2.0.0
numpy            : 1.23.2
pytz             : 2022.7.1
dateutil         : 2.8.2
setuptools       : 67.4.0
pip              : 23.0.1
Cython           : 0.29.33
pytest           : 7.2.2
hypothesis       : None
sphinx           : 6.1.3
blosc            : None
feather          : None
xlsxwriter       : 3.0.8
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : 2.9.5
jinja2           : 3.0.3
IPython          : None
pandas_datareader: None
bs4              : None
bottleneck       : None
brotli           : None
fastparquet      : None
fsspec           : 2023.1.0
gcsfs            : None
matplotlib       : 3.7.0
numba            : 0.56.4
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : 11.0.0
pyreadstat       : None
pyxlsb           : None
s3fs             : None
scipy            : 1.8.1
snappy           : None
sqlalchemy       : 1.4.46
tables           : None
tabulate         : 0.9.0
xarray           : None
xlrd             : None
zstandard        : None
tzdata           : 2022.7
qtpy             : None
pyqt5            : None
```