BUG: Cannot convert existing column to categorical #52593

Open
3 tasks done
Zahlii opened this issue Apr 11, 2023 · 19 comments
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves Regression Functionality that used to work in a prior pandas version
Milestone

Comments

@Zahlii

Zahlii commented Apr 11, 2023

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
    "B": [1,2],
    "C": ["D", "E"]
})
print(x.dtypes)
x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x.dtypes)

Issue Description

When setting an existing column to its categorical equivalent, the underlying dtypes stay the same.

Expected Behavior

Output now:

A category
B int64
C object
dtype: object
A category
B int64
C object
dtype: object

Expected output:
A category
B int64
C object
dtype: object
A category
B int64
C category <----
dtype: object
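For readers hitting this on pandas 2.x, here is a minimal sketch of two ways to replace the column so that the new dtype is kept. It is based on the workarounds reported later in this thread (whole-column assignment with `x["C"] = ...`, and `DataFrame.astype` with a per-column mapping). The names `y` and `z` are only illustrative.

```python
import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
    "B": [1, 2],
    "C": ["D", "E"],
})

# Option 1: assign the whole column by key. This swaps in the new array,
# so the categorical dtype is kept.
y = x.copy()
y["C"] = pd.Categorical(y["C"], categories=["D", "E"])

# Option 2: build a new frame with astype and a per-column dtype mapping.
z = x.astype({"C": pd.CategoricalDtype(categories=["D", "E"])})

print(y.dtypes)
print(z.dtypes)
```

Both options replace the column rather than writing values into the existing array, which is why the dtype of the new values survives.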

Installed Versions

INSTALLED VERSIONS

commit : 478d340
python : 3.9.14.final.0
python-bits : 64
OS : Darwin
OS-release : 22.4.0
Version : Darwin Kernel Version 22.4.0: Mon Mar 6 20:59:58 PST 2023; root:xnu-8796.101.5~3/RELEASE_ARM64_T6020
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.UTF-8

pandas : 2.0.0
numpy : 1.23.2
pytz : 2022.7.1
dateutil : 2.8.2
setuptools : 67.4.0
pip : 23.0.1
Cython : 0.29.33
pytest : 7.2.2
hypothesis : None
sphinx : 6.1.3
blosc : None
feather : None
xlsxwriter : 3.0.8
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : 2.9.5
jinja2 : 3.0.3
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2023.1.0
gcsfs : None
matplotlib : 3.7.0
numba : 0.56.4
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 11.0.0
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.8.1
snappy : None
sqlalchemy : 1.4.46
tables : None
tabulate : 0.9.0
xarray : None
xlrd : None
zstandard : None
tzdata : 2022.7
qtpy : None
pyqt5 : None

@Zahlii Zahlii added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 11, 2023
@phofl
Member

phofl commented Apr 11, 2023

commit 4e4be0bfa8f74b9d453aa4163d95660c04ffea0c
Author: jbrockmendel <jbrockmendel@gmail.com>
Date:   Wed Dec 21 11:57:24 2022 -0800

    DEPR: enforce inplaceness for df.loc[:, foo]=bar (#49775)
    
    * DEPR: enforce inplaceness for df.loc[:, foo]=bar

#49775

cc @jbrockmendel

Our checks aren't strict enough. This is also a problem when trying to set string columns, probably everything when setting into object.

@phofl phofl added Indexing Related to indexing on series/frames, not to indexes themselves and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 11, 2023
@phofl phofl added this to the 2.0.1 milestone Apr 11, 2023
@phofl phofl added the Regression Functionality that used to work in a prior pandas version label Apr 11, 2023
@jbrockmendel
Member

Isn't this a case where the user should use df["C"] = ... instead of df.loc[:, "C"] = ...?

@Zahlii
Author

Zahlii commented Apr 12, 2023

Isn't this a case where the user should use df["C"] = ... instead of df.loc[:, "C"] = ...?

Well, in our current codebase we have established that we want to be absolutely clear about which axis we are referring to when setting items, i.e. we explicitly use df.loc[X, :] = ... to set rows, and df.loc[:, X] = ... to set columns. Otherwise, we had the issue that df["C"] in some cases inexplicably set the row instead of the column when we were dealing with pandas dataframes where index = columns (i.e. for transition matrices). In any case, it has worked this way since pandas 0.x, so I would assume that several other people may face the same issue; what's worse is that it just silently ignores the user's intention. The only reason why we noticed this was due to automated nightly tests.
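A small sketch of the explicit-axis convention described above, assuming a toy transition matrix `tm` whose index equals its columns (the data is made up for illustration). It only shows which axis each spelling reads from; it does not try to reproduce the row/column mix-up mentioned in the comment.

```python
import pandas as pd

states = ["A", "B"]
# Hypothetical 2-state transition matrix: same labels on both axes.
tm = pd.DataFrame([[0.9, 0.1], [0.4, 0.6]], index=states, columns=states)

row_a = tm.loc["A", :]      # explicit row selection
col_a = tm.loc[:, "A"]      # explicit column selection
col_a_by_key = tm["A"]      # a single string key selects a column

print(row_a.tolist())
print(col_a.tolist())
```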

@jbrockmendel
Member

Our checks aren't strict enough. This is also a problem when trying to set string columns, probably everything when setting into object.

@phofl can you expand on this? I'm not sure which checks you're referring to.

@glemaitre
Contributor

I stumbled into this issue as well with other types:

# %%
import pandas as pd

df = pd.DataFrame({
    'x': ["2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2012-01-05"],
    'y': [2, 4, 6, 8, 10],
})
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       5 non-null      object
 1   y       5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes

# %%
df.loc[:, "x"] = pd.to_datetime(df.loc[:, "x"])
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5 entries, 0 to 4
Data columns (total 2 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   x       5 non-null      object
 1   y       5 non-null      int64 
dtypes: int64(1), object(1)
memory usage: 208.0+ bytes

where the column "x" has not been cast to datetime64[ns], as it was in pandas 1.5.

I would personally argue that using df["x"] = pd.to_datetime(df["x"]) is counter-intuitive since it is close to the issue of the chained indexing raising the "SettingWithCopyWarning":

A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  df["x"][:] = 10

I, therefore, agree with the argument of @Zahlii regarding .loc being explicit (cf. #52593 (comment)).
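As a hedged sketch of the alternatives for this datetime case: whole-column assignment by key, and `DataFrame.assign`, which names the column explicitly without any indexing syntax. The frame is a shortened version of the one above, and the names `by_key` and `by_assign` are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    "x": ["2012-01-01", "2012-01-02", "2012-01-03"],
    "y": [2, 4, 6],
})

# Whole-column assignment by key replaces the column, dtype included.
by_key = df.copy()
by_key["x"] = pd.to_datetime(by_key["x"])

# assign() returns a new frame and takes the column name as a keyword.
by_assign = df.assign(x=pd.to_datetime(df["x"]))

print(by_key.dtypes)
print(by_assign.dtypes)
```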

@jorisvandenbossche
Member

jorisvandenbossche commented May 31, 2023

Essentially, whenever your original column is object dtype, doing df.loc[:, "col"] = ... will now never preserve the dtype of the values you are setting, because you can set anything in an object dtype array.

That seems like quite a serious usability regression, one that we might not have fully thought through when making this change.

I also think a lot of people (and teachers, tutorials, etc.) recommend using loc to be more explicit, so the df.loc[:, "col"] = ... pattern is very widespread.
If we decide to keep this change as is, we should at least give this a lot more visibility in our docs and release notes.

@jorisvandenbossche
Member

And we should maybe consider reverting the behaviour (partly? just for object dtype?) so we can make another attempt at adding a more focused deprecation warning? (The cases that have been reported in the several issues should be relatively easy to detect.)

@jbrockmendel
Member

And we should maybe consider reverting the behaviour (partly? just for object dtype?) so we can make another attempt at adding a more focused deprecation warning?

In 1.5.3

df = pd.DataFrame({
    'x': ["2012-01-01", "2012-01-02", "2012-01-03", "2012-01-04", "2012-01-05"],
    'y': [2, 4, 6, 8, 10],
})

>>> df.loc[:, "x"] = pd.to_datetime(df.loc[:, "x"])
<stdin>:1: DeprecationWarning: In a future version, `df.iloc[:, i] = newvals` will attempt to set the values inplace instead of always setting a new array. To retain the old behavior, use either `df[df.columns[i]] = newvals` or, if columns are non-unique, `df.isetitem(i, newvals)`

What warning would you issue instead?

I'd be against reverting as the old behavior involved a ton of inconsistencies. I could be OK with adding a warning specific to object-dtype-and-full-slice cases.

@jorisvandenbossche
Member

Sorry, I was confused with another deprecation we reverted in 1.5.x (slicing with ints), for the deprecation here we in the end only changed it from FutureWarning to DeprecationWarning to make it less visible: #48673.
So yes, we certainly still did raise a warning in 1.5.x (although it seems we never fixed the message, as that is still giving the alternative with positional indices).

I could be OK with adding a warning specific to object-dtype-and-full-slice cases.

You mean adding a general UserWarning when you do df.loc[:, "col"] = ... for an object dtype column (and setting with an object that has a specific (and non-object) dtype), on the assumption that you basically never want to do this (because now you always lose the dtype of the values you are setting, which is probably never going to be the intent of the user)?

@jbrockmendel
Member

You mean adding a general UserWarning when you do df.loc[:, "col"] = ... for an object dtype column (and setting with an object that has a specific (and non-object) dtype), on the assumption that you basically never want to do this (because now you always lose the dtype of the values you are setting, which is probably never going to be the intent of the user)?

Yes.
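As a user-side illustration of the case such a warning would target (an object-dtype column, a full-slice assignment, and a value that carries a specific non-object dtype), here is a minimal sketch. The helper `would_lose_dtype` is hypothetical and not part of pandas; it only mirrors the condition described above and says nothing about how pandas itself would implement the check.

```python
import pandas as pd
from pandas.api.types import is_object_dtype


def would_lose_dtype(df, col, values):
    """Hypothetical helper (not a pandas API).

    True if `col` is an object column and `values` carries a specific,
    non-object dtype that a full-slice .loc assignment would not keep.
    """
    value_dtype = getattr(values, "dtype", None)
    if value_dtype is None:
        return False  # plain lists and scalars carry no dtype to lose
    return is_object_dtype(df[col].dtype) and not is_object_dtype(value_dtype)


# The column is built with an explicit object dtype for the illustration.
df = pd.DataFrame({"C": pd.Series(["D", "E"], dtype=object)})
new_values = pd.Categorical(df["C"], categories=["D", "E"])

if would_lose_dtype(df, "C", new_values):
    print("object column + dtyped values: assigning with df[col] instead")
    df["C"] = new_values  # whole-column assignment keeps the dtype

print(df.dtypes)
```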

@Zahlii
Author

Zahlii commented Jun 23, 2023

I would like to expand on this, I find it very confusing even with the above notice.
I did some further testing, and for me it is VERY confusing that df[ABC] operates on columns as long as dtype(ABC) == dtype(columns), but on rows if otherwise, i.e. with boolean indexing.

Consider the following extended scenario. Tasks that are common in preprocessing, such as converting batches of columns selected by a boolean mask, are no longer (easily) possible, because boolean masks can no longer be used to operate on columns; instead we have to use df[df.columns[mask]].

import pandas as pd
from pandas.core.dtypes.common import is_categorical_dtype

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B", "B"], categories=["A", "B"]),
    "B": [1,2, 3],
    "C": ["D", "E", "E"]
}, index=["A", "B", "C"])
print(">Original")
print(x, "\n", x.dtypes)
print(">Set Categorical")
# doesn't work
x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x, "\n", x.dtypes)

print(">Set Categorical Direct")
# works
x["C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
print(x, "\n", x.dtypes)

print(">Convert all Categories back to Str")
mask_cat = x.dtypes.map(is_categorical_dtype).values
# this won't work
x.loc[:, mask_cat] = x.loc[:, mask_cat].astype(str)
print(x, "\n", x.dtypes)

print(">Convert all Categories back to Str Direct")
# this won't work, either, because masks operate on ROWS
x[mask_cat] = x[mask_cat].astype(str)
print(x, "\n", x.dtypes)


print(">Convert all Categories back to Str via Columns")
# this works, but it is a pain in the ass
mask_cat_cols = x.columns[mask_cat]
x[mask_cat_cols] = x[mask_cat_cols].astype(str)
print(x, "\n", x.dtypes)
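One possible way to shorten the last variant is to select the columns by dtype and convert them with a single `astype` mapping. This is only a sketch under the same toy data; `cat_cols` is an illustrative name, and `select_dtypes("category")` stands in for the boolean mask used above.

```python
import pandas as pd

x = pd.DataFrame({
    "A": pd.Categorical(["A", "B", "B"], categories=["A", "B"]),
    "B": [1, 2, 3],
    "C": ["D", "E", "E"],
}, index=["A", "B", "C"])
x["C"] = pd.Categorical(x["C"], categories=["D", "E"])

# Column labels of all categorical columns.
cat_cols = x.select_dtypes("category").columns

# astype with a {column: dtype} mapping converts only those columns and
# returns a new frame, so no .loc assignment is involved.
x = x.astype({col: str for col in cat_cols})
print(x, "\n", x.dtypes)
```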

@lithomas1 lithomas1 modified the milestones: 2.0.3, 2.0.4 Jun 27, 2023
@aalyousfi

I'm facing the same issue. A simplified example:

df["month"] = pd.to_datetime(df["month"])
some_date = datetime.utcnow().date() - relativedelta(months=25)
df = df.loc[df.month.dt.date >= some_date]

I get an error:

Can only use .dt accessor with datetimelike values.

It's confusing. Is this behavior documented somewhere? It's a breaking change. Or is it a bug?

@lithomas1 lithomas1 modified the milestones: 2.0.4, 2.1.1 Aug 30, 2023
@jbrockmendel
Member

The OP example looks like it works on main. Can anyone else confirm?

@lithomas1 lithomas1 modified the milestones: 2.1.1, 2.1.2 Sep 21, 2023
@lithomas1
Member

Still seems to fail for me

>>> x = pd.DataFrame({
...     "A": pd.Categorical(["A", "B"], categories=["A", "B"]),
...     "B": [1,2],
...     "C": ["D", "E"]
... })
>>> print(x.dtypes)
A    category
B       int64
C      object
dtype: object
>>> x.loc[:, "C"] = pd.Categorical(x.loc[:, "C"], categories=["D", "E"])
>>> print(x.dtypes)
A    category
B       int64
C      object
dtype: object

C still seems to be object.

@choucavalier
Contributor

I have the same issue

import pandas as pd

d = pd.DataFrame(dict(wanna_be_cat=['A', 'B', 'B']))

d.loc[:, 'wanna_be_cat'] = d['wanna_be_cat'].astype('category')

# 'wanna_be_cat' column not modified to become categorical
print(d['wanna_be_cat'].dtype)

d['wanna_be_cat'] = d['wanna_be_cat'].astype('category')

# 'wanna_be_cat' column modified to become categorical
print(d['wanna_be_cat'].dtype)

The fact that it works without .loc but doesn't with .loc is very confusing.

@ManuelNavarroGarcia

ManuelNavarroGarcia commented Nov 22, 2023

I have a similar problem in this example with pandas == 2.1.2:

>>> import pandas as pd
>>> df = pd.DataFrame({"a": [1, 2, 3], "b": [[1, 2, 3], [4, 5], [6]]})
>>> cols = df.select_dtypes("integer").columns
>>> df.loc[:, cols].dtypes
a    int64
dtype: object
>>> df.loc[:, cols] = df.loc[:, cols].apply(pd.to_numeric, downcast="integer")
>>> df.loc[:, cols].dtypes
a    int64
dtype: object
>>> df[cols] = df[cols].apply(pd.to_numeric, downcast="integer")
>>> df.loc[:, cols].dtypes
a    int8
dtype: object

I would assume that both ways to insert cols to the original DataFrame should be equivalent, but they are not.

@lithomas1 lithomas1 modified the milestones: 2.1.4, 2.2 Dec 8, 2023
@lithomas1 lithomas1 modified the milestones: 2.2, 2.2.1 Jan 20, 2024
@n-splv

n-splv commented Feb 12, 2024

Just wanted to convert dtype (float -> int) of multiple columns in my df and encountered this issue:

# This doesn't work
df.iloc[:, 2:] = df.iloc[:, 2:].astype(int)

# Neither does this
df.loc[:, df.columns[2:]] = df.loc[:, df.columns[2:]].astype(int)

# This works
for col in df.columns[2:]:
    df[col] = df[col].astype(int)

pandas 2.1.3
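A loop-free variant of the working version, sketched with a made-up frame (the column names `id`, `name`, `v1`, `v2` are assumptions for illustration): `DataFrame.astype` accepts a `{column: dtype}` mapping and returns a new frame.

```python
import pandas as pd

df = pd.DataFrame({
    "id": ["a", "b"],
    "name": ["x", "y"],
    "v1": [1.0, 2.0],
    "v2": [3.0, 4.0],
})

# Only the columns listed in the mapping are converted; the result is a
# new frame that is bound back to `df`.
df = df.astype({col: int for col in df.columns[2:]})
print(df.dtypes)
```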

@lithomas1 lithomas1 modified the milestones: 2.2.1, 2.2.2 Feb 23, 2024
@bcrotty

bcrotty commented Mar 14, 2024

I have the same behavior as @n-splv on pandas 2.2.1, but if the correct solution is to not use .loc, then I don't think there should be a SettingWithCopyWarning.

df = pd.DataFrame({'number': [1.0, 2.0, 3.0]})
cp = df.loc[df["number"] % 2 == 1]

# This does not change the dtype
cp.loc[:, "number"] = cp["number"].astype(int)

# This changes the dtype but produces a SettingWithCopyWarning
cp["number"] = cp["number"].astype(int)

Or is there a better way to change the dtype?
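One commonly used pattern for this situation, sketched here rather than offered as an official recommendation, is to take an explicit copy of the filtered frame before assigning, or to convert while selecting. On pandas 2.x the explicit `.copy()` should make `cp` an independent frame, so the whole-column assignment is not flagged as a possible write to a slice. The name `cp2` is illustrative.

```python
import pandas as pd

df = pd.DataFrame({"number": [1.0, 2.0, 3.0]})

# Explicit copy of the filtered rows, then whole-column assignment.
cp = df.loc[df["number"] % 2 == 1].copy()
cp["number"] = cp["number"].astype(int)

# Alternative: convert while selecting, with no separate assignment step.
cp2 = df.loc[df["number"] % 2 == 1].astype({"number": int})

print(cp.dtypes)
print(cp2.dtypes)
```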

@yuanx749
Contributor

Facing the same issue. For the time being, I used:

df = df.astype({col: "category" for col in categorical_cols})
