Assigning scalars to nullable dataframe columns reverts to default dtypes. #31763

Open
ghost opened this issue Feb 6, 2020 · 11 comments
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); Needs Discussion (requires discussion from core team before further action); Strings (string extension data type and string data)

Comments

@ghost

ghost commented Feb 6, 2020

Code Sample, a copy-pastable example if possible

In [2]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, dtype='string')

In [3]: df.dtypes
Out[3]:
A    string
dtype: object

In [4]: df['A'] = 'test'

In [5]: df.dtypes
Out[5]:
A    object
dtype: object

In [6]: df['A'] = pd.Series(['test'] * len(df), dtype='string', index=df.index)

In [7]: df
Out[7]:
      A
0  test
1  test
2  test

In [8]: df.dtypes
Out[8]:
A    string
dtype: object

# Don't need to set the index with pd.array()
In [9]: df['A'] = pd.array(['test'] * len(df), dtype='string')

In [40]: df.dtypes
Out[40]:
A    string
dtype: object

Problem description

Because the new nullable dtypes are not yet the default, it is tricky to keep a column's dtype nullable. Convenient shortcuts for assigning a repeated value to a DataFrame column infer the default dtype, even when the column currently has a nullable dtype and the new data can be represented in that same dtype. I'm not sure whether it's best to wait until the nullable dtypes become the default and live with the workarounds shown above for now, or whether another keyword could be added to df.assign() to specify the intended dtype of a column.

Alternatively, is there any scope for adding dtype information to scalars?
Something like:

df['A'] = pd.typed('test', dtype='string')
# OR
df['A'] = pd.StringDtype('test')
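
For what it's worth, the workarounds from the code sample can be wrapped in a small helper until something like the above exists. This is only a minimal sketch: `set_scalar_keep_dtype` is a hypothetical name, not a pandas API, and it assumes the scalar is representable in the column's current dtype.

```python
import pandas as pd


def set_scalar_keep_dtype(df, column, value):
    """Hypothetical helper (not part of pandas): fill `column` with `value`
    while reusing the column's current dtype instead of re-inferring it."""
    dtype = df[column].dtype
    # Build a full-length Series with an explicit dtype so that no
    # inference from the bare scalar takes place.
    df[column] = pd.Series([value] * len(df), index=df.index, dtype=dtype)
    return df


df = pd.DataFrame({"A": ["a", "b", "c"]}, dtype="string")
set_scalar_keep_dtype(df, "A", "test")
print(df.dtypes)
```

The helper does the same thing as the `pd.Series(..., dtype='string', index=df.index)` line in the sample, just without spelling out the dtype by hand.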
@TomAugspurger
Contributor

assign can't take additional kwargs. All the keywords are new column names.

We should be able to fix DataFrame.__setitem__ rather than introduce workarounds.

@jreback
Contributor

jreback commented Feb 6, 2020

this is as expected now; a single string is inferred to object type; so you have to be explicit about the dtypes

-1 on making this more complex with options

this should work (though I am not sure if we actually broadcast this)

df['A'] = pd.array(['test'], dtype='string')

@TomAugspurger
Contributor

TomAugspurger commented Feb 6, 2020

> this is as expected now; a single string is inferred to object type; so you have to be explicit about the dtypes

Don't we have logic about whether an existing block can hold a new value? Or does that not work for scalars?

@jreback
Contributor

jreback commented Feb 6, 2020

> > this is as expected now; a single string is inferred to object type; so you have to be explicit about the dtypes

> Don't we have logic about whether an existing block can hold a new value? Or does that not work for scalars?

no, when you assign a new column the dtype is completely new

however with .loc it does consider the existing dtype (though I think we may have changed this)

@ghost
Author

ghost commented Feb 6, 2020

@jreback I tried df['A'] = pd.array(['test'], dtype='string'), but I got ValueError: Length of values does not match length of index.
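
A sketch of one way around that length error, assuming the goal is simply to repeat the scalar for every row: the `pd.Series` constructor repeats a scalar to match the `index` it is given, so the dtype can be stated once without building the repeated list by hand. The variable `length_error` is only there to capture the message for illustration.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"]}, dtype="string")

# A length-1 array is not broadcast on column assignment; this reproduces
# the error reported above.
length_error = None
try:
    df["A"] = pd.array(["test"], dtype="string")
except ValueError as exc:
    length_error = exc
print("length-1 array:", length_error)

# A scalar passed to pd.Series together with an index is repeated to the
# index's length, and the explicit dtype is kept.
df["A"] = pd.Series("test", index=df.index, dtype="string")
print(df.dtypes)
```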

@jorisvandenbossche
Member

> however with .loc it does consider (though i think we may have changed this)

I thought so as well (or at least I seem to remember discussion about this), but it seems loc also does full type inference again:

In [48]: df = pd.DataFrame({'A': [0.1,0.2,0.3]}) 

In [49]: df.dtypes  
Out[49]: 
A    float64
dtype: object

In [50]: df.loc[:, 'A'] = 1

In [51]: df.dtypes
Out[51]: 
A    int64
dtype: object

That would have been a nice workaround otherwise ..

@jreback
Contributor

jreback commented Feb 7, 2020

yeah we changed this a while back to make .loc[:] behave the same as setitem

@jorisvandenbossche
Member

If you explicitly specify the full slice (with a start and/or end) instead of the implicit full slice (:), then the data type is actually preserved. For example with iloc:

In [25]: df = pd.DataFrame({'A': [0.1,0.2,0.3]})   

In [26]: df.dtypes   
Out[26]: 
A    float64
dtype: object

# df.iloc[0:, 0] also works
In [27]: df.iloc[0:-1, 0] = 1       

In [28]: df.dtypes   
Out[28]: 
A    float64
dtype: object

Or with loc if you know the labels (or could do something like df.index[0]:df.index[1]):

In [31]: df.loc[0:, 'A'] = 1   

In [32]: df.dtypes 
Out[32]: 
A    float64
dtype: object

(BTW, in retrospect, such corner cases seem to indicate to me that the change of letting .loc[:, 'col'] do type inference like setitem was maybe not the best idea ..)


Now, this doesn't work yet for strings, as there is a bug in setitem .. -> #31772
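
Since the thread notes that this behaviour was changed at some point, it may be worth probing what an installed pandas actually does rather than relying on the outputs above. A minimal sketch, with hypothetical helper names (`dtype_after`, `via_setitem`, and so on are not pandas functions); it deliberately prints the results instead of asserting a particular outcome:

```python
import pandas as pd


def dtype_after(assign):
    """Apply `assign` to a fresh float64 frame and report column A's dtype."""
    df = pd.DataFrame({"A": [0.1, 0.2, 0.3]})
    assign(df)
    return str(df["A"].dtype)


def via_setitem(df):
    df["A"] = 1


def via_loc_implicit(df):
    df.loc[:, "A"] = 1


def via_loc_explicit(df):
    df.loc[0:, "A"] = 1


def via_iloc_explicit(df):
    df.iloc[0:, 0] = 1


# Map each assignment style to the dtype it leaves behind on this install.
results = {
    fn.__name__: dtype_after(fn)
    for fn in (via_setitem, via_loc_implicit, via_loc_explicit, via_iloc_explicit)
}

print("pandas", pd.__version__)
for name, dtype in results.items():
    print(f"{name:>18}: {dtype}")
```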

@glyg
Contributor

glyg commented Feb 14, 2020

Hi, I'm posting this here because it seems related. Since 1.0.1, loc doesn't preserve data types when selecting a single index label from a DataFrame with mixed data types:

import pandas as pd
import numpy as np

not_preserved = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list("ABC"))
not_preserved['F'] = np.random.random(8)

print("Mixed datatype DataFrame")
print(not_preserved.dtypes)

print("Single value loc: ")
print(not_preserved.loc[0, ['A', 'B', 'C']].dtypes)

print("Single value through slice: ")
print(not_preserved.loc[0:0, ['A', 'B', 'C']].dtypes)

# This prints:
# Mixed datatype DataFrame
# A      int64
# B      int64
# C      int64
# F    float64
# dtype: object
# Single value loc: 
# float64
# Single value through slice: 
# A    int64
# B    int64
# C    int64
# dtype: object

Should I open an issue, or is this a different manifestation of the problem discussed here?
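
A possible way to sidestep the upcast, sketched under the assumption that only the integer columns are needed: a Series holds a single dtype, so a row spanning int64 and float64 columns comes back as float64, whereas selecting the integer columns first leaves nothing to upcast. Whether `.loc[0, ['A', 'B', 'C']]` itself upcasts may depend on the pandas version, so it is not asserted here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(24, dtype="int64").reshape((8, 3)), columns=list("ABC"))
df["F"] = np.random.random(8)

# A full row mixes int64 and float64 columns, so the resulting Series
# has to use one common dtype.
full_row = df.loc[0]
print("full row:", full_row.dtype)

# Selecting the integer columns first yields an all-int64 frame, so the
# row taken from it keeps the integer dtype.
ints_first = df[["A", "B", "C"]].loc[0]
print("columns first:", ints_first.dtype)
```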

@glyg
Contributor

glyg commented Feb 14, 2020

This is also true in 0.24.1; it seems that what changed was the behavior of the replace method downstream in my code.

@Dr-Irv
Contributor

Dr-Irv commented Sep 9, 2020

> Should I open an issue, or is this a different manifestation of the problem discussed here?

@glyg I suggest opening up a new issue.

@Dr-Irv added the Strings label Sep 9, 2020
@mroeschke added the Bug, Indexing, and Needs Discussion labels Jul 28, 2021
6 participants