Assigning scalars to nullable dataframe columns reverts to default dtypes. #31763

ghost · 2020-02-06T22:49:44Z

Code Sample, a copy-pastable example if possible

In [2]: df = pd.DataFrame({'A': ['a', 'b', 'c']}, dtype='string')

In [3]: df.dtypes
Out[3]:
A    string
dtype: object

In [4]: df['A'] = 'test'

In [5]: df.dtypes
Out[5]:
A    object
dtype: object

In [6]: df['A'] = pd.Series(['test'] * len(df), dtype='string', index=df.index)

In [7]: df
Out[7]:
      A
0  test
1  test
2  test

In [8]: df.dtypes
Out[8]:
A    string
dtype: object

# Don't need to set the index with pd.array()
In [9]: df['A'] = pd.array(['test'] * len(df), dtype='string')

In [40]: df.dtypes
Out[40]:
A    string
dtype: object

Problem description

Because the new nullable dtypes are not yet the default, it is tricky to keep columns on them. The convenient shortcuts for assigning a repeated value to a dataframe column infer the default dtype, even when the column already has a nullable dtype and the new data can be represented in that same dtype. I'm not sure whether it's best to wait until the nullable dtypes become the default and live with the workarounds shown above for now, or whether another keyword could be added to df.assign() to specify the intended dtype of a column.

Alternatively, is there any scope for adding dtype information to scalars?
Something like:

df['A']  = pd.typed('test', dtype='string')
# OR
df['A'] = pd.StringDtype('test')


TomAugspurger · 2020-02-06T23:05:22Z

assign can't take additional kwargs. All the keywords are new column names.

We should be able to fix DataFrame.__setitem__ rather than introduce workarounds.
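A minimal sketch of why a dtype keyword cannot be added to assign: every keyword argument is treated as a new column name, so a keyword called dtype simply becomes a column. The frame below mirrors the one in the report.

```python
import pandas as pd

df = pd.DataFrame({"A": ["a", "b", "c"]}, dtype="string")

# DataFrame.assign treats every keyword as a column name, so this
# adds a column literally called "dtype" filled with the value "string".
result = df.assign(dtype="string")
print(list(result.columns))
```

The original frame is left unchanged, since assign returns a new object.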

jreback · 2020-02-06T23:07:23Z

this is as expected now; a single string is inferred to object type; so you have to be explicit about the dtypes

-1 on making this more complex with options

this should work (though i am not sure if we actually broadcast this)

df['A'] = pd.array(['test'], dtype='string')

TomAugspurger · 2020-02-06T23:08:10Z

> this is as expected now; a single string is inferred to object type; so you have to be explicit about the dtypes

Don't we have logic about whether an existing block can hold a new value? Or does that not work for scalars?

jreback · 2020-02-06T23:09:48Z

> > this is as expected now; a single string is inferred to object type; so you have to be explicit about the dtypes

> Don't we have logic about whether an existing block can hold a new value? Or does that not work for scalars?

no, when you assign a new column the dtype is completely new

however with .loc it does consider the existing dtype (though i think we may have changed this)

ghost · 2020-02-06T23:18:11Z

@jreback I tried df['A'] = pd.array(['test'], dtype='string'), but I got ValueError: Length of values does not match length of index.
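Since pd.array does not broadcast a length-1 array, the scalar has to be repeated by hand. A minimal sketch of a wrapper for the workaround from the original report follows; assign_scalar_keep_dtype is a hypothetical helper name, not part of pandas, and it assumes the value can be represented in the column's current dtype.

```python
import pandas as pd


def assign_scalar_keep_dtype(df, column, value):
    """Hypothetical helper (not a pandas API): fill `column` with `value`
    while reusing the dtype the column already has."""
    # pd.array does not broadcast, so repeat the scalar to the frame's length.
    df[column] = pd.array([value] * len(df), dtype=df[column].dtype)
    return df


df = pd.DataFrame({"A": ["a", "b", "c"]}, dtype="string")
assign_scalar_keep_dtype(df, "A", "test")
print(df.dtypes)
```

Reading the dtype from the existing column avoids hard-coding 'string' at each call site.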

jorisvandenbossche · 2020-02-07T07:25:17Z

> however with .loc it does consider (though i think we may have changed this)

I thought so as well (or at least I seem to remember discussion about this), but it seems loc also does full type inference again:

In [48]: df = pd.DataFrame({'A': [0.1,0.2,0.3]}) 

In [49]: df.dtypes  
Out[49]: 
A    float64
dtype: object

In [50]: df.loc[:, 'A'] = 1

In [51]: df.dtypes
Out[51]: 
A    int64
dtype: object

That would have been a nice workaround otherwise ..

jreback · 2020-02-07T11:32:47Z

yeah we changed this a while back to make .loc[:] behave the same as setitem

jorisvandenbossche · 2020-02-07T11:50:47Z

If you specify explicitly the full slice (with start and/or end) instead of the implicit full slice (:), then it actually preserves the data type. For example with iloc:

In [25]: df = pd.DataFrame({'A': [0.1,0.2,0.3]})   

In [26]: df.dtypes   
Out[26]: 
A    float64
dtype: object

# df.iloc[0:, 0] also works
In [27]: df.iloc[0:-1, 0] = 1       

In [28]: df.dtypes   
Out[28]: 
A    float64
dtype: object

Or with loc if you know the labels (or could do something like df.index[0]:df.index[1]):

In [31]: df.loc[0:, 'A'] = 1   

In [32]: df.dtypes 
Out[32]: 
A    float64
dtype: object

(BTW, in retrospect, such corner cases seem to indicate to me that the change of letting .loc[:, 'col'] do type inference like setitem was maybe not the best idea ..)

Now, this doesn't work yet for strings, as there is a bug in setitem .. -> #31772
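As a standalone script, the numeric case above can be sketched like this. The explicit-slice result matches the session shown; the implicit-slice result is printed without an expected value because it depends on the pandas version in use.

```python
import pandas as pd

# Explicit full slice: values are written into the existing float64 column.
df = pd.DataFrame({"A": [0.1, 0.2, 0.3]})
df.iloc[0:, 0] = 1
explicit_dtype = df["A"].dtype
print("explicit slice:", explicit_dtype)

# Implicit full slice, for comparison. Whether this re-infers the dtype
# depends on the installed pandas version, so no output is assumed here.
df2 = pd.DataFrame({"A": [0.1, 0.2, 0.3]})
df2.loc[:, "A"] = 1
print("implicit slice:", df2["A"].dtype)
```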

glyg · 2020-02-14T11:07:10Z

Hi, I'm posting this here because it seems related. ~~since 1.0.1~~, loc doesn't preserve data types with a single index from a mixed datatypes dataframe:

import pandas as pd
import numpy as np

not_preserved = pd.DataFrame(np.arange(24).reshape((8, 3)), columns=list("ABC"))
not_preserved['F'] = np.random.random(8)

print("Mixed datatype DataFrame")
print(not_preserved.dtypes)

print("Single value loc: ")
print(not_preserved.loc[0, ['A', 'B', 'C']].dtypes)

print("Single value through slice: ")
print(not_preserved.loc[0:0, ['A', 'B', 'C']].dtypes)

# This prints:
# Mixed datatype DataFrame
# A      int64
# B      int64
# C      int64
# F    float64
# dtype: object
# Single value loc: 
# float64
# Single value through slice: 
# A    int64
# B    int64
# C    int64
# dtype: object

Should I open an issue, or is this a different manifestation of the problem discussed here?
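One possible way around the upcast, sketched under the assumption that it comes from taking the row out of the mixed-dtype frame before the columns are selected: pick the integer columns first, then the row. The name mixed is only illustrative, and the int64 dtype is set explicitly so the result does not depend on the platform default.

```python
import numpy as np
import pandas as pd

mixed = pd.DataFrame(
    np.arange(24, dtype="int64").reshape((8, 3)), columns=list("ABC")
)
mixed["F"] = np.random.random(8)

# Select the all-integer columns first, then take the row from that frame.
row = mixed[["A", "B", "C"]].loc[0]
print(row.dtype)
```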

glyg · 2020-02-14T11:17:43Z

This is also true in 0.24.1; it seems that what changed was the behavior of the replace method downstream in my code.

Dr-Irv · 2020-09-09T12:21:09Z

> Should I open an issue, or is this a different manifestation of the problem discussed here?

@glyg I suggest opening up a new issue.

ghost mentioned this issue Feb 6, 2020

BUG: DataFrame.convert_dtypes fails on column that is already "string" dtype #31731

Closed

TomAugspurger mentioned this issue Jun 16, 2020

BUG: DataFrame.__setitem__ creates object-dtype array for extension type scalars #34832

Closed

3 tasks

Dr-Irv added the Strings String extension data type and string data label Sep 9, 2020

glyg mentioned this issue Sep 9, 2020

[BUG] loc does not preserve datatype with a single element #36247

Closed

mroeschke added Bug Indexing Related to indexing on series/frames, not to indexes themselves Needs Discussion Requires discussion from core team before further action labels Jul 28, 2021
