Skip to content

Commit 06fcb4c

Browse files
author
MarcoGorelli
committed
[skip-ci] pdep6 draft
1 parent 9993ec4 commit 06fcb4c

File tree

1 file changed

+140
-0
lines changed

1 file changed

+140
-0
lines changed
Lines changed: 140 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,140 @@
1+
# PDEP-6: Ban upcasting in setitem-like operations
2+
3+
- Created: 23 December 2022
4+
- Status: Draft
5+
- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402)
6+
- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche))
7+
- Revision: 1
8+
9+
## Abstract
10+
11+
The suggestion is that setitem-like operations would
12+
not change a ``Series``' dtype.
13+
14+
Current behaviour:
15+
```python
16+
In [1]: ser = pd.Series([1, 2, 3], dtype='int64')
17+
18+
In [2]: ser[2] = 'potage'
19+
20+
In [3]: ser # dtype changed to 'object'!
21+
Out[3]:
22+
0 1
23+
1 2
24+
2 potage
25+
dtype: object
26+
```
27+
28+
Suggested behaviour:
29+
30+
```python
31+
In [1]: ser = pd.Series([1, 2, 3])
32+
33+
In [2]: ser[2] = 'potage' # raises!
34+
---------------------------------------------------------------------------
35+
TypeError: Invalid value 'potage' for dtype int64
36+
```
37+
38+
## Motivation and Scope
39+
40+
Currently, pandas is extremely flexible in handling different dtypes.
41+
However, this can potentially hide bugs, break user expectations, and unnecessarily copy data.
42+
43+
An example of it hiding a bug is:
44+
```python
45+
In [9]: ser = pd.Series(pd.date_range('2000', periods=3))
46+
47+
In [10]: ser[2] = '2000-01-04' # works, is converted to datetime64
48+
49+
In [11]: ser[2] = '2000-01-04x' # almost certainly a typo - but pandas doesn't error, it upcasts to object
50+
```
51+
52+
The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as:
53+
- ``ser[0] == 2``;
54+
- ``ser.fillna(0, inplace=True)``;
55+
- ``ser.where(ser.isna(), 0, inplace=True)``
56+
57+
There may be more. What is explicitly excluded from this PDEP is any operation would have no change
58+
of operating inplace to begin with, such as:
59+
- ``ser.diff()``;
60+
- ``pd.concat([ser, other])``;
61+
- ``ser.mean()``.
62+
63+
These would keep being allowed to change Series' dtypes.
64+
65+
## Detailed description
66+
67+
Concretely, the suggestion is:
68+
- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype;
69+
- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise.
70+
71+
For a start, this would involve:
72+
73+
1. changing ``Block.setitem`` such that it doesn't have an ``except`` block in
74+
75+
```python
76+
value = extract_array(value, extract_numpy=True)
77+
try:
78+
casted = np_can_hold_element(values.dtype, value)
79+
except LossySetitemError:
80+
# current dtype cannot store value, coerce to common dtype
81+
nb = self.coerce_to_target_dtype(value)
82+
return nb.setitem(indexer, value)
83+
else:
84+
```
85+
86+
2. making a similar change in ``Block.where``, ``EABlock.setitem``, ``EABlock.where``, and probably more places.
87+
88+
The above would already require several hundreds of tests to be adjusted.
89+
90+
### Ban upcasting altogether, or just upcasting to ``object``?
91+
92+
The trickiest part of this proposal concerns what to do when setting a float in an integer column:
93+
94+
```python
95+
In [1]: ser = pd.Series([1, 2, 3])
96+
97+
In [2]: ser[0] = 1.5
98+
```
99+
100+
Possibly options could be:
101+
1. just raise;
102+
2. convert the float value to ``int``, preserve the Series' dtype;
103+
3. upcast to ``float``, even if upcasting in setitem-like is banned for other conversions.
104+
105+
Let us compare with what other libraries do:
106+
- ``numpy``: option 2
107+
- ``cudf``: option 2
108+
- ``polars``: option 2
109+
- ``R data.frame``: option 3
110+
- ``pandas`` (nullable dtype): option 1
111+
112+
If the objective of this PDEP is to prevent bugs, then option 2 is also not desirable:
113+
someone might set ``1.5`` and later be surprised to learn that they actually set ``1``.
114+
115+
Option ``3`` would be inconsistent with the nullable dtypes' behaviour, would add complexity
116+
to the codebase and to tests, and would be confusing to teach.
117+
118+
Option ``1`` is the maximally safe one in terms of protecting users from bugs, and would
119+
also be consistent with the current behaviour of nullable dtypes. It would also be simple to teach:
120+
"if you try to set an element of a ``Series`` to a new value, then that value must be compatible
121+
with the Series' dtype, otherwise it will raise" is easy to understand. If we make an exception for
122+
``int`` to ``float`` (and presumably also for ``interval[int]``, ``interval[float]``), then the rule
123+
starts to become confusing.
124+
125+
## Usage and Impact
126+
127+
This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs.
128+
129+
Unfortunately, it would also risk annoy users who might have been intentionally upcasting.
130+
131+
Given that users can get around this as simply as with an ``.astype({'my_column': float})`` call,
132+
I think it would be more beneficial to the community at large to err on the side of strictness.
133+
134+
## Timeline
135+
136+
Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0.
137+
138+
### PDEP History
139+
140+
- 23 December 2022: Initial draft

0 commit comments

Comments
 (0)