|
| 1 | +# PDEP-6: Ban upcasting in setitem-like operations |
| 2 | + |
| 3 | +- Created: 23 December 2022 |
| 4 | +- Status: Draft |
| 5 | +- Discussion: [#50402](https://github.com/pandas-dev/pandas/pull/50402) |
| 6 | +- Author: [Marco Gorelli](https://github.com/MarcoGorelli) ([original issue](https://github.com/pandas-dev/pandas/issues/39584) by [Joris Van den Bossche](https://github.com/jorisvandenbossche)) |
| 7 | +- Revision: 1 |
| 8 | + |
| 9 | +## Abstract |
| 10 | + |
| 11 | +The suggestion is that setitem-like operations would |
| 12 | +not change a ``Series``' dtype. |
| 13 | + |
| 14 | +Current behaviour: |
| 15 | +```python |
| 16 | +In [1]: ser = pd.Series([1, 2, 3], dtype='int64') |
| 17 | + |
| 18 | +In [2]: ser[2] = 'potage' |
| 19 | + |
| 20 | +In [3]: ser # dtype changed to 'object'! |
| 21 | +Out[3]: |
| 22 | +0 1 |
| 23 | +1 2 |
| 24 | +2 potage |
| 25 | +dtype: object |
| 26 | +``` |
| 27 | + |
| 28 | +Suggested behaviour: |
| 29 | + |
| 30 | +```python |
| 31 | +In [1]: ser = pd.Series([1, 2, 3]) |
| 32 | + |
| 33 | +In [2]: ser[2] = 'potage' # raises! |
| 34 | +--------------------------------------------------------------------------- |
| 35 | +TypeError: Invalid value 'potage' for dtype int64 |
| 36 | +``` |
| 37 | + |
| 38 | +## Motivation and Scope |
| 39 | + |
| 40 | +Currently, pandas is extremely flexible in handling different dtypes. |
| 41 | +However, this can potentially hide bugs, break user expectations, and unnecessarily copy data. |
| 42 | + |
| 43 | +An example of it hiding a bug is: |
| 44 | +```python |
| 45 | +In [9]: ser = pd.Series(pd.date_range('2000', periods=3)) |
| 46 | + |
| 47 | +In [10]: ser[2] = '2000-01-04' # works, is converted to datetime64 |
| 48 | + |
| 49 | +In [11]: ser[2] = '2000-01-04x' # almost certainly a typo - but pandas doesn't error, it upcasts to object |
| 50 | +``` |
| 51 | + |
| 52 | +The scope of this PDEP is limited to setitem-like operations which would operate inplace, such as: |
| 53 | +- ``ser[0] == 2``; |
| 54 | +- ``ser.fillna(0, inplace=True)``; |
| 55 | +- ``ser.where(ser.isna(), 0, inplace=True)`` |
| 56 | + |
| 57 | +There may be more. What is explicitly excluded from this PDEP is any operation would have no change |
| 58 | +of operating inplace to begin with, such as: |
| 59 | +- ``ser.diff()``; |
| 60 | +- ``pd.concat([ser, other])``; |
| 61 | +- ``ser.mean()``. |
| 62 | + |
| 63 | +These would keep being allowed to change Series' dtypes. |
| 64 | + |
| 65 | +## Detailed description |
| 66 | + |
| 67 | +Concretely, the suggestion is: |
| 68 | +- if a ``Series`` is of a given dtype, then a ``setitem``-like operation should not change its dtype; |
| 69 | +- if a ``setitem``-like operation would previously have changed a ``Series``' dtype, it would now raise. |
| 70 | + |
| 71 | +For a start, this would involve: |
| 72 | + |
| 73 | +1. changing ``Block.setitem`` such that it doesn't have an ``except`` block in |
| 74 | + |
| 75 | + ```python |
| 76 | + value = extract_array(value, extract_numpy=True) |
| 77 | + try: |
| 78 | + casted = np_can_hold_element(values.dtype, value) |
| 79 | + except LossySetitemError: |
| 80 | + # current dtype cannot store value, coerce to common dtype |
| 81 | + nb = self.coerce_to_target_dtype(value) |
| 82 | + return nb.setitem(indexer, value) |
| 83 | + else: |
| 84 | + ``` |
| 85 | + |
| 86 | +2. making a similar change in ``Block.where``, ``EABlock.setitem``, ``EABlock.where``, and probably more places. |
| 87 | + |
| 88 | +The above would already require several hundreds of tests to be adjusted. |
| 89 | + |
| 90 | +### Ban upcasting altogether, or just upcasting to ``object``? |
| 91 | + |
| 92 | +The trickiest part of this proposal concerns what to do when setting a float in an integer column: |
| 93 | + |
| 94 | +```python |
| 95 | +In [1]: ser = pd.Series([1, 2, 3]) |
| 96 | + |
| 97 | +In [2]: ser[0] = 1.5 |
| 98 | +``` |
| 99 | + |
| 100 | +Possibly options could be: |
| 101 | +1. just raise; |
| 102 | +2. convert the float value to ``int``, preserve the Series' dtype; |
| 103 | +3. upcast to ``float``, even if upcasting in setitem-like is banned for other conversions. |
| 104 | + |
| 105 | +Let us compare with what other libraries do: |
| 106 | +- ``numpy``: option 2 |
| 107 | +- ``cudf``: option 2 |
| 108 | +- ``polars``: option 2 |
| 109 | +- ``R data.frame``: option 3 |
| 110 | +- ``pandas`` (nullable dtype): option 1 |
| 111 | + |
| 112 | +If the objective of this PDEP is to prevent bugs, then option 2 is also not desirable: |
| 113 | +someone might set ``1.5`` and later be surprised to learn that they actually set ``1``. |
| 114 | + |
| 115 | +Option ``3`` would be inconsistent with the nullable dtypes' behaviour, would add complexity |
| 116 | +to the codebase and to tests, and would be confusing to teach. |
| 117 | + |
| 118 | +Option ``1`` is the maximally safe one in terms of protecting users from bugs, and would |
| 119 | +also be consistent with the current behaviour of nullable dtypes. It would also be simple to teach: |
| 120 | +"if you try to set an element of a ``Series`` to a new value, then that value must be compatible |
| 121 | +with the Series' dtype, otherwise it will raise" is easy to understand. If we make an exception for |
| 122 | +``int`` to ``float`` (and presumably also for ``interval[int]``, ``interval[float]``), then the rule |
| 123 | +starts to become confusing. |
| 124 | + |
| 125 | +## Usage and Impact |
| 126 | + |
| 127 | +This would make pandas stricter, so there should not be any risk of introducing bugs. If anything, this would help prevent bugs. |
| 128 | + |
| 129 | +Unfortunately, it would also risk annoy users who might have been intentionally upcasting. |
| 130 | + |
| 131 | +Given that users can get around this as simply as with an ``.astype({'my_column': float})`` call, |
| 132 | +I think it would be more beneficial to the community at large to err on the side of strictness. |
| 133 | + |
| 134 | +## Timeline |
| 135 | + |
| 136 | +Deprecate sometime in the 2.x releases (after 2.0.0 has already been released), and enforce in 3.0.0. |
| 137 | + |
| 138 | +### PDEP History |
| 139 | + |
| 140 | +- 23 December 2022: Initial draft |
0 commit comments