
API/DEPR: int downcasting in DataFrame.where #44597

Closed
@jbrockmendel


Block.where has special downcasting logic that splits blocks differently from any other Block method. I would like to deprecate and eventually remove this bespoke logic.

The relevant logic is only reached, AFAICT, when all of the following hold: we have an integer dtype other than int64, the passed other is an integer too big for that dtype, and the passed cond has all-True columns.

(Identifying the affected behavior is difficult, in part because it relies on can_hold_element incorrectly returning True in these cases.)
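The conditions above can be sketched as a rough check. This is only an illustration: `may_hit_where_downcast` is a hypothetical helper, not part of pandas, it assumes `other` is a plain Python integer, and it tests for at least one all-True column anywhere in `cond` rather than per block.

```python
import numpy as np
import pandas as pd


def may_hit_where_downcast(df, cond, other):
    """Hypothetical helper: rough test for the conditions described above."""
    cond = np.asarray(cond, dtype=bool)
    # Assumption: "all-True columns" means at least one column of cond is all True.
    has_all_true_column = bool(cond.all(axis=0).any())
    for dtype in df.dtypes:
        # Only signed integer dtypes other than int64 are considered.
        if dtype.kind != "i" or dtype == np.int64:
            continue
        info = np.iinfo(dtype)
        too_big = not (info.min <= other <= info.max)
        if too_big and has_all_true_column:
            return True
    return False


arr = np.arange(6).astype(np.int16).reshape(3, 2)
df = pd.DataFrame(arr)
mask = np.zeros(arr.shape, dtype=bool)
mask[:, 0] = True

hits = may_hit_where_downcast(df, mask, 2**17)
```

Here `2**17` exceeds the int16 maximum of 32767 and the first column of `mask` is all True, so the check reports that the special path could be reached.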

import numpy as np
import pandas as pd

arr = np.arange(6).astype(np.int16).reshape(3, 2)
df = pd.DataFrame(arr)

mask = np.zeros(arr.shape, dtype=bool)
mask[:, 0] = True

res = df.where(mask, 2**17)

>>> res.dtypes
0    int16
1    int32
dtype: object

The simplest thing to do would be to not do any downcasting in these cases, in which case we would end up with all-int32. The next simplest would be to downcast column-wise, which would give the same end result but with less consolidation.
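To make the two options concrete, here is a NumPy-only sketch on the same data. It is one possible reading of the proposal, not pandas' implementation: `downcast_column` is a hypothetical helper, and the explicit upcast to int32 is an assumption made so the example does not depend on a particular pandas or NumPy version.

```python
import numpy as np

arr = np.arange(6).astype(np.int16).reshape(3, 2)
mask = np.zeros(arr.shape, dtype=bool)
mask[:, 0] = True

# Option 1: upcast once to a dtype that can hold `other`, then do no downcasting.
wide = np.where(mask, arr.astype(np.int32), np.int32(2**17))


# Option 2: after the upcast, try to downcast each column independently.
def downcast_column(col, target=np.int16):
    """Hypothetical helper: cast back to `target` only if every value fits."""
    info = np.iinfo(target)
    if col.min() >= info.min and col.max() <= info.max:
        return col.astype(target)
    return col


columns = [downcast_column(wide[:, i]) for i in range(wide.shape[1])]
column_dtypes = [c.dtype for c in columns]
```

With option 1 every column stays int32. With option 2, under this reading, the untouched first column goes back to int16 while the replaced second column stays int32.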

We do not have any test cases that fail if I disable this downcasting (after I fix a problem with an expressions.where call that the downcasting somehow makes irrelevant). This makes me think the current behavior is not intentional, or at least not a priority.

Any objection to deprecating the integer downcasting entirely?
