ENH: standardize fill_value behavior across the API #15533

Open
@ResidentMario

Problem

While working on the PR for #15486, I found that type validation for the fill_value parameter, which appears on a large number of pandas API methods, is done ad hoc. As a result, the set of accepted inputs varies widely from method to method. I think it would be good to standardize this so that all of these methods share one behavior: the one currently used by fillna.

Implementation Details

Part of the point of providing a fill_value is to avoid the slow type conversion that would otherwise be needed (via .fillna().astype()). Accepting fill values of other types is nevertheless a useful convenience. The implementation would roughly be:

Before executing the rest of the method body, check whether the fill_value is valid (using a centralized maybe_fill method). If it is not, raise a ValueError. If it is, check whether incorporating the fill_value would result in an upcast of the column dtype. If it would not, follow a code path where the column never gets type-converted. If it would, follow that same code path, then do something like a fillna operation at the end before returning.
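
A rough sketch of what such a centralized check could look like, to make the two questions (is the fill valid, and would it upcast) concrete. Everything here is an assumption for illustration: check_fill_value is a hypothetical name, not an existing pandas function, and the sketch only covers bool, int, float and object columns with a simple bool < int < float < object promotion order. Datetime, categorical and sparse columns are deliberately left out.

```python
import numpy as np

# Promotion order assumed for this sketch: bool < int < float < object.
_KIND_RANK = {"b": 0, "i": 1, "u": 1, "f": 2, "O": 3}
_RANK_DTYPE = {
    0: np.dtype(bool),
    1: np.dtype("int64"),
    2: np.dtype("float64"),
    3: np.dtype(object),
}


def check_fill_value(fill_value, dtype):
    """Hypothetical helper (not a pandas API).

    Returns (result_dtype, needs_upcast) for a valid scalar fill and
    raises ValueError for anything this sketch cannot coerce.
    """
    dtype = np.dtype(dtype)
    if dtype.kind not in _KIND_RANK:
        raise ValueError(f"dtype {dtype} is outside the scope of this sketch")

    # bool is tested before int because bool subclasses int in Python.
    if isinstance(fill_value, (bool, np.bool_)):
        fill_rank = 0
    elif isinstance(fill_value, (int, np.integer)):
        fill_rank = 1
    elif isinstance(fill_value, (float, np.floating)):
        fill_rank = 2
    elif isinstance(fill_value, str):
        fill_rank = 3
    else:
        raise ValueError(
            f"invalid fill_value of type {type(fill_value).__name__}"
        )

    # Upcast only when the fill outranks the column's current dtype.
    if fill_rank > _KIND_RANK[dtype.kind]:
        return _RANK_DTYPE[fill_rank], True
    return dtype, False
```

A method taking fill_value could call this first and branch on the second element of the result: take the no-conversion path when it is False, and apply the fill as a final step when it is True.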

Target Implementation

The same as what fillna currently does, which is as follows.

Invalid:

  • A categorical fill with a value not in the categories raises a ValueError.
  • Sparse matrices refuse upcasting.
  • Passing an object, list, or other non-coercible "thing" as a fill.
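
The categorical and list cases can be checked against fillna with a small script like the one below. This is only a sketch: fill_is_rejected is a helper name made up for the example, and it catches both TypeError and ValueError because the exception class raised for these cases appears to depend on the pandas version.

```python
import pandas as pd


def fill_is_rejected(series, fill_value):
    """Return True if fillna refuses this fill value for the Series."""
    try:
        series.fillna(fill_value)
    except (TypeError, ValueError):
        return True
    return False


cat = pd.Series(pd.Categorical(["a", None], categories=["a", "b"]))
num = pd.Series([1.0, None])

# "c" is not one of the declared categories, "b" is.
unknown_category_rejected = fill_is_rejected(cat, "c")
known_category_rejected = fill_is_rejected(cat, "b")

# A list is not a scalar fill.
list_rejected = fill_is_rejected(num, [0.0, 0.0])
```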

Valid, upcast:

  • int fill will promote bool dtypes to int.
  • float fill will promote int and bool dtypes to float (this is what happens with np.nan already).
  • object (str) fill will promote lesser dtypes to object.
  • int, float, and bool fill to a datetime dtype will be treated as a UNIX-like timestamp and promoted to datetime.
  • object fill will promote datetime dtype to object.

Valid, no-cast:

  • Everything else.
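
To illustrate the upcast and no-cast cases, here is a small example using reindex, which introduces a missing slot and accepts a fill_value. It is meant as an illustration of the dtype outcomes described above rather than a statement about every method; the variable names are just for the example.

```python
import pandas as pd

s = pd.Series([1, 2, 3])      # int64 column
new_index = [0, 1, 2, 3]      # label 3 has no value yet

# No fill_value: the hole becomes NaN, which forces a float column.
no_fill = s.reindex(new_index)

# int fill into an int column: no cast is needed.
int_fill = s.reindex(new_index, fill_value=0)

# float fill into an int column: promoted to float.
float_fill = s.reindex(new_index, fill_value=0.5)

# str fill into an int column: promoted to object.
str_fill = s.reindex(new_index, fill_value="missing")

for name, result in [
    ("no_fill", no_fill),
    ("int_fill", int_fill),
    ("float_fill", float_fill),
    ("str_fill", str_fill),
]:
    print(name, result.dtype)
```

The int_fill case is the one the fill_value parameter exists for: the column stays int64 throughout, with no round trip through float.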

Current Implementation

...is ad-hoc. The following are the methods which currently provide a fill_value input, as well as where they deviate from the model above.

  • Series.combine, DataFrame.combine, Series.to_sparse: These are unique usages of fill_value which aren't compatible with the rest of them.

  • Series.unstack, DataFrame.unstack: any fill_value is allowed. You can pass an object if you'd like, or even another DataFrame (yo dawg...).

  • DataFrame.align: Any fill_value is allowed.

  • DataFrame.reindex_axis: Lists and dicts are allowed, objects are not.

  • DataFrame.asfreq, Series.asfreq: any fill_value is allowed.

  • pd.pivot_table: ...

  • Series.add, DataFrame.add: ...

  • Series.subtract, DataFrame.subtract: ...

  • Probably others, there's a lot of these.

Metadata

    Labels

      • API - Consistency: Internal Consistency of API/Behavior
      • Dtype Conversions: Unexpected or buggy dtype conversions
      • Enhancement
      • Error Reporting: Incorrect or improved errors from pandas
      • Missing-data: np.nan, pd.NaT, pd.NA, dropna, isnull, interpolate
