
[ENH] Need a best dtype method #14400

Closed

Description

@dragonator4

Introduction

The pandas.read_* methods and the constructors do a good job of assigning each column the most specific dtype that can hold all of its values. That functionality is missing, however, for DataFrames produced by other operations (stack and unstack are prime examples), which often fall back to object.

There has been a lot of discussion about dtypes here (ref. #9216, #5902 and especially #9589), and I understand it is a well-rehearsed topic with no general consensus. An unfortunate result of those discussions was the deprecation of the .convert_objects method for being too forceful. However, the undercurrent in those discussions (IMHO) points to, and my needs often call for, a DataFrame and Series method that intelligently assigns the narrowest suitable dtype based on the data.

The method could optionally take a list of dtypes, or a dictionary mapping column names to dtypes, to apply user-specified dtypes. Note that I am proposing this in addition to the existing to_* methods. The following example illustrates the problem:

In [1]: import numpy as np
        import pandas as pd
        df = pd.DataFrame({'c1' : list('AAABBBCCC'),
                           'c2' : list('abcdefghi'),
                           'c3' : np.random.randn(9),
                           'c4' : np.arange(9)})
        df.dtypes
Out[1]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [2]: df = df.stack().unstack()
        df.dtypes
Out[2]: c1     object
        c2     object
        c3     object
        c4     object
        dtype: object
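
For reference, something close to the desired infer behaviour can be cobbled together today by pushing each column through pd.to_numeric and keeping the original column when conversion fails. This is only a rough sketch using existing tools (the helper name infer_best_dtypes is made up here), not the proposed API:

import pandas as pd

def infer_best_dtypes(df):
    # Try to re-parse each column as numeric; keep the column as-is when
    # it contains values that cannot be converted (e.g. strings).
    converted = {}
    for name in df.columns:
        try:
            converted[name] = pd.to_numeric(df[name])
        except (ValueError, TypeError):
            converted[name] = df[name]
    return pd.DataFrame(converted, index=df.index)

infer_best_dtypes(df).dtypes   # c3 comes back as float64, c4 as int64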

Expected Output

Define a method .set_dtypes which does the following:

  1. Either takes a boolean keyword argument infer to re-infer each column's dtype, resetting it to the narrowest dtype that does not lose values.
  2. Or takes a list or dictionary of dtypes to force each column into a user-specified dtype, with an optional errors keyword argument to control how casting errors are handled.

As illustrated below:

In [3]: df.set_dtypes(infer=True).dtypes
Out[3]: c1     object
        c2     object
        c3    float64
        c4      int64
        dtype: object

In [4]: df.set_dtypes(types=[np.int64]*4, errors='coerce').dtypes
Out[4]: c1     int64
        c2     int64
        c3     int64
        c4     int64
        dtype: object

In [5]: df.set_dtypes(types=[np.int64]*4, errors='coerce') # Note loss of data
Out[5]:     c1  c2  c3  c4
        0   NaN NaN 1   0
        1   NaN NaN 1   1
        2   NaN NaN 0   2
        3   NaN NaN 0   3
        4   NaN NaN 0   4
        5   NaN NaN 0   5
        6   NaN NaN 2   6
        7   NaN NaN 0   7
        8   NaN NaN 1   8

In [6]: df.set_dtypes(types=[np.int64]*4, errors='ignore').dtypes
Out[6]: c1     object
        c2     object
        c3     object
        c4      int64
        dtype: object
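
To make the intent of the types= and errors= keywords concrete, here is a rough sketch of the forced-cast mode built from astype and pd.to_numeric. The name set_dtypes and its keywords mirror the proposal and do not exist in pandas, and the sketch will not reproduce the idealised outputs above exactly (for instance, a coerced column containing NaN comes out as float64, since int64 cannot hold NaN):

import numpy as np
import pandas as pd

def set_dtypes(df, types, errors='raise'):
    # Sketch of the proposed forced-cast mode; not an existing pandas method.
    if not isinstance(types, dict):
        types = dict(zip(df.columns, types))   # positional list of dtypes
    out = df.copy()
    for name, dtype in types.items():
        try:
            out[name] = out[name].astype(dtype)
        except (ValueError, TypeError):
            if errors == 'coerce':
                # values that cannot be cast become NaN
                out[name] = pd.to_numeric(out[name], errors='coerce')
            elif errors == 'raise':
                raise
            # errors == 'ignore': leave the column unchanged
    return out

set_dtypes(df, types=[np.int64] * 4, errors='coerce').dtypes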

Additional Notes

I understand that date and time types will be a little difficult to infer. However, following the logic of pandas.read_*, date and time types need not be inferred automatically; they can be passed explicitly by the user.

It would be a one-size-fits-all solution if, when specifying dtypes per column, users were allowed to pass True or False in addition to a dtype. True would mean infer the dtype automatically (set the best dtype), while False would mean exclude the column from the conversion.
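
Under that scheme a single call could mix all three behaviours per column. Purely illustrative, since set_dtypes does not exist:

# Hypothetical per-column spec under the proposed API:
#   True  -> infer the best dtype for the column
#   False -> exclude the column from conversion
#   dtype -> force that dtype (subject to the errors= policy)
df.set_dtypes(types={'c1': False, 'c2': False, 'c3': True, 'c4': np.int64}).dtypes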
