Description
Introduction
The `pandas.read_*` methods and constructors are awesome at assigning the most specific dtype that can hold every value of a column. But that functionality is lacking for DataFrames created by other methods (`stack` and `unstack` are prime examples).
There has been a lot of discussion about dtypes here (ref. #9216, #5902 and especially #9589), and I understand it is a well-rehearsed topic with no general consensus. An unfortunate result of those discussions was the deprecation of the `.convert_objects` method for being too forceful. However, the undercurrent in those discussions (IMHO) points to, and my needs often require, a DataFrame and Series method which intelligently assigns the least general dtype based on the data.
The method could optionally take a list of dtypes, or a dictionary mapping column names to dtypes, to assign user-specified dtypes. Note that I am proposing this in addition to the existing `to_*` methods. The following example illustrates the problem:
```python
In [1]: import numpy as np
   ...: import pandas as pd
   ...: df = pd.DataFrame({'c1': list('AAABBBCCC'),
   ...:                    'c2': list('abcdefghi'),
   ...:                    'c3': np.random.randn(9),
   ...:                    'c4': np.arange(9)})
   ...: df.dtypes
Out[1]:
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [2]: df = df.stack().unstack()
   ...: df.dtypes
Out[2]:
c1    object
c2    object
c3    object
c4    object
dtype: object
```
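For completeness, the inference half of this can be approximated today, although clumsily, by re-parsing every column. A minimal sketch, relying on `pd.to_numeric` with `errors='ignore'` (which hands a column back unchanged when parsing fails):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('AAABBBCCC'),
                   'c2': list('abcdefghi'),
                   'c3': np.random.randn(9),
                   'c4': np.arange(9)})
df = df.stack().unstack()               # every column is now object

# Re-parse each column; columns that cannot be parsed as numbers
# come back unchanged because of errors='ignore'.
recovered = df.apply(pd.to_numeric, errors='ignore')
print(recovered.dtypes)                 # c1/c2 object, c3 float64, c4 int64
```

This recovers the numeric columns, but it says nothing about forcing user-specified dtypes, which is the other half of the proposal.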
Expected Output
Define a method `.set_dtypes` which does the following:

- Either takes a boolean keyword argument `infer` to infer and reset each column's dtype to the least general dtype such that no values are lost.
- Or takes a list or dictionary of dtypes to force each column into user-specified dtypes, with an optional `errors` keyword argument to handle casting errors.

As illustrated below:
```python
In [3]: df.set_dtypes(infer=True).dtypes
Out[3]:
c1     object
c2     object
c3    float64
c4      int64
dtype: object

In [4]: df.set_dtypes(types=[np.int64]*4, errors='coerce').dtypes  # coerced NaNs force c1/c2 to float64
Out[4]:
c1    float64
c2    float64
c3      int64
c4      int64
dtype: object
```
```python
In [5]: df.set_dtypes(types=[np.int64]*4, errors='coerce')  # note the loss of data
Out[5]:
    c1   c2  c3  c4
0  NaN  NaN   1   0
1  NaN  NaN   1   1
2  NaN  NaN   0   2
3  NaN  NaN   0   3
4  NaN  NaN   0   4
5  NaN  NaN   0   5
6  NaN  NaN   2   6
7  NaN  NaN   0   7
8  NaN  NaN   1   8

In [6]: df.set_dtypes(types=[np.int64]*4, errors='ignore').dtypes
Out[6]:
c1    object
c2    object
c3    object
c4     int64
dtype: object
```
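For what it's worth, here is a rough sketch of how such a method might behave, built only on existing casting tools. Everything about `set_dtypes` below is hypothetical (it is not a pandas API), and the rule that `errors='ignore'` also skips lossy casts is my reading of Out[6], not an established semantic:

```python
import numpy as np
import pandas as pd

def set_dtypes(df, types=None, infer=False, errors='raise'):
    """Hypothetical sketch of the proposed method.

    types  : list of dtypes (one per column) or dict of {column: dtype}
    infer  : if True, infer the least general dtype for each column
    errors : 'raise', 'coerce' (failed values become NaN), or
             'ignore' (leave a column unchanged on failure or data loss)
    """
    out = df.copy()
    if infer:
        # Best-effort numeric parsing; unparseable columns are
        # handed back unchanged by errors='ignore'.
        return out.apply(pd.to_numeric, errors='ignore')
    if isinstance(types, list):
        types = dict(zip(out.columns, types))
    for col, dtype in (types or {}).items():
        try:
            casted = out[col].astype(dtype)
            if errors == 'ignore' and not (casted == out[col]).all():
                continue                 # lossy cast: skip this column
            out[col] = casted
        except (ValueError, TypeError):
            if errors == 'raise':
                raise
            if errors == 'coerce':
                # Failed values become NaN, so the column falls back
                # to float64 (int64 cannot hold NaN).
                out[col] = pd.to_numeric(out[col], errors='coerce')
            # under 'ignore', leave the column untouched
    return out
```

With `df` from In [2] above, `set_dtypes(df, infer=True)` reproduces Out[3], and `set_dtypes(df, types=[np.int64]*4, errors='coerce')` reproduces Out[4] and Out[5].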
Additional Notes
I understand that date and time types will be a little difficult to infer. However, following the logic powering `pandas.read_*`, date and time types would not be automatically inferred, but explicitly passed by the user.
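For instance, with today's tools a datetime conversion is already such an explicit request, much like passing `parse_dates` to `pandas.read_csv`; a minimal illustration:

```python
import pandas as pd

s = pd.Series(['2015-01-01', '2015-02-01', '2015-03-01'])
# Nothing is guessed here: the user explicitly asks for datetimes.
print(pd.to_datetime(s).dtype)    # datetime64[ns]
```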
It would be a one-size-fits-all solution if users were also allowed to pass `True` and `False`, in addition to dtypes, when specifying dtypes per column: `True` would indicate "infer automatically (set the best dtype)", while `False` would indicate "exclude this column from conversion", as sketched below.
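A purely illustrative spelling of that convention using existing primitives (the `spec` mapping and its `True`/`False` shorthand are hypothetical, not a pandas API):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': list('AAABBBCCC'),
                   'c2': list('abcdefghi'),
                   'c3': np.random.randn(9),
                   'c4': np.arange(9)}).stack().unstack()

# Per-column specification: True = infer, False = skip, dtype = force.
spec = {'c1': False, 'c2': False, 'c3': True, 'c4': np.int64}

for col, how in spec.items():
    if how is False:
        continue                                   # leave the column untouched
    elif how is True:
        df[col] = pd.to_numeric(df[col], errors='ignore')  # infer best dtype
    else:
        df[col] = df[col].astype(how)              # force the requested dtype

print(df.dtypes)    # c1 object, c2 object, c3 float64, c4 int64
```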