Using ProfileReport on dataframes with multi-index columns causes a "TypeError: Setting a MultiIndex dtype to anything other than object is not supported." error because the code tries to cast the MultiIndex to a string dtype.
The code revealed in the exception trace causing this is df.columns = df.columns.astype("str"):
~/.local/lib/python3.9/site-packages/pandas_profiling/model/pandas/dataframe_pandas.py in pandas_preprocess(config, df)
42 # Ensure that columns are strings
---> 43 df.columns = df.columns.astype("str")
44 return df
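The failing cast can be reproduced without pandas-profiling at all. The snippet below is a minimal sketch; the two-level column labels are made up for illustration and are not the test dataframe from this report.

```python
import pandas as pd

# Hypothetical two-level column index, only for illustration.
columns = pd.MultiIndex.from_tuples([("a", "x"), ("a", "y")])
df = pd.DataFrame([[1, 2], [3, 4]], columns=columns)

try:
    # Same cast that pandas_preprocess performs on df.columns.
    df.columns.astype("str")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```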
pip: If you are using pip, run pip freeze in your environment and report the results. The list of packages can be rather long, you can use the snippet below to collapse the output.
Sure, I was tempted, but I'm very unfamiliar with the pandas-profiling code base and only a novice pandas user, which is likely why my suggested fix is actually a poor idea now that I've properly considered it. Setting df.columns = ['.'.join(c) for c in df.columns] would modify the original dataframe passed to ProfileReport(), because dataframes are mutable objects and are passed to functions by reference.
I think df.columns = df.columns.astype("str") luckily seldom has a bad side effect because non-multi-index columns are usually of type str anyhow, but I suspect it's a sign that logic elsewhere in the module's code cannot deal with or report on multi-index dataframes. df.columns = df.columns.astype("str") is itself also bad practice, even if benign: users would not expect a reporting-type function to mutate the object given to it.
Somewhere, there will need to be the logic to convert non-string column index values into strings for reporting or whatever in a way that does not affect the source dataframe.
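One way that logic might look is sketched below. The helper name with_string_columns and the "." separator are assumptions for illustration, not part of pandas-profiling. It relabels a shallow copy, so the caller's dataframe keeps its original columns and the data is not duplicated up front.

```python
import pandas as pd


def with_string_columns(df: pd.DataFrame, sep: str = ".") -> pd.DataFrame:
    """Return a frame with string column labels; the input keeps its labels.

    Hypothetical helper, not part of pandas-profiling.
    """

    def to_label(col):
        # MultiIndex labels iterate as tuples; join their parts into one string.
        if isinstance(col, tuple):
            return sep.join(str(part) for part in col)
        return str(col)

    out = df.copy(deep=False)  # new DataFrame object, no eager copy of the data
    out.columns = [to_label(c) for c in df.columns]
    return out


src = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("a", "x"), ("a", "y")]),
)
report_df = with_string_columns(src)

print(list(report_df.columns))
print(isinstance(src.columns, pd.MultiIndex))
```

Because only the shallow copy is relabelled, this avoids both the mutation and a full in-memory duplicate of the source dataframe.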
I confirmed my understanding that it likely would be a bad side effect. E.g. run this and notice how the column names are mutated by the function.
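import pandas as pd


def fiddle_col(df):
    # Assigning to df.columns changes the caller's dataframe in place.
    df.columns = [f'mod_{c}' for c in df.columns]


src_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(src_df)       # columns: a, b
fiddle_col(src_df)
print(src_df)       # columns: mod_a, mod_b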
Indeed, having no side effects is better in many cases. The other side of the coin is that keeping a copy of the dataframe in memory is often not feasible. Currently, the DataFrame is not guaranteed to be left unmutated. However, the user can simply pass a copy of the dataframe to profile.
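As a small illustration of passing a copy, the sketch below reuses the fiddle_col example from the earlier comment as a stand-in for any function that renames columns in place. The mutation lands on the copy and the source keeps its labels.

```python
import pandas as pd


def fiddle_col(df):
    # Stand-in for a function that renames columns in place.
    df.columns = [f"mod_{c}" for c in df.columns]


src_df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Hand the function a copy; src_df itself is never touched.
fiddle_col(src_df.copy())

print(list(src_df.columns))
```

The trade-off mentioned above still applies: a default copy() duplicates the data, which may not be affordable for large dataframes.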
Describe the bug
Using ProfileReport on dataframes with multi-index columns causes a "TypeError: Setting a MultiIndex dtype to anything other than object is not supported." error because the code tries to cast the MultiIndex to a string dtype.
The code revealed in the exception trace causing this is df.columns = df.columns.astype("str").
Perhaps it could be fixed with something like.
Possibly add/log a warning that the multi-index columns get flattened.
To Reproduce
I often use the following dict/dataframe to test the extent of how well something can handle various types and none/empty values.
Code:
Version information:
pip: If you are using pip, run pip freeze in your environment and report the results. The list of packages can be rather long, you can use the snippet below to collapse the output.
Additional context
Full error stack trace message after attempting to output with df_profile:
Work around
Flatten multi-index and then it works fine.
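A minimal sketch of that workaround follows, assuming a made-up two-level column index and "_" as the joining character. The ProfileReport call is left as a comment so the snippet runs without pandas-profiling installed.

```python
import pandas as pd

# Hypothetical multi-index dataframe, only for illustration.
df = pd.DataFrame(
    [[1, 2], [3, 4]],
    columns=pd.MultiIndex.from_tuples([("a", "x"), ("a", "y")]),
)

# Flatten on a copy so the original multi-index columns are preserved.
flat_df = df.copy()
flat_df.columns = ["_".join(map(str, col)) for col in flat_df.columns]

print(list(flat_df.columns))

# The flattened frame can then be profiled, e.g.:
# from pandas_profiling import ProfileReport
# df_profile = ProfileReport(flat_df)
```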