Unable to handle Multi-Index / multi-level columns? #838

JPvRiel · 2021-10-03T21:00:13Z

Describe the bug

Using ProfileReport on dataframes with multi-index columns causes a "TypeError: Setting a MultiIndex dtype to anything other than object is not supported" error, because the code tries to cast the multi-index to a string dtype.

The exception trace shows that the offending line is df.columns = df.columns.astype("str"):

~/.local/lib/python3.9/site-packages/pandas_profiling/model/pandas/dataframe_pandas.py in pandas_preprocess(config, df)
     42     # Ensure that columns are strings
---> 43     df.columns = df.columns.astype("str")
     44     return df
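
The failure can be reproduced without pandas-profiling at all. The following is a minimal sketch, assuming only pandas is installed; the two column labels are illustrative and the variable names are not taken from the library:

```python
import pandas as pd

# Two-level column labels, as in the bug report (illustrative values).
columns = pd.MultiIndex.from_tuples([("simple", "int"), ("simple", "float")])

try:
    # The same cast that pandas_preprocess applies to df.columns.
    columns.astype("str")
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

If the cast is rejected as described above, error_message holds the "Setting a MultiIndex dtype ..." text from the trace.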

Perhaps it could be fixed with something like:

if isinstance(df.columns, pd.MultiIndex):
    # Join the levels of each column label; str() guards against non-string levels
    df.columns = ['.'.join(map(str, c)) for c in df.columns]
df.columns = df.columns.astype("str")

Possibly also log a warning that the multi-index columns get flattened.
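
A rough sketch of the flatten-and-warn idea follows. The helper name flatten_column_labels and its sep parameter are hypothetical, not part of pandas-profiling, and the helper returns new labels rather than assigning them to a dataframe:

```python
import warnings

import pandas as pd


def flatten_column_labels(columns: pd.Index, sep: str = ".") -> pd.Index:
    """Return string column labels, joining MultiIndex levels with `sep`.

    Hypothetical helper for illustration; not pandas-profiling API.
    """
    if isinstance(columns, pd.MultiIndex):
        warnings.warn("Multi-index columns were flattened.", UserWarning)
        # Each label is a tuple with one entry per level; cast every entry
        # to str so that non-string levels (e.g. ints) can be joined too.
        return pd.Index([sep.join(map(str, label)) for label in columns])
    return columns.astype("str")


cols = pd.MultiIndex.from_tuples([("simple", "int"), ("simple", 1)])
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    flat = flatten_column_labels(cols)

print(list(flat))
print(len(caught))
```

Returning the labels instead of assigning them leaves the decision about whether to touch the caller's dataframe to the calling code.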

To Reproduce

I often use the following dict/dataframe to test how well something handles various types and None/empty values.

Code:

import pandas as pd
import numpy as np
from pandas_profiling import ProfileReport

d = {
    ('simple', 'int'): [1, 2, 3, 4],
    ('simple', 'float'): [0.1, 0.2, 0.3, 0.4],
    ('simple', 'str'): ['one', 'two', 'three', 'four'],
    ('complex', 'obj'): [
        {'k1': 1},
        {'k2': 2},
        {'k3': 3},
        {'k3': 4}
    ],
    ('complex', 'arr_num_sym'): [
        [1.1, 1.2],
        [2.1, 2.2],
        [3.1, 3.2],
        [4.1, 4.2]
    ],
    ('complex', 'arr_obj_asym'): [
        [
            {'1k1': 1.1},
        ],
        [
            {'2k1': 2.1},
            {'2k2': 2.3}
        ],
        [],
        []
    ],
    ('complex', 'mixed'): [
        None,
        False,
        'string',
        [
            None,
            # bools
            True,
            False,
            # strings
            'string',
            '',
            # numbers
            42.7,
            np.NaN,
            # times
            np.datetime64('2021-08-05T18:23:49.705115547+02:00'),
            np.datetime64('NaT'),
            np.timedelta64(1, 'h'),
            np.timedelta64('NaT'),
            # nested objects
            {'k2': 2.1},
            # sequences
            [
                {'n3k1': 3.1},
                {'n3k2': 3.2},
            ],
            # empty sequences
            [],
            [[], []]
        ]
    ],
    ('nothing', 'nan'): [np.NaN, np.NaN, np.NaN, np.NaN],
    ('nothing', 'nat'): [np.datetime64('NaT') for i in range(4)],
    ('nothing', 'null'): [None, None, None, None],
    ('nothing', 'empty_str'): ['', '', '', ''],
    ('nothing', 'empty_arr'): [[], [], [], []],
    ('nothing', 'mixed'): [np.NaN, None, '', []],
}
df = pd.DataFrame(d)
profile = ProfileReport(df, title='Various data types and emptiness')
profile

Version information:

Python version: 3.9.7
Environment: Jupyter Notebook (local) in vscode
pip freeze output:

argon2-cffi==21.1.0
astroid==2.8.0
async-generator==1.10
attrs==21.2.0
backcall==0.2.0
bleach==4.1.0
blessings==1.7
bokeh==2.4.0
bpython==0.21
certifi==2021.5.30
cffi==1.14.6
charset-normalizer==2.0.4
chart-studio==1.1.0
click==8.0.1
cloudpickle==2.0.0
colorlover==0.3.0
cufflinks==0.17.3
curtsies==0.3.5
cwcwidth==0.1.4
cycler==0.10.0
dask==2021.9.1
debugpy==1.4.3
decorator==5.1.0
defusedxml==0.7.1
distributed==2021.9.1
entrypoints==0.3
filelock==3.2.0
fsspec==2021.10.0
greenlet==1.1.1
grpcio==1.41.0
HeapDict==1.0.1
htmlmin==0.1.12
idna==3.2
ImageHash==4.2.1
ipykernel==6.4.1
ipympl==0.7.0
ipython==7.27.0
ipython-genutils==0.2.0
ipywidgets==7.6.4
isort==5.9.3
jedi==0.18.0
Jinja2==3.0.1
joblib==1.0.1
jsonschema==3.2.0
jupyter==1.0.0
jupyter-client==7.0.2
jupyter-console==6.4.0
jupyter-core==4.7.1
jupyterlab-pygments==0.1.2
jupyterlab-widgets==1.0.1
kiwisolver==1.3.2
lazy-object-proxy==1.6.0
locket==0.2.1
MarkupSafe==2.0.1
matplotlib==3.4.3
matplotlib-inline==0.1.3
mccabe==0.6.1
missingno==0.5.0
mistune==0.8.4
modin==0.11.0
msgpack==1.0.2
multimethod==1.6
nbclient==0.5.4
nbconvert==6.1.0
nbformat==5.1.3
nest-asyncio==1.5.1
networkx==2.6.3
notebook==6.4.3
numpy==1.21.2
packaging==21.0
pandas==1.3.3
pandas-profiling==3.1.0
pandocfilters==1.4.3
parso==0.8.2
partd==1.2.0
pexpect==4.8.0
phik==0.12.0
pickleshare==0.7.5
Pillow==8.3.2
pip-autoremove==0.9.1
platformdirs==2.3.0
plotly==5.3.1
prometheus-client==0.11.0
prompt-toolkit==3.0.20
protobuf==3.18.0
psutil==5.8.0
ptyprocess==0.7.0
pycparser==2.20
pydantic==1.8.2
Pygments==2.10.0
pylint==2.11.1
pyparsing==2.4.7
pyrsistent==0.18.0
python-dateutil==2.8.2
pytz==2021.1
PyWavelets==1.1.1
pyxdg==0.27
PyYAML==5.4.1
pyzmq==22.2.1
qtconsole==5.1.1
QtPy==1.11.0
ray==1.6.0
redis==3.5.3
requests==2.26.0
retrying==1.3.3
scikit-learn==1.0
scipy==1.7.1
screeninfo==0.7
seaborn==0.11.2
Send2Trash==1.8.0
six==1.16.0
sklearn==0.0
sortedcontainers==2.4.0
tangled-up-in-unicode==0.1.0
tblib==1.7.0
tenacity==8.0.1
terminado==0.12.1
testpath==0.5.0
threadpoolctl==2.2.0
toml==0.10.2
toolz==0.11.1
tornado==6.1
tqdm==4.62.3
traitlets==5.1.0
typing-extensions==3.10.0.2
urllib3==1.26.6
visions==0.7.4
wcwidth==0.2.5
webencodings==0.5.1
widgetsnbextension==3.5.1
wrapt==1.12.1
zict==2.0.0

Additional context

Full error stack trace after attempting to display the profile:

DispatchError: Function <code object pandas_preprocess at 0x7ff2e95eec90, file "/home/enigma/.local/lib/python3.9/site-packages/pandas_profiling/model/pandas/dataframe_pandas.py", line 17>
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/.local/lib/python3.9/site-packages/multimethod/__init__.py in __call__(self, *args, **kwargs)
    302         try:
--> 303             return func(*args, **kwargs)
    304         except TypeError as ex:

~/.local/lib/python3.9/site-packages/pandas_profiling/model/pandas/dataframe_pandas.py in pandas_preprocess(config, df)
     42     # Ensure that columns are strings
---> 43     df.columns = df.columns.astype("str")
     44     return df

~/.local/lib/python3.9/site-packages/pandas/core/indexes/multi.py in astype(self, dtype, copy)
   3647         elif not is_object_dtype(dtype):
-> 3648             raise TypeError(
   3649                 "Setting a MultiIndex dtype to anything other than object "

TypeError: Setting a MultiIndex dtype to anything other than object is not supported

The above exception was the direct cause of the following exception:

DispatchError                             Traceback (most recent call last)
~/.local/lib/python3.9/site-packages/IPython/core/formatters.py in __call__(self, obj)
    343             method = get_real_method(obj, self.print_method)
    344             if method is not None:
--> 345                 return method()
    346             return None
    347         else:

~/.local/lib/python3.9/site-packages/pandas_profiling/profile_report.py in _repr_html_(self)
    416     def _repr_html_(self) -> None:
    417         """The ipython notebook widgets user interface gets called by the jupyter notebook."""
--> 418         self.to_notebook_iframe()
    419 
    420     def __repr__(self) -> str:

~/.local/lib/python3.9/site-packages/pandas_profiling/profile_report.py in to_notebook_iframe(self)
    396         with warnings.catch_warnings():
    397             warnings.simplefilter("ignore")
--> 398             display(get_notebook_iframe(self.config, self))
    399 
    400     def to_widgets(self) -> None:

~/.local/lib/python3.9/site-packages/pandas_profiling/report/presentation/flavours/widget/notebook.py in get_notebook_iframe(config, profile)
     73         output = get_notebook_iframe_src(config, profile)
     74     elif attribute == IframeAttribute.srcdoc:
---> 75         output = get_notebook_iframe_srcdoc(config, profile)
     76     else:
     77         raise ValueError(

~/.local/lib/python3.9/site-packages/pandas_profiling/report/presentation/flavours/widget/notebook.py in get_notebook_iframe_srcdoc(config, profile)
     27     width = config.notebook.iframe.width
     28     height = config.notebook.iframe.height
---> 29     src = html.escape(profile.to_html())
     30 
     31     iframe = f'<iframe width="{width}" height="{height}" srcdoc="{src}" frameborder="0" allowfullscreen></iframe>'

~/.local/lib/python3.9/site-packages/pandas_profiling/profile_report.py in to_html(self)
    366 
    367         """
--> 368         return self.html
    369 
    370     def to_json(self) -> str:

~/.local/lib/python3.9/site-packages/pandas_profiling/profile_report.py in html(self)
    183     def html(self) -> str:
    184         if self._html is None:
--> 185             self._html = self._render_html()
    186         return self._html
    187 

~/.local/lib/python3.9/site-packages/pandas_profiling/profile_report.py in _render_html(self)
    285         from pandas_profiling.report.presentation.flavours import HTMLReport
    286 
--> 287         report = self.report
    288 
    289         with tqdm(

~/.local/lib/python3.9/site-packages/pandas_profiling/profile_report.py in report(self)
    177     def report(self) -> Root:
    178         if self._report is None:
--> 179             self._report = get_report_structure(self.config, self.description_set)
    180         return self._report
    181 

~/.local/lib/python3.9/site-packages/pandas_profiling/profile_report.py in description_set(self)
    159     def description_set(self) -> Dict[str, Any]:
    160         if self._description_set is None:
--> 161             self._description_set = describe_df(
    162                 self.config,
    163                 self.df,

~/.local/lib/python3.9/site-packages/pandas_profiling/model/describe.py in describe(config, df, summarizer, typeset, sample)
     55 
     56     check_dataframe(df)
---> 57     df = preprocess(config, df)
     58 
     59     number_of_tasks = 5

~/.local/lib/python3.9/site-packages/multimethod/__init__.py in __call__(self, *args, **kwargs)
    303             return func(*args, **kwargs)
    304         except TypeError as ex:
--> 305             raise DispatchError(f"Function {func.__code__}") from ex
    306 
    307     def evaluate(self):

DispatchError: Function <code object pandas_preprocess at 0x7ff2e95eec90, file "/home/enigma/.local/lib/python3.9/site-packages/pandas_profiling/model/pandas/dataframe_pandas.py", line 17>

Work around

Flattening the multi-index columns first makes it work:

df.columns = ['.'.join(c) for c in df.columns]
profile = ProfileReport(df, title='Various data types and emptiness')
profile

sbrugman · 2021-10-03T21:30:39Z

Thanks for the extensive bug report! Since you've already written a fix and test cases, I'd suggest sending in a pull request :)

(You might also be interested in Hacktoberfest, I've added a tag)

JPvRiel · 2021-10-04T19:48:04Z

"I'd suggest sending in a pull request"

Sure, I was tempted, but I'm very unfamiliar with the pandas-profiling code base and only a novice pandas user, which is likely why my suggested fix turns out to be a poor idea now that I've properly considered it. Setting df.columns = ['.'.join(c) for c in df.columns] modifies the original dataframe passed to ProfileReport(), because the function receives a reference to the same mutable dataframe object rather than a copy.

I think df.columns = df.columns.astype("str") luckily seldom has a bad side effect, because non-multi-index column labels are usually of type str anyhow, but I suspect it's a sign of limitations elsewhere in the module's code, which cannot deal with or report on multi-index dataframes. df.columns = df.columns.astype("str") is itself also bad practice, even if benign: users would not expect a reporting function to mutate the object given to it.

Somewhere, there will need to be logic that converts non-string column index values into strings for reporting (or whatever else needs them) in a way that does not affect the source dataframe.

I confirmed my understanding that it would likely be a bad side effect. E.g. run this and notice how the column names are mutated by the function.

import pandas as pd

def fiddle_col(df):
    df.columns = [f'mod_{c}' for c in df.columns]

src_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(src_df)
fiddle_col(src_df)
print(src_df)
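
For contrast, a non-mutating variant of the function above could return a relabelled dataframe instead of assigning to df.columns. This is only a sketch: the name renamed_cols is made up for illustration, and it relies on DataFrame.set_axis returning a new object when called as shown.

```python
import pandas as pd


def renamed_cols(df: pd.DataFrame) -> pd.DataFrame:
    # set_axis returns a new DataFrame instead of relabelling `df` itself.
    return df.set_axis([f"mod_{c}" for c in df.columns], axis=1)


src_df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
new_df = renamed_cols(src_df)
print(list(src_df.columns))
print(list(new_df.columns))
```

Here src_df keeps its original labels a and b, while new_df carries the mod_ prefixed ones.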

sbrugman · 2021-10-04T19:55:20Z

@JPvRiel Your isinstance check could be a step in the right direction.

Here the index is added to the columns in case it's not a single-column increasing index:
https://github.com/pandas-profiling/pandas-profiling/blob/develop/src/pandas_profiling/model/pandas/dataframe_pandas.py#L33

Indeed, avoiding side effects is better in many cases. The other side of the coin is that keeping a copy of the dataframe in memory is often not feasible. Currently, there is no guarantee that the DataFrame is left unmutated. However, the user can simply pass a copy of the dataframe to be profiled.
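
One possible middle ground between mutating the input and duplicating it is a shallow copy, which creates a new DataFrame object without copying the column data. The sketch below assumes this is acceptable for the preprocessing step; with_string_columns is a hypothetical name, not an existing pandas-profiling function:

```python
import pandas as pd


def with_string_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a shallow copy of `df` whose column labels are strings."""
    # deep=False creates a new DataFrame object without duplicating the
    # underlying column data, so the extra memory cost stays small.
    view = df.copy(deep=False)
    # Assigning to .columns replaces the labels on the shallow copy only.
    view.columns = [str(c) for c in view.columns]
    return view


src_df = pd.DataFrame({1: [1, 2], 2: [3, 4]})
out_df = with_string_columns(src_df)
print(list(src_df.columns))
print(list(out_df.columns))
```

The caller's dataframe keeps its integer labels, while the returned object has string labels. Whether writes to the values of a shallow copy reach the original depends on the pandas version and its copy-on-write settings, so this sketch only relabels and never writes values.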

sbrugman added the bug 🐛 (Something isn't working) and Hacktoberfest 🎆 (https://hacktoberfest.digitalocean.com/) labels on Oct 3, 2021
