Building column partitions discards category dtype #2513

Open
dchigarev opened this issue Dec 4, 2020 · 1 comment
Labels
bug 🦗 Something isn't working P2 Minor bugs or low-priority feature requests pandas concordance 🐼 Functionality that does not match pandas pandas.dataframe Related to pandas.dataframe module

Comments

@dchigarev
Collaborator

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Any
  • Modin version (modin.__version__): c2e7f9e
  • Python version: 3.7.5
  • Code we can use to reproduce:
import os

# Setting Python engine to be able to see the printed output inside `applier`
os.environ["MODIN_ENGINE"] = "python"

import modin.pandas as pd
import numpy as np


def applier(df, other=None):
    print(f"df dtypes inside 'applier':\n{df.dtypes}\n")
    if other is not None:
        print(f"'other' dtypes inside 'applier':\n{other.dtypes}\n")


data = {f"col{i}": np.arange(256) for i in range(6)}
md_df = pd.DataFrame(data).astype({"col0": "category"})
other = md_df[["col1", "col2"]].astype({"col1": "category"})

print(f"proper 'df' dtypes:\n{md_df.dtypes}")
print(f"proper 'other' dtypes:\n{other.dtypes}\n")
md_df._query_compiler._modin_frame._apply_full_axis(axis=0, func=applier)
md_df._query_compiler._modin_frame.broadcast_apply_full_axis(
    axis=0, func=applier, other=other._query_compiler._modin_frame
)
Output
proper 'df' dtypes:
col0    category
col1       int64
col2       int64
col3       int64
col4       int64
col5       int64
dtype: object
proper 'other' dtypes:
col1    category
col2       int64
dtype: object

df dtypes inside 'applier':
col0    int64
col1    int64
col2    int64
col3    int64
col4    int64
col5    int64
dtype: object

df dtypes inside 'applier':
col0    int64
col1    int64
col2    int64
col3    int64
col4    int64
col5    int64
dtype: object

'other' dtypes inside 'applier':
col1    int64
col2    int64
dtype: object

Describe the problem

The problem is in how we build column partitions and the broadcasted frame inside deploy_axis_func and deploy_func_between_two_axis_partitions: we simply concatenate the partitions, and pandas.concat discards the category dtype when the partitions' categories differ:

>>> pd_df1
  col0  col1
0    a     0
1    a     1
>>> pd_df2
  col0  col1
0    b     2
1    a     3
>>> pd_df1.dtypes
col0    category
col1       int64
dtype: object
>>> pd_df2.dtypes
col0    category
col1       int64
dtype: object
>>> pandas.concat([pd_df1, pd_df2]).dtypes
col0    object  # category dtype was discarded
col1     int64
dtype: object

To preserve categories, we can use the same approach that to_pandas uses in concatenate: unioning the categories. Unioning categories slows down the building of column partitions by about 30% for frames that contain categorical columns. However, if the frame's dtypes are already computed, we can pass them into our concatenate function and reuse the already-unioned categories, which should not cause a noticeable slowdown.
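
A minimal sketch of that idea in plain pandas, not the actual Modin implementation. The helper name `concat_preserving_categories` and its `dtypes` parameter are hypothetical and exist only for illustration. It assumes all frames share the same unique column names and that any precomputed categorical dtypes already cover every category present in the partitions:

```python
import pandas
from pandas.api.types import union_categoricals


def concat_preserving_categories(frames, dtypes=None):
    """Row-wise concat that keeps categorical columns (hypothetical helper).

    `dtypes` is an optional, already-computed Series of dtypes (like `df.dtypes`)
    whose categorical entries are assumed to cover every frame's categories.
    """
    if dtypes is not None:
        # Reuse categories that were already unioned elsewhere.
        unified = {
            col: dtype
            for col, dtype in dtypes.items()
            if isinstance(dtype, pandas.CategoricalDtype)
        }
    else:
        # Union the categories of every column that is categorical in all frames.
        unified = {
            col: union_categoricals([df[col] for df in frames]).dtype
            for col in frames[0].columns
            if all(isinstance(df[col].dtype, pandas.CategoricalDtype) for df in frames)
        }
    if unified:
        # Columns with identical categorical dtypes survive pandas.concat.
        frames = [df.astype(unified) for df in frames]
    return pandas.concat(frames)


pd_df1 = pandas.DataFrame({"col0": ["a", "a"], "col1": [0, 1]}).astype({"col0": "category"})
pd_df2 = pandas.DataFrame({"col0": ["b", "a"], "col1": [2, 3]}).astype({"col0": "category"})
result = concat_preserving_categories([pd_df1, pd_df2])
print(result.dtypes)
```

Passing `dtypes=result.dtypes` on a later call skips the union step, which corresponds to the cheaper path described above.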

@dchigarev dchigarev added the bug 🦗 Something isn't working label Dec 4, 2020
@pyrito
Collaborator

pyrito commented Aug 30, 2022

This problem still seems to exist on the latest master.

@pyrito pyrito added pandas concordance 🐼 Functionality that does not match pandas pandas.dataframe Related to pandas.dataframe module P2 Minor bugs or low-priority feature requests labels Aug 30, 2022