Inconsistent columns in groupby_reduce result #1731

@dchigarev

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • Modin version (modin.__version__): 0.7.3+200.g8e7a5682
  • Python version: 3.7.5
  • Code we can use to reproduce:
if __name__ == "__main__":
    import modin.pandas as pd
    import pandas
    import numpy as np

    data = {
        "col1": [0, 1, 2, 3],
        "col2": [4, 5, np.nan, 7],
        "col3": [np.nan, np.nan, 12, 10],
        "col4": [17, 13, 16, 15],
        "col5": [-4, -5, -6, -7],
    }

    md_df, pd_df = pd.DataFrame(data), pandas.DataFrame(data)

    groupby_kwargs = {"by": ["col5", "col4", "col1"], "as_index": False}

    md_result, pd_result = (
        md_df.groupby(**groupby_kwargs).any(),
        pd_df.groupby(**groupby_kwargs).any(),
    )

    print("pd_result:\n", pd_result, sep="")
    print("\nmd_result:\n", md_result, sep="")

    print("\npd_columns:", pd_result.columns)
    print("md_columns:", md_result.columns)
Output:
pd_result:
   col5  col4  col1   col2   col3
0    -7    15     3   True   True
1    -6    16     2  False   True
2    -5    13     1   True  False
3    -4    17     0   True  False

md_result:
   col5  col4  col1   col2   col3
0    -7    15     3   True   True
1    -6    16     2  False   True
2    -5    13     1   True  False
3    -4    17     0   True  False

pd_columns: Index(['col5', 'col4', 'col1', 'col2', 'col3'], dtype='object')
md_columns: Index(['col1', 'col2', 'col3', 'col4', 'col5'], dtype='object')

Describe the problem

The columns inside the partitions seem to be correct, but the columns of the DataFrame itself are not. That is also why the tests don't fail on this case: to_pandas, which df_equals uses, takes its information only from the partitions.
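To make the mismatch easier to spot, here is a minimal sketch of a position-by-position comparison of column labels. The helper name `column_mismatches` is hypothetical (it is not part of Modin or pandas), and the two label lists are copied by hand from the output above so the snippet runs without either library.

```python
def column_mismatches(expected, actual):
    """Return (position, expected_label, actual_label) for each position that differs.

    Hypothetical helper for illustration only; not a Modin or pandas API.
    """
    expected, actual = list(expected), list(actual)
    if len(expected) != len(actual):
        raise ValueError(
            f"column count differs: {len(expected)} != {len(actual)}"
        )
    return [
        (position, exp, act)
        for position, (exp, act) in enumerate(zip(expected, actual))
        if exp != act
    ]


# Labels copied from the pd_columns / md_columns lines reported above.
pd_columns = ["col5", "col4", "col1", "col2", "col3"]
md_columns = ["col1", "col2", "col3", "col4", "col5"]

mismatches = column_mismatches(pd_columns, md_columns)
for position, expected_label, actual_label in mismatches:
    print(f"position {position}: pandas={expected_label!r}, modin={actual_label!r}")
```

With the reproducer above, the same check could presumably be run as `column_mismatches(pd_result.columns, md_result.columns)`, since it reads the `columns` attribute of each result directly rather than going through a conversion to pandas.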

Labels: bug